Engineering at Scale

Google Spanner's Protobuf Integration: A Pleasant Surprise

One of the most pleasant surprises was discovering how well Google Spanner integrates with Protobuf. Unlike other databases, where you would serialize protos to bytes and lose all query capabilities, Spanner has native support for Protobuf columns. You can define them directly in your schema using the special PROTO type, and Spanner will understand the substructure of your protobuf column. Here is an example of a table with a PROTO column:

CREATE TABLE events (
  event_id STRING(36) NOT NULL,
  event_data PROTO<mypackage.EventPayload>
) PRIMARY KEY (event_id);

Before you can use your protobuf definitions in Spanner tables, you need to generate descriptor files from your .proto definitions and load them into the database. Spanner needs these descriptors to understand your protobuf schema, which is what makes the columns queryable. Here is how to generate the descriptor file using protoc:

protoc --descriptor_set_out=my_protos.pb \
       --include_imports \
       mypackage/events.proto

The --include_imports flag is important because it bundles all dependencies into the descriptor file. I also recommend appending a hash of all of your proto definitions to the descriptor file, so the two are tightly coupled and you can verify that the descriptor contains the right data.
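As a sketch of that coupling idea, the snippet below computes a single SHA-256 digest over a set of .proto sources in a deterministic order; you can then append the digest to the descriptor file (or store it alongside it) and recompute it later to check that descriptor and sources still match. The protoSourcesHash helper and the file contents are my own illustration, not part of protoc or Spanner:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// protoSourcesHash returns a hex-encoded SHA-256 digest over the given
// .proto sources. Files are hashed in sorted filename order so the
// result is deterministic no matter how the sources were collected.
func protoSourcesHash(sources map[string][]byte) string {
	names := make([]string, 0, len(sources))
	for name := range sources {
		names = append(names, name)
	}
	sort.Strings(names)

	h := sha256.New()
	for _, name := range names {
		h.Write([]byte(name)) // include the filename so renames change the digest
		h.Write(sources[name])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	// In a real build you would os.ReadFile each .proto source here.
	sources := map[string][]byte{
		"mypackage/events.proto": []byte(`syntax = "proto3";`),
	}
	fmt.Println(protoSourcesHash(sources))
}
```

Hashing the filenames along with the contents means a rename alone is enough to invalidate the digest, which is usually what you want for schema files.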

Once you have the descriptor file (my_protos.pb), you load it into Spanner using a DDL statement with a proto bundle:

CREATE PROTO BUNDLE (
  mypackage.EventPayload,
  mypackage.UserAction
)

As your protobuf definitions evolve, you can update the bundle with new versions. This is crucial for maintaining backward compatibility as you add new fields or message types. Reloading the same protobuf definition overwrites the old one automatically.

ALTER PROTO BUNDLE UPDATE (
  mypackage.EventPayload,
  mypackage.UserAction,
  mypackage.NewMessageType  -- Adding a new proto type
)

The PROTO type works particularly well with generated columns. For example, you can extract a timestamp from your protobuf object, making it indexable without duplicating data. Just remember that you can only index primitive or ENUM fields:

CREATE TABLE events (
  event_id STRING(36) NOT NULL,
  event_data PROTO<mypackage.EventPayload>,
  event_timestamp TIMESTAMP AS (event_data.timestamp) STORED
) PRIMARY KEY (event_id);

In practice, the right approach is to load your bundles through one of the database migration tools. In my last project, I used an internal schema migration tool, which I extended to accept bundle files as well. In Go, it would look something like this:

ctx := context.Background()
adminClient, err := database.NewDatabaseAdminClient(ctx)
if err != nil {
    return err
}

// First, read your generated descriptor file
descriptorBytes, err := os.ReadFile("my_protos.pb")
if err != nil {
    return err
}

// Load the proto bundle into Spanner
op, err := adminClient.UpdateDatabaseDdl(ctx, &adminpb.UpdateDatabaseDdlRequest{
    Database: "projects/my-project/instances/my-instance/databases/my-db",
    Statements: []string{
        "CREATE PROTO BUNDLE (" +
        "  mypackage.EventPayload," +
        "  mypackage.UserAction" +
        ")",
    },
    ProtoDescriptors: descriptorBytes,
})
if err != nil {
    return err
}
if err := op.Wait(ctx); err != nil {
    return err
}

Spanner lets you query into proto fields with SQL, and the capabilities are surprisingly flexible.

You can access nested fields directly like:

SELECT event_data.user_id FROM events

You can also filter on proto fields, or work with repeated fields using array operations:

SELECT * FROM events WHERE event_data.user_id = 'user123'
SELECT field FROM events, UNNEST(event_data.tags) AS field
 
Spanner even supports checking if optional fields are set and getting specific array elements:
 
SELECT event_data.has_metadata FROM events
SELECT event_data.actions[OFFSET(0)] FROM events
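From the Go client, these queries look like any other Spanner statement, and recent versions of the client can decode a PROTO column straight into the generated message type. A hedged sketch, reusing the hypothetical mypackage.EventPayload and the client from the earlier examples:

```go
stmt := spanner.Statement{
    SQL: `SELECT event_id, event_data FROM events
          WHERE event_data.user_id = @userId`,
    Params: map[string]interface{}{"userId": "user123"},
}

iter := client.Single().Query(ctx, stmt)
defer iter.Stop()
for {
    row, err := iter.Next()
    if err == iterator.Done {
        break
    }
    if err != nil {
        return err
    }

    var eventID string
    var payload mypackage.EventPayload
    // Scan the PROTO column directly into the generated Go struct;
    // no manual bytes-to-proto unmarshaling step is needed.
    if err := row.Columns(&eventID, &payload); err != nil {
        return err
    }
    fmt.Println(eventID, payload.GetAction())
}
```

The nice part is that the proto filter runs server-side, so only matching rows cross the wire.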

The general recommendation is to do all protobuf object construction in your application code, where you have type safety and proper tooling, then pass complete objects to Spanner.

event := &mypackage.EventPayload{
    UserId: "user123",
    Timestamp: timestamppb.Now(),
    Action: "clicked_button",
    Metadata: &mypackage.Metadata{
        Source: "web_app",
        Version: "2.1.0",
    },
}

_, err := client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
    stmt := spanner.Statement{
        SQL: "INSERT INTO events (event_id, event_data) VALUES (@eventId, @eventData)",
        Params: map[string]interface{}{
            "eventId":   uuid.New().String(),
            "eventData": event, // Pass the complete proto directly
        },
    }
    _, err := txn.Update(ctx, stmt)
    return err
})

Spanner supports updating different parts of your protobuf directly through SQL and dot notation:

_, err = client.ReadWriteTransaction(ctx, func(ctx context.Context, txn *spanner.ReadWriteTransaction) error {
    stmt := spanner.Statement{
        SQL: `UPDATE events SET event_data.metadata.source = @newSource WHERE event_id = @eventId`,
        Params: map[string]interface{}{
            "newSource": "mobile",
            "eventId":   "321", // event_id is a STRING(36) column
        },
    }

    rowCount, err := txn.Update(ctx, stmt)
    if err != nil {
        return err
    }
    log.Printf("updated %d row(s)", rowCount)
    return nil
})

All in all, Google Spanner is now my top choice for storing events when building event-driven architectures.

If you want to read more on this topic, check out Google's official Spanner/Protobuf reference.

There are also plenty of good examples of how to query Spanner in the official documentation.

Have fun hacking!