AWS Big Data Blog
Introducing Protocol buffers (protobuf) schema support in AWS Glue Schema Registry
AWS Glue Schema Registry now supports Protocol buffers (protobuf) schemas in addition to JSON and Avro schemas. This allows application teams to use protobuf schemas to govern the evolution of streaming data and centrally control data quality from data streams to data lake. AWS Glue Schema Registry provides an open-source library that includes Apache-licensed serializers and deserializers for protobuf that integrate with Java applications developed for Apache Kafka, Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Kinesis Data Streams, and Kafka Streams. Similar to Avro and JSON schemas, Protocol buffers schemas also support compatibility modes, schema sourcing via metadata, auto-registration of schemas, and AWS Identity and Access Management (IAM) compatibility.
In this post, we focus on Protocol buffers schema support in AWS Glue Schema Registry and how to use Protocol buffers schemas in stream processing Java applications that integrate with Apache Kafka, Amazon Managed Streaming for Apache Kafka (Amazon MSK), and Amazon Kinesis Data Streams.
Introduction to Protocol buffers
Protocol buffers is a language-neutral and platform-neutral, extensible mechanism for serializing and deserializing structured data for use in communications protocols and data storage. A protobuf message format is defined in a .proto file. Protobuf is recommended over other data formats when you need language interoperability, faster serialization and deserialization, type safety, schema adherence between data producer and consumer applications, and reduced coding effort. With protobuf, you can use code generated from the schema by the protobuf compiler (protoc) to easily write and read your data to and from data streams using a variety of languages. You can also use build tool plugins such as Maven and Gradle to generate code from protobuf schemas as part of your CI/CD pipelines. We use the following schema for the code examples in this post, which defines an employee with a gRPC service definition to find an employee by ID:
Employee.proto
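The following is a minimal proto2 sketch consistent with the description below; the package name, field names, and field numbers are illustrative assumptions, not the exact schema from the original post.

```proto
syntax = "proto2";

// The package declaration namespaces the generated code to avoid collisions.
package com.example.employee;

import "google/protobuf/timestamp.proto";
import "google/protobuf/duration.proto";
import "google/protobuf/wrappers.proto";

// gRPC service definition to find an employee by ID.
service EmployeeSearch {
  rpc FindEmployee(google.protobuf.Int32Value) returns (Employee);
}

// Enumeration type.
enum Role {
  MANAGER = 0;
  DEVELOPER = 1;
  ARCHITECT = 2;
}

// Composite (message) types.
message Project {
  required string name = 1;
  optional string state = 2;
}

message Team {
  required string name = 1;
  optional string location = 2;
  repeated Project project = 3;
}

message Employee {
  // Each field has a unique number that identifies it in the binary format.
  required int32 id = 1;
  required string name = 2;
  optional string address = 3;
  optional Role role = 4;                               // enumeration type
  optional Team team = 5;                               // composite type
  optional google.protobuf.Int32Value age = 6;          // nullable type
  optional google.protobuf.BoolValue is_certified = 7;  // nullable type
  optional google.protobuf.Timestamp start_date = 8;    // common type
  optional google.protobuf.Duration tenure = 9;         // common type
}
```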
AWS Glue Schema Registry supports both proto2 and proto3 syntax. The preceding protobuf schema, which uses proto2 syntax, contains three message types: Employee, Team, and Project, using scalar, composite, and enumeration data types. Each field in the message definitions has a unique number, which is used to identify fields in the message binary format and should not be changed after your message type is in use. In a proto2 message, a field can be required, optional, or repeated; in proto3, the options are repeated and optional. The package declaration makes sure generated code is namespaced to avoid any collisions. In addition to scalar, composite, and enumeration types, AWS Glue Schema Registry also supports protobuf schemas with common types such as Money, PhoneNumber, Timestamp, and Duration, and nullable types such as BoolValue and Int32Value. It also supports protobuf schemas with gRPC service definitions, such as EmployeeSearch in the preceding schema, with compatibility rules. To learn more about Protocol buffers, refer to its documentation.
Supported Protocol buffers specification and features
AWS Glue Schema Registry supports all the features of Protocol buffers for versions 2 and 3 except for groups, extensions, and importing definitions. The AWS Glue Schema Registry APIs and its open-source library support the latest protobuf runtime version. The protobuf schema operations in AWS Glue Schema Registry are supported via the AWS Management Console, AWS Command Line Interface (AWS CLI), AWS Glue Schema Registry API, AWS SDK, and AWS CloudFormation.
How AWS Glue Schema Registry works
The following diagram illustrates a high-level view of how AWS Glue Schema Registry works. AWS Glue Schema Registry allows you to register and evolve JSON, Apache Avro, and Protocol buffers schemas with compatibility modes. You can register multiple versions of each schema as the business needs or stream processing application’s requirements evolve. The AWS Glue Schema Registry open-source library provides JSON, Avro, and protobuf serializers and deserializers that you configure in producer and consumer stream processing applications, as shown in the following diagram. The open-source library also supports optional compression and caching configuration to save on data transfers.
To accommodate various business use cases, AWS Glue Schema Registry supports multiple compatibility modes. For example, if a consumer application is updated to a new schema version but can still consume and process messages based on the previous version of the same schema, the schema is backward-compatible. If the producer application moves to a new schema version and a consumer application that has not been updated yet can still consume and process both old and new messages, the schema is forward-compatible. For more information, refer to How the Schema Registry Works.
Create a Protocol buffers schema in AWS Glue Schema Registry
In this section, we create a protobuf schema in AWS Glue Schema Registry via the console and AWS CLI.
Create a schema via the console
Make sure you have the required AWS Glue Schema Registry IAM permissions.
- On the AWS Glue console, choose Schema registries in the navigation pane.
- Click Add registry.
- For Registry name, enter employee-schema-registry.
- Click Add registry.
- After the registry is created, click Add schema to register a new schema.
- For Schema name, enter Employee.proto.
The schema name must be either Employee.proto or Employee if the protobuf schema doesn't set the options java_multiple_files = true; and java_outer_classname = "<Outer class name>"; and if you decide to use protobuf schema-generated code (POJOs) in your stream processing applications. We cover this with an example in a subsequent section of this post. For more information on protobuf options, refer to Options.
- For Registry, choose the registry employee-schema-registry.
- For Data format, choose Protocol buffers.
- For Compatibility mode, choose Backward.
You can choose other compatibility modes as per your use case.
- For First schema version, enter the preceding protobuf schema, then click Create schema and version.
After the schema is registered successfully, its status will be Available, as shown in the following screenshot.
Create a schema via the AWS CLI
Make sure you have IAM credentials with AWS Glue Schema Registry permissions.
- Run the following AWS CLI command to create a schema registry employee-schema-registry (for this post, we use the Region us-east-2):
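A sketch of the command, using the registry name and Region from this post (add a description or tags as needed):

```bash
aws glue create-registry \
    --registry-name employee-schema-registry \
    --region us-east-2
```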
The AWS CLI command returns the newly created schema registry ARN in response.
- Copy the RegistryArn value from the response to use in the following AWS CLI command.
- In the following command, use the preceding protobuf schema and the schema name Employee.proto:
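A sketch of the command, assuming the schema definition is saved locally as Employee.proto and that <registry-arn> is replaced with the RegistryArn value from the previous step:

```bash
aws glue create-schema \
    --schema-name Employee.proto \
    --registry-id RegistryArn="<registry-arn>" \
    --data-format PROTOBUF \
    --compatibility BACKWARD \
    --schema-definition file://Employee.proto \
    --region us-east-2
```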
You can also use AWS CloudFormation to create schemas in AWS Glue Schema Registry.
Using a Protocol buffers schema with Amazon MSK and Kinesis Data Streams
Like Apache Avro’s SpecificRecord and GenericRecord, protobuf also supports working with POJOs to ensure type safety and with DynamicMessage to create generic data producer and consumer applications. The following examples showcase the use of a protobuf schema registered in AWS Glue Schema Registry with Kafka and Kinesis Data Streams producer and consumer applications.
Use a protobuf schema with Amazon MSK
Create an Amazon MSK or Apache Kafka cluster with a topic called protobuf-demo-topic. If creating an Amazon MSK cluster, you can use the console. For instructions, refer to Getting Started Using Amazon MSK.
Use protobuf schema-generated POJOs
To use protobuf schema-generated POJOs, complete the following steps:
- Install the protobuf compiler (protoc) on your local machine from GitHub and add it to the PATH variable.
- Add the following plugin configuration to your application's pom.xml file. We use the xolstice protobuf Maven plugin for this post to generate code from the protobuf schema.
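The plugin configuration might look like the following sketch; the plugin version and the <protocExecutable> setting (which points at the locally installed protoc) are assumptions you can adjust:

```xml
<plugin>
    <groupId>org.xolstice.maven.plugins</groupId>
    <artifactId>protobuf-maven-plugin</artifactId>
    <version>0.6.1</version>
    <configuration>
        <!-- Directory that contains Employee.proto -->
        <protoSourceRoot>${basedir}/src/main/resources/proto</protoSourceRoot>
        <!-- Use the locally installed protoc added to PATH in the previous step -->
        <protocExecutable>protoc</protocExecutable>
    </configuration>
    <executions>
        <execution>
            <goals>
                <goal>compile</goal>
            </goals>
        </execution>
    </executions>
</plugin>
```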
- Add the following dependencies to your application’s pom.xml file:
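At a minimum, the Kafka application needs the AWS Glue Schema Registry serde library, the protobuf runtime, and the Kafka client, roughly as in the following sketch (the versions shown are illustrative assumptions):

```xml
<dependency>
    <groupId>software.amazon.glue</groupId>
    <artifactId>schema-registry-serde</artifactId>
    <version>1.1.15</version>
</dependency>
<dependency>
    <groupId>com.google.protobuf</groupId>
    <artifactId>protobuf-java</artifactId>
    <version>3.21.12</version>
</dependency>
<dependency>
    <groupId>org.apache.kafka</groupId>
    <artifactId>kafka-clients</artifactId>
    <version>2.8.1</version>
</dependency>
```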
- Create a schema registry employee-schema-registry in AWS Glue Schema Registry and register the Employee.proto protobuf schema with it. Name your schema Employee.proto (or Employee).
- Run the following command to generate the code from Employee.proto. Make sure you have the schema file in the ${basedir}/src/main/resources/proto directory or change it as per your application directory structure in the application's pom.xml <protoSourceRoot> tag value:
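Because the xolstice plugin binds its compile goal to the Maven build, a standard build generates the POJOs; for example:

```bash
# Generated sources typically land under target/generated-sources/protobuf
mvn clean compile
```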
Next, we configure the Kafka producer publishing protobuf messages to the Kafka topic on Amazon MSK.
- Configure the Kafka producer properties:
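A minimal sketch of the producer properties, assuming the MSK bootstrap broker string is available as bootstrapServers and using the registry, schema, and Region values from this post:

```java
// Key classes: GlueSchemaRegistryKafkaSerializer, AWSSchemaRegistryConstants,
// ProtobufMessageType (schema-registry-serde), and the Glue DataFormat enum (AWS SDK).
private static Properties producerProperties(String bootstrapServers) {
    Properties properties = new Properties();
    properties.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    // AWS Glue Schema Registry serializer handles the protobuf record value
    properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG,
            GlueSchemaRegistryKafkaSerializer.class.getName());
    properties.put(AWSSchemaRegistryConstants.DATA_FORMAT, DataFormat.PROTOBUF.name());
    properties.put(AWSSchemaRegistryConstants.AWS_REGION, "us-east-2");
    properties.put(AWSSchemaRegistryConstants.REGISTRY_NAME, "employee-schema-registry");
    properties.put(AWSSchemaRegistryConstants.SCHEMA_NAME, "Employee.proto");
    // Work with schema-generated POJOs (as opposed to DynamicMessage)
    properties.put(AWSSchemaRegistryConstants.PROTOBUF_MESSAGE_TYPE, ProtobufMessageType.POJO.getName());
    return properties;
}
```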
The VALUE_SERIALIZER_CLASS_CONFIG configuration specifies the AWS Glue Schema Registry serializer, which serializes the protobuf message.
- Use the schema-generated code (POJOs) to create a protobuf message:
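A sketch based on the schema shown earlier; because that schema sets neither java_multiple_files nor java_outer_classname, protoc nests the generated messages in an EmployeeOuterClass outer class (the field values are arbitrary examples):

```java
EmployeeOuterClass.Employee employee = EmployeeOuterClass.Employee.newBuilder()
        .setId(101)
        .setName("Jane Doe")
        .setAddress("123 Any Street, Anytown, USA")
        .setTeam(EmployeeOuterClass.Team.newBuilder()
                .setName("Data Platform")
                .setLocation("Seattle")
                .addProject(EmployeeOuterClass.Project.newBuilder()
                        .setName("schema-registry-migration")
                        .setState("IN_PROGRESS")))
        .setAge(Int32Value.of(30))
        .build();
```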
- Publish the protobuf messages to the protobuf-demo-topic topic on Amazon MSK.
- Start the Kafka producer:
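A condensed sketch of the two steps above; producerProperties and createEmployee are the helpers sketched in the previous steps, and bootstrapServers is assumed to hold the MSK broker string:

```java
// The configured serializer registers or looks up the schema in AWS Glue Schema Registry
// and encodes each record value as protobuf.
try (KafkaProducer<String, EmployeeOuterClass.Employee> producer =
             new KafkaProducer<>(producerProperties(bootstrapServers))) {
    for (int i = 0; i < 10; i++) {
        EmployeeOuterClass.Employee employee = createEmployee(i); // hypothetical helper building the POJO
        producer.send(new ProducerRecord<>("protobuf-demo-topic", String.valueOf(i), employee),
                (metadata, exception) -> {
                    if (exception != null) {
                        exception.printStackTrace();
                    }
                });
    }
    producer.flush();
}
```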
- In the Kafka consumer application’s pom.xml, add the same plugin and dependencies as the Kafka producer’s pom.xml.
Next, we configure the Kafka consumer consuming protobuf messages from the Kafka topic on Amazon MSK.
- Configure the Kafka consumer properties:
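A minimal sketch of the consumer properties, mirroring the producer configuration and again assuming bootstrapServers holds the MSK broker string:

```java
private static Properties consumerProperties(String bootstrapServers) {
    Properties properties = new Properties();
    properties.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, bootstrapServers);
    properties.put(ConsumerConfig.GROUP_ID_CONFIG, "protobuf-demo-consumer");
    properties.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
    properties.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
    // AWS Glue Schema Registry deserializer handles the protobuf record value
    properties.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG,
            GlueSchemaRegistryKafkaDeserializer.class.getName());
    properties.put(AWSSchemaRegistryConstants.AWS_REGION, "us-east-2");
    // Return schema-generated POJOs rather than DynamicMessage objects
    properties.put(AWSSchemaRegistryConstants.PROTOBUF_MESSAGE_TYPE, ProtobufMessageType.POJO.getName());
    return properties;
}
```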
The VALUE_DESERIALIZER_CLASS_CONFIG configuration specifies the AWS Glue Schema Registry deserializer that deserializes the protobuf messages.
- Consume the protobuf message (as a POJO) from the protobuf-demo-topic topic on Amazon MSK.
- Start the Kafka consumer:
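A condensed sketch of the two steps above, polling the topic and handling each record as the schema-generated POJO:

```java
try (KafkaConsumer<String, EmployeeOuterClass.Employee> consumer =
             new KafkaConsumer<>(consumerProperties(bootstrapServers))) {
    consumer.subscribe(Collections.singletonList("protobuf-demo-topic"));
    while (true) {
        ConsumerRecords<String, EmployeeOuterClass.Employee> records =
                consumer.poll(Duration.ofMillis(1000));
        for (ConsumerRecord<String, EmployeeOuterClass.Employee> record : records) {
            EmployeeOuterClass.Employee employee = record.value();
            System.out.println("Received employee: " + employee.getName());
        }
    }
}
```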
Use protobuf’s DynamicMessage
You can use DynamicMessage to create generic producer and consumer applications without generating the code from the protobuf schema. To use DynamicMessage, you first need to create a protobuf schema file descriptor.
- Generate a file descriptor from the protobuf schema using the following command:
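A sketch of the command, assuming Employee.proto sits in a local proto directory and the descriptor file is named Employeeproto.desc as in the later steps:

```bash
# Writes a serialized FileDescriptorSet (including imported definitions) to Employeeproto.desc
protoc --proto_path=proto --include_imports --descriptor_set_out=Employeeproto.desc proto/Employee.proto
```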
The --descriptor_set_out option specifies the name of the descriptor file that this command generates. The protobuf schema Employee.proto is in the proto directory.
- Make sure you have created a schema registry and registered the preceding protobuf schema with it.
Now we configure the Kafka producer publishing DynamicMessage to the Kafka topic on Amazon MSK.
- Create the Kafka producer configuration. The PROTOBUF_MESSAGE_TYPE configuration is DYNAMIC_MESSAGE instead of POJO.
- Create protobuf dynamic messages and publish them to the Kafka topic on Amazon MSK:
- Create a descriptor using the Employeeproto.desc file that we generated from the Employee.proto schema file in the previous steps.
- Start the Kafka producer:
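A condensed sketch of these steps: the producer configuration reuses the earlier properties with the message type switched to DYNAMIC_MESSAGE, the descriptor is parsed from Employeeproto.desc, and a DynamicMessage is built and published. Field names follow the schema sketch shown earlier, and the descriptor handling assumes protoc wrote the descriptor set with --include_imports:

```java
// Parse the descriptor set written by protoc; with --include_imports, dependencies
// appear before the files that import them, so they can be built in order.
private static Descriptors.Descriptor employeeDescriptor() throws Exception {
    DescriptorProtos.FileDescriptorSet descriptorSet;
    try (FileInputStream fis = new FileInputStream("Employeeproto.desc")) {
        descriptorSet = DescriptorProtos.FileDescriptorSet.parseFrom(fis);
    }
    Map<String, Descriptors.FileDescriptor> built = new HashMap<>();
    for (DescriptorProtos.FileDescriptorProto fdProto : descriptorSet.getFileList()) {
        Descriptors.FileDescriptor[] deps = fdProto.getDependencyList().stream()
                .map(built::get)
                .toArray(Descriptors.FileDescriptor[]::new);
        built.put(fdProto.getName(), Descriptors.FileDescriptor.buildFrom(fdProto, deps));
    }
    return built.get("Employee.proto").findMessageTypeByName("Employee");
}

// Producer properties match the POJO example, except for the protobuf message type.
Properties properties = producerProperties(bootstrapServers);
properties.put(AWSSchemaRegistryConstants.PROTOBUF_MESSAGE_TYPE,
        ProtobufMessageType.DYNAMIC_MESSAGE.getName());

// Build a DynamicMessage without any schema-generated code.
Descriptors.Descriptor descriptor = employeeDescriptor();
DynamicMessage employee = DynamicMessage.newBuilder(descriptor)
        .setField(descriptor.findFieldByName("id"), 101)
        .setField(descriptor.findFieldByName("name"), "Jane Doe")
        .build();

// Publish the dynamic message and start the producer, as in the POJO example.
try (KafkaProducer<String, DynamicMessage> producer = new KafkaProducer<>(properties)) {
    producer.send(new ProducerRecord<>("protobuf-demo-topic", "101", employee));
    producer.flush();
}
```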
Now we configure the Kafka consumer consuming dynamic messages from the Kafka topic on Amazon MSK.
- Enter the following Kafka consumer configuration:
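The consumer configuration mirrors the earlier POJO consumer sketch, with the protobuf message type switched so the deserializer returns DynamicMessage objects:

```java
Properties properties = consumerProperties(bootstrapServers);
// Return DynamicMessage objects instead of schema-generated POJOs
properties.put(AWSSchemaRegistryConstants.PROTOBUF_MESSAGE_TYPE,
        ProtobufMessageType.DYNAMIC_MESSAGE.getName());
```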
- Consume protobuf dynamic messages from the Kafka topic protobuf-demo-topic. Because we're using DYNAMIC_MESSAGE, the retrieved objects are of type DynamicMessage.
- Start the Kafka consumer:
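A condensed sketch of the two steps above; fields are read through the message descriptor rather than through generated getters:

```java
try (KafkaConsumer<String, DynamicMessage> consumer = new KafkaConsumer<>(properties)) {
    consumer.subscribe(Collections.singletonList("protobuf-demo-topic"));
    while (true) {
        for (ConsumerRecord<String, DynamicMessage> record : consumer.poll(Duration.ofMillis(1000))) {
            DynamicMessage employee = record.value();
            Descriptors.FieldDescriptor nameField =
                    employee.getDescriptorForType().findFieldByName("name");
            System.out.println("Received employee: " + employee.getField(nameField));
        }
    }
}
```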
Use a protobuf schema with Kinesis Data Streams
You can use the protobuf schema-generated POJOs with the Kinesis Producer Library (KPL) and Kinesis Client Library (KCL).
- Install the protobuf compiler (protoc) on your local machine from GitHub and add it to the PATH variable.
- Add the xolstice protobuf Maven plugin configuration shown earlier to your application's pom.xml file to generate code from the protobuf schema.
- Because the latest KPL and KCL versions include the AWS Glue Schema Registry open-source library (schema-registry-serde) and the protobuf runtime (protobuf-java), you only need to add the KPL and KCL dependencies to your application's pom.xml (see the sketch after these steps).
- Create a schema registry employee-schema-registry and register the Employee.proto protobuf schema with it. Name your schema Employee.proto (or Employee).
- Run the same Maven command shown earlier to generate the code from Employee.proto. Make sure you have the schema file in the ${basedir}/src/main/resources/proto directory or change it as per your application directory structure in the application's pom.xml <protoSourceRoot> tag value.
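The dependency additions referenced in the first step might look like the following sketch (the versions shown are illustrative assumptions):

```xml
<dependency>
    <groupId>com.amazonaws</groupId>
    <artifactId>amazon-kinesis-producer</artifactId>
    <version>0.14.12</version>
</dependency>
<dependency>
    <groupId>software.amazon.kinesis</groupId>
    <artifactId>amazon-kinesis-client</artifactId>
    <version>2.4.1</version>
</dependency>
```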
The following Kinesis producer code with the KPL uses the Schema Registry open-source library to publish protobuf messages to Kinesis Data Streams; a condensed sketch follows the steps below.
- Start the Kinesis Data Streams producer:
- Configure the Kinesis producer:
- Create a protobuf message using schema-generated code (POJOs):
- Create the schema definition from Employee.proto:
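The following condensed sketch covers these steps under a few assumptions: the KPL version in use supports the Glue Schema Registry integration (setGlueSchemaRegistryConfiguration and the addUserRecord overload that accepts a Schema), the stream is named protobuf-demo-stream, and the schema definition is read from the Employee.proto file in the project's resources:

```java
// Configure the KPL with a Glue Schema Registry configuration for the Region used in this post.
GlueSchemaRegistryConfiguration gsrConfig = new GlueSchemaRegistryConfiguration("us-east-2");
KinesisProducerConfiguration producerConfig = new KinesisProducerConfiguration()
        .setRegion("us-east-2")
        .setGlueSchemaRegistryConfiguration(gsrConfig);
KinesisProducer kinesisProducer = new KinesisProducer(producerConfig);

// Create a protobuf message with the schema-generated POJO, as in the Kafka example.
EmployeeOuterClass.Employee employee = EmployeeOuterClass.Employee.newBuilder()
        .setId(101)
        .setName("Jane Doe")
        .build();

// Create the schema definition (com.amazonaws.services.schemaregistry.common.Schema) from Employee.proto.
String schemaDefinition = new String(
        Files.readAllBytes(Paths.get("src/main/resources/proto/Employee.proto")),
        StandardCharsets.UTF_8);
Schema gsrSchema = new Schema(schemaDefinition, DataFormat.PROTOBUF.name(), "Employee.proto");

// Publish the record with its schema; the KPL attaches the Glue Schema Registry header.
kinesisProducer.addUserRecord("protobuf-demo-stream",
        String.valueOf(employee.getId()),
        null,
        ByteBuffer.wrap(employee.toByteArray()),
        gsrSchema);
kinesisProducer.flushSync();
```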
The following is the Kinesis consumer code with the KCL using the Schema Registry open-source library to consume protobuf messages from Kinesis Data Streams; a condensed sketch follows the steps below.
- Initialize the application:
- Consume protobuf messages from Kinesis Data Streams:
- Start the Kinesis Data Streams consumer:
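A heavily condensed sketch of the KCL 2.x wiring, assuming the standard configsBuilder and ShardRecordProcessor scaffolding of a KCL application are already in place; the Glue Schema Registry deserializer is attached to the retrieval config, and each record payload is parsed with the schema-generated POJO:

```java
// Attach the Glue Schema Registry deserializer so the schema header on each record is handled.
GlueSchemaRegistryConfiguration gsrConfig = new GlueSchemaRegistryConfiguration("us-east-2");
RetrievalConfig retrievalConfig = configsBuilder.retrievalConfig()
        .glueSchemaRegistryDeserializer(
                new GlueSchemaRegistryDeserializerImpl(DefaultCredentialsProvider.create(), gsrConfig));

// Inside the ShardRecordProcessor implementation:
@Override
public void processRecords(ProcessRecordsInput processRecordsInput) {
    for (KinesisClientRecord record : processRecordsInput.records()) {
        try {
            byte[] payload = new byte[record.data().remaining()];
            record.data().get(payload);
            EmployeeOuterClass.Employee employee = EmployeeOuterClass.Employee.parseFrom(payload);
            System.out.println("Received employee: " + employee.getName());
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```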
Enhance your protobuf schema
We covered examples of data producer and consumer applications integrating with Amazon MSK, Apache Kafka, and Kinesis Data Streams, and using a Protocol buffers schema registered with AWS Glue Schema Registry. You can further enhance these examples with schema evolution using the compatibility rules supported by AWS Glue Schema Registry. For example, the following protobuf schema is a backward-compatible updated version of Employee.proto. We have added another gRPC service definition, CreateEmployee, under EmployeeSearch and added an optional field in the Employee message type. If you upgrade the consumer application with this version of the protobuf schema, the consumer application can still consume old and new protobuf messages.
Employee.proto (version-2)
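A sketch of the changed parts, consistent with the schema sketch shown earlier (unchanged message types and the enum are omitted for brevity; the new field name is an illustrative assumption):

```proto
service EmployeeSearch {
  rpc FindEmployee(google.protobuf.Int32Value) returns (Employee);
  // New gRPC service definition added in version 2.
  rpc CreateEmployee(Employee) returns (Employee);
}

message Employee {
  required int32 id = 1;
  required string name = 2;
  optional string address = 3;
  optional Role role = 4;
  optional Team team = 5;
  optional google.protobuf.Int32Value age = 6;
  optional google.protobuf.BoolValue is_certified = 7;
  optional google.protobuf.Timestamp start_date = 8;
  optional google.protobuf.Duration tenure = 9;
  // New optional field with a new field number; this is a backward-compatible change.
  optional string email = 10;
}
```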
Conclusion
In this post, we introduced Protocol buffers schema support in AWS Glue Schema Registry. AWS Glue Schema Registry now supports Apache Avro, JSON, and Protocol buffers schemas with different compatibility modes. The examples in this post demonstrated how to use Protocol buffers schemas registered with AWS Glue Schema Registry in stream processing applications integrated with Apache Kafka, Amazon MSK, and Kinesis Data Streams. We used schema-generated POJOs for type safety and protobuf's DynamicMessage to create generic producer and consumer applications. The examples in this post contain the basic components of the stream processing pattern; you can adapt these examples to your use case needs.
To learn more, refer to the following resources:
- How the Glue Schema Registry works
- Validate, evolve, and control schemas in Amazon MSK and Amazon Kinesis Data Streams with AWS Glue Schema Registry
- Validate streaming data over Amazon MSK using schemas in cross-account AWS Glue Schema Registry
- Evolve JSON Schemas in Amazon MSK and Amazon Kinesis Data Streams with the AWS Glue Schema Registry
- Protocol buffers
About the Author
Vikas Bajaj is a Principal Solutions Architect at AWS. Vikas works with digital native customers and advises them on technology architecture and solutions to meet strategic business objectives.