AWS Developer Blog

Introducing support for Amazon S3 Select in the AWS SDK for JavaScript

We’re excited to announce support for the Amazon Simple Storage Service (Amazon S3) selectObjectContent API with event streams in the AWS SDK for JavaScript. Using Amazon S3 Select, you can query for a subset of data from an S3 object by using simple SQL expressions.

Amazon S3 streams the responses as a series of events, instead of returning the full API response all at once. This enables your applications to process the parts of the response as the application receives them. To support this new API behavior, the AWS SDK for JavaScript supports processing these events asynchronously in Node.js, without requiring your application to wait for the full response. In browser or React Native runtime environments, these events are processed after the full response is received.

Using Amazon S3 Select to query an object

Amazon S3 Select enables you to query an object that contains CSV-formatted or JSON-formatted data with simple SQL expressions. For our example, let’s use a CSV file that’s uploaded as an S3 object with the key target-file.csv, in the bucket named my-bucket in the us-west-2 AWS Region.

The CSV file contains a comma-delimited list of user names and ages.

user_name,age
jsrocks,13
node4life,22
esfuture,29
...

With this CSV file, we want our application to select only users with an age greater than 20. To do this, we write a SQL expression like the following, which selects the user_name field for those users. Because Amazon S3 Select reads CSV fields as strings, the expression casts age to an integer before comparing it.

SELECT user_name FROM S3Object WHERE cast(age as int) > 20
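To make the meaning of this expression concrete, here is a rough plain-JavaScript equivalent applied to the three sample rows shown earlier. This is only an illustration: Amazon S3 Select evaluates the expression on the service side, and the selectUserNames function and sampleCsv constant are hypothetical names that are not part of the SDK.

```javascript
// Illustration only: S3 Select runs the SQL expression on the service side.
// sampleCsv and selectUserNames are hypothetical names for this sketch.
const sampleCsv = 'user_name,age\njsrocks,13\nnode4life,22\nesfuture,29';

function selectUserNames(csv, minAge) {
	const [header, ...rows] = csv.split('\n');
	const columns = header.split(',');
	const nameIndex = columns.indexOf('user_name');
	const ageIndex = columns.indexOf('age');

	return rows
		.map((row) => row.split(','))
		// Roughly equivalent to: WHERE cast(age as int) > 20
		.filter((fields) => parseInt(fields[ageIndex], 10) > minAge)
		// Roughly equivalent to: SELECT user_name
		.map((fields) => fields[nameIndex]);
}

console.log(selectUserNames(sampleCsv, 20)); // [ 'node4life', 'esfuture' ]
```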

Using Amazon S3 Select to select records

We can now use the AWS SDK for JavaScript with the Amazon S3 SelectObjectContent API to select records from JSON and CSV files stored in Amazon S3.

First, we want our application to create a new Amazon S3 client for the AWS Region that the my-bucket bucket is in. We’ll use this client to make the selectObjectContent API calls.

const S3 = require('aws-sdk/clients/s3');
const client = new S3({
	region: 'us-west-2'
});

By following the AWS SDK for JavaScript API documentation for selectObjectContent, our API request parameters could look like the following for the SQL expression we want to use.

const params = {
	Bucket: 'my-bucket',
	Key: 'target-file.csv',
	ExpressionType: 'SQL',
	Expression: 'SELECT user_name FROM S3Object WHERE cast(age as int) > 20',
	InputSerialization: {
		CSV: {
			FileHeaderInfo: 'USE',
			RecordDelimiter: '\n',
			FieldDelimiter: ','
		}
	},
	OutputSerialization: {
		CSV: {}
	}
};
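Because Amazon S3 Select also supports JSON-formatted data, the same query could be expressed against a JSON object. The following is a sketch under stated assumptions: the key target-file.json is hypothetical, the object is assumed to hold one JSON document per line, and age is assumed to be stored as a JSON number.

```javascript
// Hypothetical variant of the request parameters for JSON input.
// 'target-file.json' is an assumed key, not part of the original example.
const jsonParams = {
	Bucket: 'my-bucket',
	Key: 'target-file.json',
	ExpressionType: 'SQL',
	// Assumes age is stored as a JSON number, so no cast is used here
	Expression: 'SELECT s.user_name FROM S3Object s WHERE s.age > 20',
	InputSerialization: {
		JSON: {
			// 'LINES' means one JSON document per line
			Type: 'LINES'
		}
	},
	OutputSerialization: {
		JSON: {
			// Each selected record is emitted as JSON followed by a newline
			RecordDelimiter: '\n'
		}
	}
};
```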

Now we have everything ready to make the API call. How we handle processing the events depends on the environment the SDK is running in.

Using streams in Node.js

In Node.js, events can be processed as they arrive by reading them off of a stream. To accomplish this, Payload, the event stream member of the response, is an object mode Readable stream. Because the stream is in object mode, individual events are sent to downstream streams or emitted via the data event. This is in contrast to typical streams, which emit buffers or strings.

The following example shows how to process events and handle errors using a stream.

client.selectObjectContent(params, (err, data) => {
	if (err) {
		// Handle error
		return;
	}

	// data.Payload is a Readable Stream
	const eventStream = data.Payload;
	
	// Read events as they are available
	eventStream.on('data', (event) => {
		if (event.Records) {
			// event.Records.Payload is a buffer containing
			// a single record, partial records, or multiple records
			process.stdout.write(event.Records.Payload.toString());
		} else if (event.Stats) {
			console.log(`Processed ${event.Stats.Details.BytesProcessed} bytes`);
		} else if (event.End) {
			console.log('SelectObjectContent completed');
		}
	});

	// Handle errors encountered during the API call
	eventStream.on('error', (err) => {
		switch (err.name) {
			// Check against specific error codes that need custom handling
		}
	});

	eventStream.on('end', () => {
		// Finished receiving events from S3
	});
});
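As the comments in the example note, a single Records event can carry a partial record, so an application that needs whole records has to buffer across events. One way to do that is sketched below. The attachRecordHandler helper is a hypothetical name that is not part of the SDK, and the sketch assumes records are separated by '\n' in the CSV output.

```javascript
const { StringDecoder } = require('string_decoder');

// Hypothetical helper (not part of the SDK): reassembles whole records
// from Records events and passes each one to onRecord.
function attachRecordHandler(eventStream, onRecord) {
	// StringDecoder keeps multi-byte UTF-8 characters intact across chunks
	const decoder = new StringDecoder('utf8');
	let pending = '';

	eventStream.on('data', (event) => {
		if (!event.Records) {
			return;
		}
		pending += decoder.write(event.Records.Payload);
		const lines = pending.split('\n');
		// The last element is an incomplete record (or an empty string)
		pending = lines.pop();
		lines.forEach((line) => onRecord(line));
	});

	eventStream.on('end', () => {
		pending += decoder.end();
		// Flush a final record that had no trailing delimiter
		if (pending.length > 0) {
			onRecord(pending);
			pending = '';
		}
	});
}
```

Inside the selectObjectContent callback, this helper could be used as attachRecordHandler(data.Payload, (record) => console.log(record)); alongside the error handler shown above.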

Reading events in browsers and React Native

In browsers and React Native, the SDK must wait for the entire response from Amazon S3 before it can process events. In these environments, the Payload member is an array containing all of the events returned by Amazon S3.

client.selectObjectContent(params, (err, data) => {
	if (err) {
		switch (err.name) {
			// Check against specific error codes that need custom handling
		}
		return;
	}

	// data.Payload is an array of events
	const events = data.Payload;
	
	for (const event of events) {
		if (event.Records) {
			// event.Records.Payload is a buffer containing
			// a single record, partial records, or multiple records
			console.log(event.Records.Payload.toString());
		} else if (event.Stats) {
			console.log(`Processed ${event.Stats.Details.BytesProcessed} bytes`);
		} else if (event.End) {
			console.log('SelectObjectContent completed');
		}
	}
});
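Because all of the events are already available in these environments, the records can also be gathered in a single pass. The following is a minimal sketch, assuming each Records payload is a Uint8Array-compatible value and that records are separated by '\n'. The collectRecords helper is a hypothetical name that is not part of the SDK.

```javascript
// Hypothetical helper (not part of the SDK): joins all Records payloads
// from an array of events and returns the individual records.
function collectRecords(events) {
	const decoder = new TextDecoder('utf-8');
	let text = '';

	for (const event of events) {
		if (event.Records) {
			// stream: true keeps multi-byte characters intact across events
			text += decoder.decode(event.Records.Payload, { stream: true });
		}
	}
	// Flush any bytes the decoder is still holding
	text += decoder.decode();

	// Drop the empty string left by a trailing record delimiter
	return text.split('\n').filter((line) => line.length > 0);
}
```

Inside the callback, const records = collectRecords(data.Payload); would then yield one array entry per selected record.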

Final thoughts

With Amazon S3 Select, you can create SQL expressions to select a subset of CSV or JSON records from files stored in Amazon S3. The AWS SDK for JavaScript provides the tools to use this API to process events asynchronously as they arrive in Node.js, or after the full response has been received in browser and React Native environments.

Let us know what you think about this feature in the comments below!