AWS Developer Blog

Amazon S3 Select Support in the AWS SDK for .NET

We are releasing support for Amazon S3 Select in the AWS SDK for .NET. This feature enables developers to run simple SQL queries against objects in Amazon S3. Today, if you’re frequently pulling entire objects to use portions of them, this functionality could dramatically improve performance.

S3 Select works on objects stored in CSV format or JSON format. It also works with objects that are compressed with GZIP, and with server-side encrypted objects. You can specify the format of the results as CSV or JSON, and you can determine how the records in the result are delimited.

We continue to add support for more input and output formats. To see what is currently supported, check the API docs for S3 Select.

S3 Select is great for:

  • Running queries over your database backups, whether they’re relational or NoSQL.
  • Parsing configuration.
  • Diving into logs.

Prerequisites

  • A current version of the AWS SDK for .NET dependencies, those released today or later:
  • An Amazon S3 bucket, with a file you want to query:
    • You need to have read permission on these resources.
    • The file needs to be in JSON or CSV format.

Making a request

To make a request, you call the SelectObjectContent/SelectObjectContentAsync operation on an S3Client. You pass in a request object detailing your requirements.

For example, if I want to query an object that was stored as line-delimited JSON objects,

{"name": "Joe", "company": "AMAZON", "favorite_color": "blue"}
{"name": "Mike", "company": "WHOLE FOODS", "favorite_color": "green"}
{"name": "Jane", "company": "AMAZON", "favorite_color": "yellow"}
{"name": "Ash", "company": "AMAZON", "favorite_color": "white"}
{"name": "Hikari", "company": "GITHUB", "favorite_color": "cyan"}

and want JSON back, my request will look like this.

private static async Task<ISelectObjectContentEventStream> GetSelectObjectContentEventStream()
{
    var response = await Client.SelectObjectContentAsync(new SelectObjectContentRequest()
   {
        Bucket = _bucketName,
        Key = _keyName,
        ExpressionType = ExpressionType.SQL,
        Expression = "select * from S3Object s where s.COMPANY = 'AMAZON'",
        InputSerialization = new InputSerialization()
        {
            JSON = new JSONInput()
            {
                JsonType = JsonType.Lines
            }
        },
        OutputSerialization = new OutputSerialization()
        {
            JSON = new JSONOutput()
        }
     });

    return response.Payload;
}

This returns the following.

{"name":"Joe","company":"AMAZON","favorite_color":"blue"}
{"name":"Jane","company":"AMAZON","favorite_color":"yellow"}
{"name":"Ash","company":"AMAZON","favorite_color":"white"}

If I want my output in CSV format instead, perhaps with a semicolon delimiter, I change OutputSerialization to:

OutputSerialization = new OutputSerialization()
{
    CSV = new CSVOutput()
    {
        QuoteFields = QuoteFields.Always,
        FieldDelimiter = ";"
    }
}

This returns the following.

"Joe";"AMAZON";"blue"
"Jane";"AMAZON";"yellow"
"Ash";"AMAZON";"white"

Of course, if I have a CSV file as my input,

"NAME","COMPANY","FAVORITE_COLOR"
"Joe","AMAZON","blue"
"Mike","WHOLE FOODS","green"
"Jane","AMAZON","yellow"
"Ash","AMAZON","white"
"Hikari","GITHUB","cyan"

I will change InputSerialization.

InputSerialization = new InputSerialization()
{
    CSV = new CSVInput()
    {
        FileHeaderInfo = FileHeaderInfo.Use
    }
},

This yields the same result as the previous example.
To read up on these options, and to find those that fit your use case, see the S3 documentation for SelectObjectContent.

Using the response (EventStream)

Because S3 Select can be run on very large objects, it can take time for the operation to complete. To keep your applications using S3 Select responsive, this operation returns an EventStream as its response. An EventStream is different from responses returned from other service operations. It sends out “events” as S3 processes your request. These events can be consumed by either attaching event handlers to the stream and starting a background processor, or by using the provided enumerable support.

Driven by events

The SelectObjectContentEventStream comes with the following:

  • EventReceived – A generic “any-event” event.
  • ExceptionReceived – An event that’s raised when there is an exception on the stream.
  • An event that is typed to each of the “event” types:
    • RecordsEventReceived – Contains your records in the format you specify in OutputSerialization.
    • StatsEventReceived – Details the total amount of work the request performed.
    • ContinuationEventReceived- A Keep-Alive that can usually be ignored.
    • ProgressEventReceived – Contains information about the progress of a query that has started but is not yet complete.
    • EndEventReceived – Signals the end of the response.

This is useful if you’re querying a very large object and want to ensure that your program remains responsive (like a UI or a service). Or maybe you’re developing a web application and want to post asynchronous messages to your client as events become available.

Here is a sample that you can embed within a console application. It registers event handlers for the “any” event, the “Records” event, and the “End” event. It then starts the background loop by calling StartProcessing. It uses a wait handle to keep the program from exiting prematurely. The handle is “set” when an “End” event is received. The sample waits until this happens, or until five seconds elapse.

var endWaitHandle = new AutoResetEvent(false);

using (var eventStream = await GetSelectObjectContentEventStream())
{
    // Since everything happens on a background thread, exceptions are raised as events.
    // Here, we are just throwing the exception received.
    eventStream.ExceptionReceived += (sender, args) => throw args.EventStreamException;

    eventStream.EventReceived += (sender, args) =>
        Console.WriteLine($"Received {args.EventStreamEvent.GetType().Name}!");
    eventStream.RecordsEventReceived += (sender, args) =>
    {
        Console.WriteLine("The contents of the Records Event is...");
        using (var reader = new StreamReader(args.EventStreamEvent.Payload, Encoding.UTF8))
        {
            Console.Write(reader.ReadToEnd());
        }
    };
    eventStream.EndEventReceived += (sender, args) => endWaitHandle.Set();

    eventStream.StartProcessing();
    endWaitHandle.WaitOne(TimeSpan.FromSeconds(5));
}

Enumerable support

The response payload of SelectObjectContent (ISelectObjectContentEventStream) implements IEnumerable. This means you can take advantage of conventional C# constructs directly on the stream.

Enumerating doesn’t wait until the entire response is streamed. Events are processed as they become available, allowing you to process the response as the request is still ongoing.

For example, you can use a foreach loop with pattern matching.

using (var eventStream = await GetSelectObjectContentEventStream())
{
    foreach (var ev in eventStream)
    {
        Console.WriteLine($"Received {ev.GetType().Name}!");
        if (ev is RecordsEvent records)
        {
          Console.WriteLine("The contents of the Records Event is...");
            using (var reader = new StreamReader(records.Payload, Encoding.UTF8))
            {
                Console.Write(reader.ReadToEnd());
            }
        }
    }
}

Or you can use Linq.

using (var eventStream = await GetSelectObjectContentEventStream())
{
    var recordResults = eventStream
        .Where(ev => ev is RecordsEvent)
        .Cast<RecordsEvent>()
        .Select(records =>
        {
            using (var reader = new StreamReader(records.Payload, Encoding.UTF8))
            {
                return reader.ReadToEnd();
            }
        }).ToArray();
    var results = string.Join(Environment.NewLine, recordResults);
    Console.WriteLine(results);
}

Conclusion

Support for the Amazon S3 Select feature in the AWS SDK for .NET will enable .NET developers to query S3 objects using SQL, and move the data filtering to S3. This will simplify existing workflows, and increase the performance of use cases that query data stored in S3 buckets. The EventStream response is designed to be familiar and convenient to .NET developers.

For more information, take a look at the S3 developer and API docs for S3 Select.

To report issues and get help, file an issue in our Github repository.

We’re excited to see what you build!