In this module, you’ll create an Amazon Kinesis Data Firehose delivery stream to deliver data from the Amazon Kinesis stream created in the first module to Amazon Simple Storage Service (Amazon S3) in batches. You’ll then use Amazon Athena to run queries against the raw data in place.

The architecture for this module builds on the Amazon Kinesis stream you created in the first module. You’ll use Amazon Kinesis Data Firehose to batch the data and deliver it to Amazon S3 to archive it. Using Amazon Athena, you’ll run ad-hoc queries against the raw data in the Amazon S3 bucket.

Time to complete module: 15 minutes

Services used:
• Amazon Kinesis Data Firehose
• Amazon S3
• Amazon Athena

  • Step 1. Create an Amazon S3 bucket

Use the console or CLI to create an S3 bucket. Keep in mind, your bucket’s name must be globally unique. We recommend using a name such as wildrydes-data-yourname. (Console steps follow; a CLI sketch appears after them.)


    a. From the AWS Management Console select Services then select S3 under Storage.

    b. Select + Create bucket.

    c. Provide a globally unique name for your bucket such as wildrydes-data-yourname.

    d. Select the region you've been using for your bucket.

    e. Select Next three times, and then select Create bucket.
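
    If you prefer the CLI, a minimal sketch is below; the bucket name and region are examples, so substitute your own.

    # Bucket names must be globally unique; wildrydes-data-yourname is a placeholder
    aws s3 mb s3://wildrydes-data-yourname --region us-east-1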

  • Step 2. Create an Amazon Kinesis Data Firehose delivery stream

Create an Amazon Kinesis Data Firehose delivery stream named wildrydes that is configured to source data from the wildrydes stream and deliver its contents in batches to the S3 bucket created in the previous section. (A CLI alternative is sketched after these steps.)


    a. From the AWS Management Console select Services then select Kinesis under Analytics.

    b. Select Create delivery stream.

    c. Enter wildrydes into Delivery stream name.

    d. Select Kinesis data stream as Source and select wildrydes as the source stream.

    e. Select Next.

f. Leave Record transformation and Record format conversion disabled and select Next.

    g. Select Amazon S3 from Destination.

h. Choose the bucket you created in the previous section (e.g. wildrydes-data-johndoe) from S3 bucket.

    i. Select Next.

j. Enter 60 into Buffer interval under S3 Buffer to set the frequency of S3 deliveries to once per minute.

k. Scroll down to the bottom of the page and, under IAM role, select Create new or choose. In the new tab, select Allow.

    l. Select Next. Review the delivery stream details and select Create delivery stream.
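
    To script this step instead, the AWS CLI sketch below makes some assumptions: unlike the console, it requires an existing IAM role (wildrydes-firehose-role here is hypothetical) with permission to read the Kinesis stream and write to the bucket, and the account ID and region are placeholders.

    aws firehose create-delivery-stream \
        --delivery-stream-name wildrydes \
        --delivery-stream-type KinesisStreamAsSource \
        --kinesis-stream-source-configuration \
            'KinesisStreamARN=arn:aws:kinesis:us-east-1:123456789012:stream/wildrydes,RoleARN=arn:aws:iam::123456789012:role/wildrydes-firehose-role' \
        --extended-s3-destination-configuration \
            'RoleARN=arn:aws:iam::123456789012:role/wildrydes-firehose-role,BucketARN=arn:aws:s3:::wildrydes-data-yourname,BufferingHints={IntervalInSeconds=60,SizeInMBs=5}'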

  • Step 3. Create an Amazon Athena table

Create an Amazon Athena table to query the raw data in place on Amazon S3 using a JSON SerDe. Name the table wildrydes and include the attributes in the raw data (a CLI sketch follows the console steps):

    • Name (string)
    • StatusTime (timestamp)
    • Latitude (float)
    • Longitude (float)
    • Distance (float)
    • MagicPoints (int)
    • HealthPoints (int)

    a. Select Services then select Athena in the Analytics section.

b. If prompted, select Get started and exit the first-run tutorial by selecting the x in the upper right-hand corner of the modal dialog.

    c. Copy and paste the following SQL statement to create the table. Replace the YOUR_BUCKET_NAME_HERE placeholder with your bucket name (e.g. wildrydes-data-johndoe) in the LOCATION clause:

    CREATE EXTERNAL TABLE IF NOT EXISTS wildrydes (
      Name string,
      StatusTime timestamp,
      Latitude float,
      Longitude float,
      Distance float,
      HealthPoints int,
      MagicPoints int
    )
    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
    LOCATION 's3://YOUR_BUCKET_NAME_HERE/';

    d. Select Run Query.

    If this is your first time using Amazon Athena, you may see a message instructing you to set up a query result location in Amazon S3. If so, create a folder in an existing S3 bucket or create a new S3 bucket, such as athena-query-results-yourname, and then set it as your default query result location in Athena.

    e. Verify the table wildrydes was created by ensuring it has been added to the list of tables under the sampledb database in the left navigation.
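
    You can also run the DDL through the Athena API from the CLI; a sketch, assuming a query result location of s3://athena-query-results-yourname/ (a hypothetical bucket):

    # Runs the CREATE TABLE statement via the CLI; replace both bucket names
    aws athena start-query-execution \
        --query-string "CREATE EXTERNAL TABLE IF NOT EXISTS wildrydes (Name string, StatusTime timestamp, Latitude float, Longitude float, Distance float, HealthPoints int, MagicPoints int) ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe' LOCATION 's3://YOUR_BUCKET_NAME_HERE/'" \
        --query-execution-context Database=sampledb \
        --result-configuration OutputLocation=s3://athena-query-results-yourname/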

  • Step 4. Explore the batched data files

Using the AWS Management Console, navigate to the S3 bucket that you used as your Kinesis Data Firehose delivery target. Verify that Firehose is delivering batched data files to the bucket. Download one of the files and open it in a text editor to see the contents. (A CLI sketch follows these steps.)


    a. Select Services then select S3 in the Storage section.

b. Enter the name of the bucket you created in the first section in the Search for buckets text input.

    c. Select the bucket and navigate through the year, month, day, and hour folders to ensure that files are being populated in your bucket.

    d. Select one of the files and select Download. Open the file with a text editor and explore its content.
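
    The same check from the CLI, assuming the example bucket name (Firehose writes objects under a year/month/day/hour prefix by default):

    # List the batched objects Firehose has delivered so far
    aws s3 ls s3://wildrydes-data-yourname/ --recursive
    # Download one object to inspect locally; paste a key from the listing above
    aws s3 cp s3://wildrydes-data-yourname/OBJECT_KEY_FROM_LISTING ./wildrydes-batch.txt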

  • Step 5. Query the data files

Query the Amazon Athena table to see all records that have been delivered via Kinesis Data Firehose to S3. (A CLI sketch follows these steps.)


    a. Select Services then select Athena in the Analytics section.

    b. Copy and paste the following SQL query:

    SELECT * FROM wildrydes

    c. Select Run Query.
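
    To run the same query from the CLI, a sketch (again assuming the hypothetical athena-query-results-yourname result bucket):

    # Start the query and capture its execution ID
    QUERY_ID=$(aws athena start-query-execution \
        --query-string "SELECT * FROM wildrydes" \
        --query-execution-context Database=sampledb \
        --result-configuration OutputLocation=s3://athena-query-results-yourname/ \
        --query QueryExecutionId --output text)
    # Fetch the results once the query has finished (this may take a few seconds)
    aws athena get-query-results --query-execution-id "$QUERY_ID"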

    (Screenshot: data lake query results)
  • Recap & Tips


🔑 Amazon Kinesis Data Firehose is a fully managed service for delivering real-time streaming data to destinations such as Amazon S3. Amazon Athena lets you run ad-hoc queries against that raw data using standard SQL.

🔧 In this module, you’ve created a Kinesis Data Firehose delivery stream to deliver data from the Kinesis stream to an Amazon S3 bucket. Using Athena, you ran ad-hoc queries against that data in place on S3.