Q: What is Amazon Kinesis Data Analytics?

Amazon Kinesis Data Analytics is the easiest way to process and analyze real-time, streaming data. With Amazon Kinesis Data Analytics, you just use standard SQL to process your data streams, so you don’t have to learn any new programming languages. Simply point Kinesis Data Analytics at an incoming data stream, write your SQL queries, and specify where you want to load the results. Kinesis Data Analytics takes care of running your SQL queries continuously on data while it’s in transit and sending the results to the destinations.

Q: What is real-time stream processing and why do I need it?

Data is coming at us at lightning speeds due to an explosive growth of real-time data sources. Whether it is log data coming from mobile and web applications, purchase data from ecommerce sites, or sensor data from IoT devices, it all delivers information that can help companies learn about what their customers, organization, and business are doing right now. By having visibility into this data as it arrives, you can monitor your business in real-time and quickly leverage new business opportunities – like making promotional offers to customers based on where they might be at a specific time, or monitoring social sentiment and changing customer attitudes to identify and act on new opportunities.

To take advantage of these opportunities, you need a different set of analytics tools for collecting and analyzing real-time streaming data than what has been available traditionally for static, stored data. With traditional analytics, you gather the information, store it in a database, and analyze it hours, days, or weeks later. Analyzing real-time data requires a different approach and different tools and services. Instead of running database queries on stored data, streaming analytics platforms process the data continuously before the data is stored in a database. Streaming data flows at an incredible rate that can vary up and down all the time. Streaming analytics platforms have to be able to process this data when it arrives, often at speeds of millions of events per hour.

Q: What can I do with Kinesis Data Analytics?

You can use Kinesis Data Analytics in virtually any use case where you are collecting data continuously in real-time and want to get information and insights in seconds or minutes rather than waiting days or even weeks. In particular, Kinesis Data Analytics enables you to quickly build end-to-end stream processing applications for log analytics, clickstream analytics, Internet of Things (IoT), ad tech, gaming, and more. The three most common usage patterns are time-series analytics, real-time dashboards, and real-time alerts and notifications.

Generate Time-Series Analytics

Time-series analytics enables you to monitor and understand how your data is trending over time. With Kinesis Data Analytics, you can author SQL code that continuously generates time-series analytics over time windows. For example, you can build a live leaderboard for a mobile game by computing the top players every minute and then sending it to Amazon S3. Or, you can track the traffic to your website by calculating the number of unique website visitors every five minutes and then sending the processed results to Amazon Redshift.
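As a sketch of what such a windowed query might look like (the visitor_id column and the stream names below are illustrative, not taken from a real application, and assume exact distinct counts are acceptable at your data volume):

```sql
-- Sketch only: visitor_id and the stream names are hypothetical;
-- adapt them to your own input schema.
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    window_time     TIMESTAMP,
    unique_visitors BIGINT);

CREATE OR REPLACE PUMP "VISITOR_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM
        STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '5' MINUTE),
        COUNT(DISTINCT visitor_id)
    FROM "SOURCE_SQL_STREAM_001"
    -- Tumbling five-minute window
    GROUP BY STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '5' MINUTE);
```

The STEP function rounds each record's ROWTIME down to the nearest five-minute boundary, so each output row summarizes one complete window.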

Feed Real-Time Dashboards

You can build applications that compute query results and emit them to a live dashboard, enabling you to visualize the data in near real-time. For example, an application can continuously calculate business metrics such as the number of purchases from an e-commerce site, grouped by the product category, and then send the results to Amazon Redshift for visualization with a business intelligence tool of your choice. Consider another example where an application processes log data and calculates the number of application errors, and then sends the results to Amazon Elasticsearch Service for visualization with Kibana.

Create Real-Time Alarms and Notifications

You can build applications that send real-time alarms or notifications when certain metrics reach predefined thresholds, or, in more advanced cases, when your application detects anomalies using the machine learning algorithm we provide. For example, an application can compute the availability or success rate of a customer-facing API over time, and then send results to Amazon CloudWatch. You can build another application to look for events that meet certain criteria, and then automatically notify the right customers using Kinesis Data Streams and Amazon Simple Notification Service (SNS).

Q: How do I get started with Kinesis Data Analytics?

Sign in to the Kinesis Data Analytics console and create a new stream processing application. You can also use the AWS CLI and AWS SDKs. You can build an end-to-end application in three simple steps: 1) configure incoming streaming data, 2) write your SQL queries, and 3) point to where you want the results loaded. Kinesis Data Analytics recognizes standard data formats such as JSON, CSV, and TSV, and automatically creates a baseline schema. You can refine this schema, or if your data is unstructured, you can define a new one using our intuitive schema editor. Then, the service applies the schema to the input stream and makes it look like a SQL table that is continually updated so that you can write standard SQL queries against it. You use our SQL editor to build your queries. The SQL editor comes with all the bells and whistles including syntax checking and testing against live data. We also give you templates that provide the SQL code for anything from a simple stream filter to advanced anomaly detection and top-K analysis. Kinesis Data Analytics takes care of provisioning and elastically scaling all of the infrastructure to handle any data throughput. You don’t need to plan, provision, or manage infrastructure.

Q: What are the limits of Kinesis Data Analytics?

Kinesis Data Analytics elastically scales your application to accommodate the data throughput of your source stream and your query complexity for most scenarios. However, keep the following limits in mind while using Amazon Kinesis Data Analytics:

  • Each record can't exceed 50 KB. You can split records larger than 50 KB into multiple records when you define your input schema.
  • You can create up to five Kinesis Data Analytics applications per AWS Region in your account. You can raise this limit by submitting a service limit increase form.
  • You might have to parallelize your queries in order to keep up with the data in the stream. To do this, you can specify your input data stream to be mapped to up to 10 in-application streams.
  • The maximum number of Kinesis Processing Units (KPUs) is eight.
  • You can configure application output to persist results to up to four destinations.
  • The Amazon S3 object that stores reference data can be up to 1 GB in size.
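For the first limit, a producer-side sketch of the splitting approach might look like the following (the 50 KB figure comes from the record limit above; the chunking scheme itself is illustrative, and how you mark chunks for reassembly is up to your input schema design):

```python
# Split an oversized payload into chunks no larger than the 50 KB record
# limit before writing them to the stream.
MAX_RECORD_BYTES = 50 * 1024

def split_record(payload: bytes, max_bytes: int = MAX_RECORD_BYTES):
    """Return the payload as a list of chunks of at most max_bytes each."""
    return [payload[i:i + max_bytes] for i in range(0, len(payload), max_bytes)]

chunks = split_record(b"x" * (120 * 1024))
print([len(c) for c in chunks])  # [51200, 51200, 20480]
```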

Q: What is a Kinesis Data Analytics application?

An application is the Kinesis Data Analytics entity that you work with. Kinesis Data Analytics applications continuously read and process streaming data in real-time. You write application code using SQL to process the incoming streaming data and produce output. Then, Kinesis Data Analytics writes the output to a configured destination.

Each application consists of three primary components:

  • Input – The streaming source for your application. In the input configuration, you map the streaming source to an in-application input stream. The in-application stream is like a continuously updating table upon which you can perform SELECT and INSERT SQL operations. Each input record has an associated schema, which is applied as part of inserting the record into the in-application stream.
  • Application code – A series of SQL statements that process input and produce output. In its simplest form, application code can be a single SQL statement that selects from a streaming input and inserts results into a streaming output. It can also be a series of SQL statements where the output of one feeds into the input of the next SQL statement. Further, you can write application code to split an input stream into multiple streams and then apply additional queries to process these separate streams. 
  • Output – You can create one or more in-application streams to hold intermediate results. You can then optionally configure an application output to persist data from specific in-application streams to an external destination. 

Q: What is an in-application stream?

An in-application stream is an entity that continuously stores data in your application so that you can perform SELECT and INSERT SQL operations on it. You interact with an in-application stream in the same way that you would a SQL table; however, a stream differs from a table in that its data is continuously updated. In your application code, you can create additional in-application streams to store intermediate query results. Finally, both your configured input and output are represented in your application as in-application streams.

Q: What inputs are supported in a Kinesis Data Analytics application?

Kinesis Data Analytics supports two types of inputs: streaming data sources and reference data sources. A streaming data source is continuously generated data that is read into your application for processing. A reference data source is static data that your application uses to enrich data coming in from streaming sources. Each application can have no more than one streaming data source and no more than one reference data source. An application continuously reads and processes new data from streaming data sources, including Amazon Kinesis Data Streams or Amazon Kinesis Data Firehose. An application reads a reference data source, including Amazon S3, in its entirety for use in enriching the streaming data source through SQL JOINs.

Q: What is a reference data source?

A reference data source is static data that your application uses to enrich data coming in from streaming sources. You store reference data as an object in your S3 bucket. When the application starts, Kinesis Data Analytics reads the S3 object and creates an in-application SQL table to store the reference data. Your application code can then join it with an in-application stream. You can update the data in the SQL table by calling the UpdateApplication API.
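For example, a join against the reference table might look like the following sketch (the COMPANY table and its columns are hypothetical names for data loaded from your S3 object):

```sql
-- Sketch: enrich each streaming record with a company name looked up
-- from the in-application reference table. All names are illustrative.
CREATE OR REPLACE STREAM "ENRICHED_STREAM" (
    ticker_symbol VARCHAR(4),
    company_name  VARCHAR(64),
    price         DOUBLE);

CREATE OR REPLACE PUMP "ENRICH_PUMP" AS INSERT INTO "ENRICHED_STREAM"
    SELECT STREAM s.ticker_symbol, c.company_name, s.price
    FROM "SOURCE_SQL_STREAM_001" AS s
    JOIN "COMPANY" AS c
      ON s.ticker_symbol = c.ticker_symbol;
```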

Q: What application code is supported?

Kinesis Data Analytics supports ANSI SQL with some extensions to the SQL standard that make it easier to work with streaming data. Additionally, Kinesis Data Analytics provides several machine learning algorithms that are exposed as SQL functions, including anomaly detection, approximate top-K, and approximate distinct items.

Q: What destinations are supported?

Kinesis Data Analytics supports up to four destinations per application. You can persist SQL results to Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service (each through an Amazon Kinesis Data Firehose delivery stream), as well as to Amazon Kinesis Data Streams. You can reach a destination not directly supported by Kinesis Data Analytics by sending SQL results to Amazon Kinesis Data Streams and leveraging its integration with AWS Lambda to send them to a destination of your choice.


Q: How do I set up a streaming data source?

A streaming data source can be an Amazon Kinesis data stream or an Amazon Kinesis Data Firehose delivery stream. Your Kinesis Data Analytics application continuously reads new data from streaming data sources as it arrives in real time. The data is made accessible in your SQL code through an in-application stream. An in-application stream acts like a SQL table because you can create, insert, and select from it. However, the difference is that an in-application stream is continuously updated with new data from the streaming data source.

You can use the AWS Management Console to add a streaming data source. You can learn more about sources in the Configuring Application Input section of the Kinesis Data Analytics Developer Guide.

Q: How do I set up a reference data source?

A reference data source can be an Amazon S3 object. Your Kinesis Data Analytics application reads the S3 object in its entirety when it starts running. The data is made accessible in your SQL code through a table. The most common use case for using a reference data source is to enrich the data coming from the streaming data source using a SQL JOIN.

Using the AWS CLI, you can add a reference data source by specifying the S3 bucket, object, IAM role, and associated schema. Kinesis Data Analytics loads this data when you start the application, and reloads it each time you make any update API call.

Q: What data formats are supported?

Kinesis Data Analytics detects the schema and automatically parses UTF-8 encoded JSON and CSV records using the DiscoverInputSchema API. This schema is applied to the data read from the stream as part of the insertion into an in-application stream.

For other UTF-8 encoded data that does not use a delimiter, uses a different delimiter than CSV, or in cases where the discovery API did not fully discover the schema, you can define a schema using the interactive schema editor or use string manipulation functions to structure your data. For more information, see Using the Schema Discovery Feature and Related Editing in the Kinesis Data Analytics Developer Guide.

Q: How is my input stream exposed to my SQL code?

Kinesis Data Analytics applies your specified schema and inserts your data into one or more in-application streams for streaming sources, and into a single SQL table for reference sources. The default number of in-application streams meets the needs of most use cases. You should increase it if you find that your application is not keeping up with the latest data in your source stream, as measured by the CloudWatch metric MillisBehindLatest. The number of in-application streams required depends on both the throughput of your source stream and your query complexity. The parameter that specifies the number of in-application streams mapped to your source stream is called input parallelism.
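In the application's input configuration, input parallelism is set through the InputParallelism field. A minimal sketch of the relevant fragment (the surrounding input fields, such as the streaming source ARN, are omitted here):

```json
{
  "Inputs": [
    {
      "NamePrefix": "SOURCE_SQL_STREAM",
      "InputParallelism": { "Count": 2 }
    }
  ]
}
```

With a count of two, records are inserted into SOURCE_SQL_STREAM_001 and SOURCE_SQL_STREAM_002.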


Q: What does my application code look like?

Application code is a series of SQL statements that process input and produce output. These SQL statements operate on in-application streams and reference tables. An in-application stream is like a continuously updating table on which you can perform the SELECT and INSERT SQL operations. Your configured sources and destinations are exposed to your SQL code through in-application streams. You can also create additional in-application streams to store intermediate query results.

You can use the following pattern to work with in-application streams:

  • Always use a SELECT statement in the context of an INSERT statement. When you select rows, you insert results into another in-application stream.
  • Use an INSERT statement in the context of a pump. You use a pump to make an INSERT statement continuous, and write to an in-application stream.
  • You use a pump to tie in-application streams together, selecting from one in-application stream and inserting into another in-application stream. 

The following SQL code provides a simple, working application:

CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
    ticker_symbol VARCHAR(4),
    change DOUBLE,
    price DOUBLE);

CREATE OR REPLACE PUMP "STREAM_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
    SELECT STREAM ticker_symbol, change, price
    FROM "SOURCE_SQL_STREAM_001";

For more information about application code, see Application Code in the Kinesis Data Analytics Developer Guide.

Q: How does Kinesis Data Analytics help me with writing SQL code?

Kinesis Data Analytics includes a library of analytics templates for common use cases including streaming filters, tumbling time windows, and anomaly detection. You can access these templates from the SQL editor in the AWS Management Console. After you create an application and navigate to the SQL editor, the templates are available in the upper-left corner of the console.

Q: How can I perform real-time anomaly detection in Kinesis Data Analytics?

Kinesis Data Analytics includes pre-built SQL functions for several advanced analytics including one for anomaly detection. You can simply make a call to this function from your SQL code for detecting anomalies in real-time. Kinesis Data Analytics uses the Random Cut Forest algorithm to implement anomaly detection. For more information on Random Cut Forests, see the Streaming Data Anomaly Detection whitepaper.
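A sketch of calling the function (the stream names and the choice of a single numeric column are illustrative; RANDOM_CUT_FOREST operates on the numeric columns supplied through the cursor and emits an ANOMALY_SCORE for each record):

```sql
-- Sketch: score each record for anomalies; higher scores are more anomalous.
CREATE OR REPLACE STREAM "ANOMALY_STREAM" (
    price         DOUBLE,
    anomaly_score DOUBLE);

CREATE OR REPLACE PUMP "ANOMALY_PUMP" AS INSERT INTO "ANOMALY_STREAM"
    SELECT STREAM price, ANOMALY_SCORE
    FROM TABLE (RANDOM_CUT_FOREST(
        CURSOR (SELECT STREAM price FROM "SOURCE_SQL_STREAM_001")));
```

You could then filter or alarm on rows whose anomaly_score exceeds a threshold you choose for your data.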


Q: How do I set up a destination?

In your application code, you write the output of SQL statements to one or more in-application streams. Optionally, you can add an output configuration to your application to persist everything written to specific in-application streams to up to four external destinations. These external destinations can be an Amazon S3 bucket, an Amazon Redshift table, an Amazon Elasticsearch Service domain (through Amazon Kinesis Data Firehose), or an Amazon Kinesis data stream. Each application supports up to four destinations, which can be any combination of the above. For more information, see Configuring Output Streams in the Kinesis Data Analytics Developer Guide.

Q: My preferred destination is not directly supported. How can I send SQL results to this destination?

You can use AWS Lambda to write to a destination that is not directly supported. We recommend that you write results to an Amazon Kinesis data stream, and then use AWS Lambda to read the processed results and send them to the destination of your choice. For more information, see Example: AWS Lambda Integration in the Kinesis Data Analytics Developer Guide. Alternatively, you can use an Amazon Kinesis Data Firehose delivery stream to load the data into Amazon S3, and then trigger an AWS Lambda function to read that data and send it to the destination of your choice. For more information, see Using AWS Lambda with Amazon S3 in the AWS Lambda Developer Guide.
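As a minimal sketch of the Lambda half of that pattern (the event shape follows the standard Kinesis-to-Lambda event mapping; forward_to_destination is a hypothetical stand-in for your own delivery logic):

```python
import base64
import json

def forward_to_destination(batch):
    # Hypothetical placeholder: replace with delivery logic for your
    # destination (an HTTP POST, a database write, etc.).
    print(f"delivering {len(batch)} records")

def handler(event, context):
    """Decode SQL results read from a Kinesis data stream and forward them."""
    results = []
    for record in event["Records"]:
        # Kinesis record payloads arrive base64-encoded in the Lambda event
        payload = base64.b64decode(record["kinesis"]["data"])
        results.append(json.loads(payload))
    forward_to_destination(results)
    return {"processed": len(results)}
```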

Q: What delivery model does Kinesis Data Analytics provide?

Kinesis Data Analytics uses an "at least once" delivery model for application output to the configured destinations. Kinesis Data Analytics applications take internal checkpoints, which are points in time when output records were delivered to the destinations and there was no data loss. The service uses the checkpoints as needed to ensure that your application output is delivered at least once to the configured destinations. For more information about the delivery model, see Configuring Application Output in the Kinesis Data Analytics Developer Guide.


Q: How can I monitor the operations and performance of my Kinesis Data Analytics applications?

AWS provides various tools that you can use to monitor your Kinesis Data Analytics applications. You can configure some of these tools to do the monitoring for you. For more information about how to monitor your application, see Monitoring Kinesis Data Analytics in the Kinesis Data Analytics Developer Guide.

Q: How do I manage and control access to my Kinesis Data Analytics applications?

Kinesis Data Analytics needs permissions to read records from the streaming data sources that you specify in your application. Kinesis Data Analytics also needs permissions to write your application output to data streams that you specify in your application output configuration. You can grant these permissions by creating IAM roles that Kinesis Data Analytics can assume. The permissions you grant to this role determine what Kinesis Data Analytics can do when the service assumes the role. For more information, see Granting Permissions in the Kinesis Data Analytics Developer Guide.

Q: How does Kinesis Data Analytics scale my application?

Kinesis Data Analytics elastically scales your application to accommodate the data throughput of your source stream and your query complexity for most scenarios. Kinesis Data Analytics provisions capacity in the form of Amazon Kinesis Processing Units (KPUs). A single KPU provides you with 4 GB of memory and corresponding compute and networking.

Each streaming source is mapped to a corresponding in-application stream. While this is not required for many applications, you can use KPUs more efficiently by increasing the number of in-application streams that your source is mapped to, by specifying the input parallelism parameter. Kinesis Data Analytics evenly assigns the streaming data source’s partitions, such as an Amazon Kinesis data stream’s shards, across the number of in-application streams that you specify. For example, if you have a 10-shard Amazon Kinesis data stream as a streaming data source and you specify an input parallelism of two, Kinesis Data Analytics assigns five shards each to the two in-application streams named “SOURCE_SQL_STREAM_001” and “SOURCE_SQL_STREAM_002”. For more information, see Configuring Application Input in the Kinesis Data Analytics Developer Guide.

Q: What are the best practices for building and managing my Kinesis Data Analytics applications?

For information about best practices, see the Best Practices section of the Kinesis Data Analytics Developer Guide, which covers managing applications, defining input schema, connecting to outputs, and authoring application code.


Q: How do I get a particular SQL statement to work correctly?

For details, see Example Applications in the Kinesis Data Analytics Developer Guide, which has a number of SQL examples that you can use. In addition, the Kinesis Data Analytics SQL Reference provides a detailed guide to authoring streaming SQL statements. If you are still running into issues, we recommend that you ask a question on the Amazon Kinesis Forums.

Q: Kinesis Data Analytics was unable to detect or discover my schema. How can I use Kinesis Data Analytics?

For other UTF-8 encoded data that does not use a delimiter, uses a different delimiter than CSV, or in cases where the discovery API did not fully discover the schema, you can define a schema by hand or use string manipulation functions to structure your data. For more information, see Using the Schema Discovery Feature and Related Editing in the Kinesis Data Analytics Developer Guide.

Q: What are the important parameters I should monitor to make sure my application is running correctly?

The most important parameter to monitor is the CloudWatch metric, MillisBehindLatest, which represents how far behind from the current time you are reading from the stream. This metric provides an effective mechanism to determine whether you are processing records from the source stream fast enough. You should set up a CloudWatch Alarm to trigger if you fall behind more than one hour (this number depends on your use case and can be adjusted as needed). You can learn more in the Best Practices section of the Kinesis Data Analytics Developer Guide.
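As a sketch, such an alarm could be created from the AWS CLI (the application name, alarm name, and SNS topic ARN below are placeholders; verify the metric's dimension names against your application's metrics in the CloudWatch console):

```shell
# Placeholder values throughout: replace the application name, topic ARN,
# and threshold (3600000 ms = 1 hour) to suit your use case.
aws cloudwatch put-metric-alarm \
    --alarm-name my-app-falling-behind \
    --namespace AWS/KinesisAnalytics \
    --metric-name MillisBehindLatest \
    --dimensions Name=Application,Value=my-application \
    --statistic Maximum \
    --period 300 \
    --evaluation-periods 1 \
    --comparison-operator GreaterThanThreshold \
    --threshold 3600000 \
    --alarm-actions arn:aws:sns:us-east-1:123456789012:my-alerts-topic
```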

Q: How do I troubleshoot invalid code errors when running a Kinesis Data Analytics application?
For information about invalid code errors and troubleshooting your Kinesis Data Analytics application, see Troubleshooting in the Amazon Kinesis Data Analytics Developer Guide.


Q: How much does Kinesis Data Analytics cost?

With Kinesis Data Analytics, you pay only for what you use. You are charged an hourly rate based on the average number of Kinesis Processing Units (or KPUs) used to run your stream processing application. We round up to the nearest whole KPU.

A single KPU is a stream processing resource comprising memory (4 GB), compute (1 vCPU), and corresponding networking capabilities. Because your streaming application's memory and compute consumption vary during processing, Kinesis Data Analytics automatically and elastically scales the number of KPUs based on your streaming workload. There are no resources to provision and no upfront costs or minimum fees associated with Kinesis Data Analytics.
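The hourly charge can be estimated with simple arithmetic. The rate used below is an illustrative placeholder, since actual per-KPU-hour prices vary by region (see the pricing page); the rounding and one-KPU minimum follow the billing rules described above:

```python
import math

# Hypothetical rate; actual per-KPU-hour pricing varies by region.
PRICE_PER_KPU_HOUR = 0.11

def hourly_charge(avg_kpus: float) -> float:
    """Estimate an hourly charge from average KPU usage.

    Billing rounds the average up to the nearest whole KPU, with a
    minimum of one KPU while the application is running.
    """
    billed_kpus = max(1, math.ceil(avg_kpus))
    return billed_kpus * PRICE_PER_KPU_HOUR

print(hourly_charge(2.3))  # billed as 3 KPUs
```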

For more information about pricing, see the Kinesis Data Analytics pricing page.

Q: Is Kinesis Data Analytics available in AWS Free Tier?

No. Kinesis Data Analytics is not currently available in AWS Free Tier. AWS Free Tier is a program that offers free trials for a group of AWS services.

Q: Am I charged for a Kinesis Data Analytics application that is running but not processing any data from the source?

You are charged a minimum of one KPU while your Kinesis Data Analytics application is running.

Q: Other than Kinesis Data Analytics costs, are there any other costs that I might incur?

Kinesis Data Analytics is a fully managed stream processing solution, independent from the streaming source that it reads data from and the destinations it writes processed data to. You will be billed independently for your Kinesis Data Firehose and Kinesis Data Streams usage costs related to your input and output streams.

Q: How does Kinesis Data Analytics differ from running my own application using the Amazon Kinesis Client Library?

The Amazon Kinesis Client Library (KCL) is a pre-built library that helps you build consumer applications for reading and processing data from an Amazon Kinesis data stream. The KCL handles complex issues such as adapting to changes in data stream volume, load balancing streaming data, coordinating distributed services, and processing data with fault-tolerance. The KCL enables you to focus on business logic while building applications.

With Kinesis Data Analytics, you can process and query real-time, streaming data. You use standard SQL to process your data streams, so you don’t have to learn any new programming languages. You just point Kinesis Data Analytics to an incoming data stream, write your SQL queries, and then specify where you want the results loaded. Kinesis Data Analytics uses the KCL to read data from streaming data sources as one part of your underlying application. The service abstracts this from you, as well as many of the more complex concepts associated with using the KCL, such as checkpointing.

If you want a fully managed solution and you want to use SQL to process the data from your data stream, you should use Kinesis Data Analytics. Use the KCL if you need to build a custom processing solution whose requirements are not met by Kinesis Data Analytics, and you are able to manage the resulting consumer application.