Amazon Managed Service for Apache Flink FAQs
General
What is Amazon Managed Service for Apache Flink?
With Amazon Managed Service for Apache Flink, you can transform and analyze streaming data in real time with Apache Flink. Apache Flink is an open source framework and engine for processing data streams. Amazon Managed Service for Apache Flink reduces the complexity of building, managing, and integrating Apache Flink applications with other AWS services.
Amazon Managed Service for Apache Flink takes care of everything required to continuously run streaming applications and scales automatically to match the volume and throughput of your incoming data. With Amazon Managed Service for Apache Flink, there are no servers to manage, there is no minimum fee or setup cost, and you only pay for the resources your streaming applications consume.
What is real-time stream processing and why do I need it?
What can I do with Amazon Managed Service for Apache Flink?
You can use Amazon Managed Service for Apache Flink for many use cases to process data continuously, getting insights in seconds or minutes rather than waiting days or even weeks. Amazon Managed Service for Apache Flink enables you to quickly build end-to-end stream processing applications for log analytics, clickstream analytics, Internet of Things (IoT), ad tech, gaming, and more. The four most common use cases are streaming extract-transform-load (ETL), continuous metric generation, responsive real-time analytics, and interactive querying of data streams.
Streaming ETL
With streaming ETL applications, you can clean, enrich, organize, and transform raw data prior to loading your data lake or data warehouse in real time, reducing or eliminating batch ETL steps. These applications can buffer small records into larger files prior to delivery and perform sophisticated joins across streams and tables. For example, you can build an application that continuously reads IoT sensor data stored in Amazon Managed Streaming for Apache Kafka (Amazon MSK), organizes the data by sensor type, removes duplicate data, normalizes the data per a specified schema, and then delivers the data to Amazon Simple Storage Service (Amazon S3).
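The deduplicate, normalize, and organize steps in the example above can be sketched in plain Java. This is only an illustration of the logic, not Flink API: the Reading type, its field names, and the choice of lowercasing the sensor type as the "normalization" are assumptions made for the sketch. In a real streaming job the set of seen IDs would live in managed state rather than an in-memory set.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Plain-Java illustration of the dedupe/normalize/organize steps (not Flink API). */
final class SensorEtlSketch {

    /** Hypothetical sensor reading; the field names are assumptions for illustration. */
    static final class Reading {
        final String readingId;
        final String sensorType;
        final double value;

        Reading(String readingId, String sensorType, double value) {
            this.readingId = readingId;
            this.sensorType = sensorType;
            this.value = value;
        }
    }

    /** Drops readings whose ID was already seen, then groups the rest by normalized sensor type. */
    static Map<String, List<Reading>> dedupeAndGroup(List<Reading> rawReadings) {
        Set<String> seenIds = new HashSet<>();
        Map<String, List<Reading>> bySensorType = new LinkedHashMap<>();
        for (Reading reading : rawReadings) {
            if (!seenIds.add(reading.readingId)) {
                continue; // Set.add returns false for an ID that was already present.
            }
            // Normalize the type label so "Temp " and "temp" land in the same group.
            String type = reading.sensorType.trim().toLowerCase(Locale.ROOT);
            Reading normalized = new Reading(reading.readingId, type, reading.value);
            bySensorType.computeIfAbsent(type, key -> new ArrayList<>()).add(normalized);
        }
        return bySensorType;
    }
}
```

Calling SensorEtlSketch.dedupeAndGroup on a batch of readings returns one list per sensor type, with repeated reading IDs removed.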
Continuous metric generation
With continuous metric generation applications, you can monitor and understand how your data is trending over time. Your applications can aggregate streaming data into critical information and seamlessly integrate it with reporting databases and monitoring services to serve your applications and users in real time. With Amazon Managed Service for Apache Flink, you can use Apache Flink code (in Java, Scala, Python, or SQL) to continuously generate time series analytics over time windows. For example, you can build a live leaderboard for a mobile game by computing the top players every minute and then sending it to Amazon DynamoDB. You can also track the traffic to your website by calculating the number of unique website visitors every 5 minutes and then sending the processed results to Amazon Redshift.
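To make the "unique visitors every 5 minutes" example concrete, here is a minimal plain-Java sketch of what a tumbling-window distinct count computes. It is not Flink API and it works on a finite list rather than an unbounded stream; the PageView type and its fields are assumptions made for the illustration.

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

/** Plain-Java illustration of a tumbling-window distinct count (not Flink API). */
final class UniqueVisitorWindows {

    /** Hypothetical page-view event: who visited, and when (epoch milliseconds). */
    static final class PageView {
        final String visitorId;
        final long timestampMillis;

        PageView(String visitorId, long timestampMillis) {
            this.visitorId = visitorId;
            this.timestampMillis = timestampMillis;
        }
    }

    /**
     * Assigns each event to the fixed-size window that contains its timestamp and
     * counts distinct visitors per window. Map keys are window start times.
     */
    static Map<Long, Integer> countUniqueVisitors(Iterable<PageView> views, long windowMillis) {
        Map<Long, Set<String>> visitorsByWindow = new TreeMap<>();
        for (PageView view : views) {
            // Round the timestamp down to the start of its window.
            long windowStart = view.timestampMillis - Math.floorMod(view.timestampMillis, windowMillis);
            visitorsByWindow.computeIfAbsent(windowStart, key -> new HashSet<>()).add(view.visitorId);
        }
        Map<Long, Integer> counts = new TreeMap<>();
        visitorsByWindow.forEach((start, visitors) -> counts.put(start, visitors.size()));
        return counts;
    }
}
```

With a window size of 300,000 milliseconds, each key in the result is the start of a 5-minute window and each value is the number of distinct visitor IDs seen in it.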
Responsive real-time analytics
Responsive real-time analytics applications send real-time alarms or notifications when certain metrics reach predefined thresholds or, in more advanced cases, when your application detects anomalies using machine learning (ML) algorithms. With these applications, you can respond to changes in your business in real time, for example by predicting user abandonment in mobile apps or identifying degraded systems. For example, an application can compute the availability or success rate of a customer-facing API over time and then send results to Amazon CloudWatch. You can build another application to look for events that meet certain criteria, and then automatically notify the right customers using Amazon Kinesis Data Streams and Amazon Simple Notification Service (Amazon SNS).
Interactive analysis of data streams
Interactive analysis helps you explore streaming data in real time. With ad hoc queries or programs, you can inspect streams from Amazon MSK or Amazon Kinesis Data Streams and visualize what the data looks like within those streams. For example, you can view how a real-time metric that computes the average over a time window behaves and send the aggregated data to a destination of your choice. Interactive analysis also helps with iterative development of stream processing applications. The queries you build continuously update as new data arrives. With Amazon Managed Service for Apache Flink Studio, you can deploy these queries to run continuously with auto scaling and durable state backups enabled.
Getting started
How do I get started with Apache Flink applications for Amazon Managed Service for Apache Flink?
How do I get started with Apache Beam applications for Amazon Managed Service for Apache Flink?
How do I get started with Amazon Managed Service for Apache Flink Studio?
What are the limits of Amazon Managed Service for Apache Flink?
Does Amazon Managed Service for Apache Flink support schema registration?
Yes, by using Apache Flink DataStream Connectors, Amazon Managed Service for Apache Flink applications can use AWS Glue Schema Registry, a serverless feature of AWS Glue. You can integrate Apache Kafka, Amazon MSK, and Amazon Kinesis Data Streams, as a sink or a source, with your Amazon Managed Service for Apache Flink workloads. Visit the AWS Glue Schema Registry Developer Guide to get started and learn more.
Key concepts
What is an Amazon Managed Service for Apache Flink application?
An application is the Amazon Managed Service for Apache Flink entity that you work with. Amazon Managed Service for Apache Flink applications continuously read and process streaming data in real time. You write application code in an Apache Flink–supported language to process the incoming streaming data and produce output. Then, Amazon Managed Service for Apache Flink writes the output to a configured destination.
Each application consists of three primary components:
- Input: Input is the streaming source for your application. In the input configuration, you map the streaming sources to data streams. Data flows from your data sources into your data streams. You process data from these data streams using your application code, sending processed data to subsequent data streams or destinations. You add inputs inside application code for Apache Flink applications and Studio notebooks and through the API for Amazon Managed Service for Apache Flink applications.
- Application code: Application code is a series of Apache Flink operators that process input and produce output. In its simplest form, application code can be a single Apache Flink operator that reads from a data stream associated with a streaming source and writes to another data stream associated with an output. For a Studio notebook, this could be a simple Flink SQL select query, with the results shown in context within the notebook. You can write Apache Flink code in its supported languages for Amazon Managed Service for Apache Flink applications or Studio notebooks.
- Output: You can then optionally configure an application output to persist data to an external destination. You add these outputs inside application code for Amazon Managed Service for Apache Flink applications and Studio notebooks.
What application code is supported?
Managing applications
How can I monitor the operations and performance of my Amazon Managed Service for Apache Flink applications?
AWS provides various tools that you can use to monitor your Amazon Managed Service for Apache Flink applications including access to the Flink Dashboard for Apache Flink applications. You can configure some of these tools to do the monitoring for you. For more information about how to monitor your application, explore the following developer guides:
- Monitoring Amazon Managed Service for Apache Flink in the Amazon Managed Service for Apache Flink Developer Guide.
- Monitoring Amazon Managed Service for Apache Flink in the Amazon Managed Service for Apache Flink Studio Developer Guide.
How do I manage and control access to my Amazon Managed Service for Apache Flink applications?
Amazon Managed Service for Apache Flink needs permissions to read records from the streaming data sources you specify in your application. Amazon Managed Service for Apache Flink also needs permissions to write your application output to specified destinations in your application output configuration. You can grant these permissions by creating AWS Identity and Access Management (IAM) roles that Amazon Managed Service for Apache Flink can assume. The permissions you grant to this role determine what Amazon Managed Service for Apache Flink can do when the service assumes the role. For more information, see the following developer guides:
- Granting permissions in the Amazon Managed Service for Apache Flink Developer Guide.
- Granting permissions in the Amazon Managed Service for Apache Flink Studio Developer Guide.
How does Amazon Managed Service for Apache Flink scale my application?
Amazon Managed Service for Apache Flink elastically scales your application to accommodate the data throughput of your source stream and your query complexity for most scenarios. Amazon Managed Service for Apache Flink provisions capacity in the form of Kinesis Processing Units (KPUs). One KPU provides you with 1 vCPU and 4 GB of memory.
For Apache Flink applications and Studio notebooks, Amazon Managed Service for Apache Flink assigns 50 GB of running application storage per KPU. Your application uses this storage for checkpoints, and it is also available to you as temporary disk. A checkpoint is an up-to-date backup of a running application used to recover immediately from an application disruption. You can also control the parallel execution of your Amazon Managed Service for Apache Flink application tasks (such as reading from a source or executing an operator) using the Parallelism and ParallelismPerKPU parameters in the API. Parallelism defines the number of concurrent instances of a task. All operators, sources, and sinks run with a defined parallelism, which is one by default. ParallelismPerKPU defines the number of parallel tasks that can be scheduled per KPU of your application, which is also one by default. For more information, see Scaling in the Amazon Managed Service for Apache Flink Developer Guide.
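The relationship between these parameters and the resources an application receives can be sketched as simple arithmetic. The sketch below assumes that the number of allocated KPUs is Parallelism divided by ParallelismPerKPU, rounded up; that formula is an assumption drawn from the service's scaling documentation rather than from this FAQ, so verify it there. The per-KPU figures (4 GB memory, 50 GB running application storage) come from the text above. The class and method names are hypothetical.

```java
/** Back-of-the-envelope KPU sizing; the rounding rule is an assumption, see the lead-in. */
final class KpuSizing {
    static final int MEMORY_GB_PER_KPU = 4;
    static final int STORAGE_GB_PER_KPU = 50;

    /** Assumed rule: KPUs = ceil(parallelism / parallelismPerKpu). */
    static int allocatedKpus(int parallelism, int parallelismPerKpu) {
        if (parallelism < 1 || parallelismPerKpu < 1) {
            throw new IllegalArgumentException("both parameters must be at least 1");
        }
        // Integer ceiling division without floating point.
        return (parallelism + parallelismPerKpu - 1) / parallelismPerKpu;
    }

    /** Memory available across the allocated KPUs, in GB. */
    static int memoryGb(int kpus) {
        return kpus * MEMORY_GB_PER_KPU;
    }

    /** Running application storage across the allocated KPUs, in GB. */
    static int runningStorageGb(int kpus) {
        return kpus * STORAGE_GB_PER_KPU;
    }
}
```

Under that assumption, a Parallelism of 8 with a ParallelismPerKPU of 2 corresponds to 4 KPUs, that is, 16 GB of memory and 200 GB of running application storage.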
What are the best practices for building and managing my Amazon Managed Service for Apache Flink applications?
For information about best practices for Apache Flink, see the Best Practices section of the Amazon Managed Service for Apache Flink Developer Guide. The section covers best practices for fault tolerance, performance, logging, coding, and more.
For information about best practices for Amazon Managed Service for Apache Flink Studio, see the Best Practices section of the Amazon Managed Service for Apache Flink Studio Developer Guide. In addition to best practices, this section covers samples for SQL, Python, and Scala applications, requirements for deploying your code as a continuously running stream processing application, performance, logging, and more.
Can I access resources behind an Amazon VPC with an Amazon Managed Service for Apache Flink application?
Yes. You can access resources behind an Amazon VPC. You can learn how to configure your application for VPC access in the Using an Amazon VPC section of the Amazon Managed Service for Apache Flink Developer Guide.
Can a single Amazon Managed Service for Apache Flink application have access to multiple VPCs?
Can an Amazon Managed Service for Apache Flink application that’s connected to a VPC access the internet and AWS service endpoints?
Amazon Managed Service for Apache Flink applications and Amazon Managed Service for Apache Flink Studio notebooks that are configured to access resources in a particular VPC do not have access to the internet as a default configuration. You can learn how to configure access to the internet for your application in the Internet and Service Access section of the Amazon Managed Service for Apache Flink Developer Guide.
Pricing and billing
How much does Amazon Managed Service for Apache Flink cost?
With Amazon Managed Service for Apache Flink, you pay only for what you use. There are no resources to provision or upfront costs associated with Amazon Managed Service for Apache Flink.
You are charged an hourly rate based on the number of KPUs used to run your streaming application. A single KPU is a unit of stream processing capacity composed of 1 vCPU of compute and 4 GB of memory. Amazon Managed Service for Apache Flink automatically scales the number of KPUs required by your stream processing application as the demands of memory and compute vary in response to processing complexity and the throughput of streaming data processed.
For Apache Flink and Apache Beam applications, you are charged a single additional KPU per application for application orchestration. Apache Flink and Apache Beam applications are also charged for running application storage and durable application backups. Running application storage is used for stateful processing capabilities in Amazon Managed Service for Apache Flink and charged per GB-month. Durable application backups are optional, charged per GB-month, and provide a point-in-time recovery point for applications.
For Amazon Managed Service for Apache Flink Studio, in development or interactive mode, you are charged an additional KPU for application orchestration and 1 KPU for interactive development. You are also charged for running application storage. You are not charged for durable application backups.
For more pricing information, see the Amazon Managed Service for Apache Flink pricing page.
Am I charged for an Amazon Managed Service for Apache Flink application that is running but not processing any data from the source?
For Apache Flink and Apache Beam applications, you are charged a minimum of 2 KPUs and 50 GB running application storage if your Amazon Managed Service for Apache Flink application is running.
For Amazon Managed Service for Apache Flink Studio notebooks, you are charged a minimum of 3 KPUs and 50 GB running application storage if your application is running.
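The minimums above follow from the orchestration overhead described in the pricing answer: one extra KPU for Apache Flink and Apache Beam applications, and two extra KPUs (orchestration plus interactive development) for Studio notebooks in interactive mode. A minimal sketch of that arithmetic follows. The class, enum, and method names are hypothetical, the hourly rate is a caller-supplied placeholder rather than a real price, and charging storage only on the processing KPUs is an inference from the 50 GB minimum stated above.

```java
/** Illustrative billing arithmetic; rates are caller-supplied placeholders, not real prices. */
final class BillingSketch {

    enum AppType { FLINK_OR_BEAM, STUDIO_INTERACTIVE }

    static final int STORAGE_GB_PER_KPU = 50;

    /** Processing KPUs plus the per-application overhead KPUs described in the FAQ. */
    static int billedKpus(AppType type, int processingKpus) {
        if (processingKpus < 1) {
            throw new IllegalArgumentException("a running application uses at least 1 KPU");
        }
        int overheadKpus = (type == AppType.STUDIO_INTERACTIVE) ? 2 : 1;
        return processingKpus + overheadKpus;
    }

    /** Running application storage, inferred here to apply to processing KPUs only. */
    static int runningStorageGb(int processingKpus) {
        return processingKpus * STORAGE_GB_PER_KPU;
    }

    /** KPU charge for a period, given a placeholder hourly rate per KPU. */
    static double kpuCharge(AppType type, int processingKpus, double hours, double ratePerKpuHour) {
        return billedKpus(type, processingKpus) * hours * ratePerKpuHour;
    }
}
```

For an idle application with one processing KPU, billedKpus returns 2 for a Flink or Beam application and 3 for a Studio notebook, matching the minimums stated above. Use the pricing page for actual rates.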
Other than Amazon Managed Service for Apache Flink costs, are there any other costs that I might incur?
Is Amazon Managed Service for Apache Flink available in the AWS Free Tier?
Building Apache Flink applications
What is Apache Flink?
Apache Flink is an open source framework and engine for stream and batch data processing. It makes streaming applications easy to build because it provides powerful operators and solves core streaming problems such as duplicate processing. Apache Flink provides data distribution, communication, and fault tolerance for distributed computations over data streams.
How do I develop applications?
You can start by downloading the open source libraries including the AWS SDK, Apache Flink, and connectors for AWS services. Get instructions on how to download the libraries and create your first application in the Amazon Managed Service for Apache Flink Developer Guide.
What does my application code look like?
You write your Apache Flink code using data streams and stream operators. Application data streams are the data structure you perform processing against using your code. Data continuously flows from the sources into application data streams. One or more stream operators are used to define your processing on the application data streams, including transform, partition, aggregate, join, and window. Data streams and operators can be connected in serial and parallel chains. A short example using pseudo code is shown below.
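// Pseudo code: KinesisStreamSource and KinesisStreamSink are placeholders for connector classes.
// Read raw game events from the input stream.
DataStream<GameEvent> rawEvents = env.addSource(
    new KinesisStreamSource("input_events"));
// Map each event to the game, level, and user fields needed downstream.
DataStream<UserPerLevel> gameStream =
    rawEvents.map(event -> new UserPerLevel(event.gameMetadata.gameId,
        event.gameMetadata.levelId, event.userId));
// Group by game and process each one-minute tumbling window.
gameStream.keyBy(event -> event.gameId)
    .window(TumblingProcessingTimeWindows.of(Time.minutes(1)))
    .apply(...);
gameStream.addSink(new KinesisStreamSink("myGameStateStream"));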
How do I use the Apache Flink operators?
Operators take an application data stream as input and send processed data to an application data stream as output. Operators can be connected to build applications with multiple steps and don’t require advanced knowledge of distributed systems to implement and operate.
What operators are supported?
Amazon Managed Service for Apache Flink supports all operators from Apache Flink that can be used to solve a wide variety of use cases, including map, keyBy, aggregations, windows, joins, and more. For example, the map operator allows you to perform arbitrary processing, taking one element from an incoming data stream and producing another element. The keyBy operator logically organizes data using a specified key so that you can process similar data points together. Aggregations perform processing across multiple keys, such as sum, min, and max. Window Join joins two data streams together on a given key and window.
You can build custom operators if these do not meet your needs. Find more examples in the Operators section of the Amazon Managed Service for Apache Flink Developer Guide. You can find a full list of Apache Flink operators in the Apache Flink documentation.
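As a rough analogy for readers who know the Java standard library, the map, keyBy, and sum steps described above resemble a map followed by a grouped sum over a finite collection. The sketch below uses java.util.stream, not the Flink DataStream API, and the "gameId,score" input format is an assumption made for the illustration. Flink applies the same ideas continuously to unbounded streams.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

/** java.util.stream analogy for map -> keyBy -> sum (not the Flink DataStream API). */
final class OperatorAnalogy {

    /** Input lines use a hypothetical "gameId,score" format. */
    static Map<String, Integer> scorePerGame(List<String> rawEvents) {
        return rawEvents.stream()
            // map: one element in, one transformed element out.
            .map(line -> line.split(","))
            .collect(Collectors.groupingBy(
                fields -> fields[0].trim(),   // keyBy: organize elements by game ID
                TreeMap::new,
                // aggregation: sum the scores within each key
                Collectors.summingInt(fields -> Integer.parseInt(fields[1].trim()))));
    }
}
```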
What integrations are supported in an Amazon Managed Service for Apache Flink application?
You can set up prebuilt integrations provided by Apache Flink with minimal code or build your own integration to connect to virtually any data source. The open source libraries based on Apache Flink support streaming sources and destinations, or sinks, to process data delivery. This also includes data enrichment support through asynchronous I/O connectors. Some of these connectors include the following:
- Streaming data sources: Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Kinesis Data Streams
- Destinations, or sinks: Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, Amazon DynamoDB, Amazon Elasticsearch Service, and Amazon S3 (through file sink integrations)
Can Amazon Managed Service for Apache Flink applications replicate data across streams and topics?
Yes. You can use Amazon Managed Service for Apache Flink applications to replicate data between Amazon Kinesis Data Streams, Amazon MSK, and other systems. An example provided in our documentation demonstrates how to read from one Amazon MSK topic and write to another.
Are custom integrations supported?
You can add a source or destination to your application by building upon a set of primitives enabling you to read and write from files, directories, sockets, or anything that you can access over the internet. Apache Flink provides these primitives for data sources and data sinks. The primitives come with configurations like the ability to read and write data continuously or once, asynchronously or synchronously, and much more. For example, you can set up an application to read continuously from Amazon S3 by extending the existing file-based source integration.
What delivery and processing model do Amazon Managed Service for Apache Flink applications provide?
Apache Flink applications in Amazon Managed Service for Apache Flink use an exactly-once delivery model if an application is built using idempotent operators, including sources and sinks. This means the processed data impacts downstream results once and only once.
By default, Amazon Managed Service for Apache Flink applications use the Apache Flink exactly-once semantics. Your application supports exactly-once processing semantics if you design your applications using sources, operators, and sinks that use Apache Flink’s exactly-once semantics.
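Why idempotence matters here can be shown with a small plain-Java sketch. After a failure, records processed since the last checkpoint may be delivered to the sink again; if each write is an upsert keyed by a deterministic record ID, the replay overwrites rather than duplicates. The class below is a hypothetical in-memory stand-in for a destination, not a Flink sink.

```java
import java.util.HashMap;
import java.util.Map;

/** In-memory stand-in for an idempotent destination (not a Flink sink). */
final class IdempotentSinkSketch {
    private final Map<String, Long> store = new HashMap<>();

    /** Upsert keyed by record ID: writing the same record twice leaves one copy. */
    void write(String recordId, long value) {
        store.put(recordId, value);
    }

    /** Number of distinct records held by the destination. */
    int size() {
        return store.size();
    }

    /** Sum of all stored values, used here as the "downstream result". */
    long total() {
        long sum = 0;
        for (long value : store.values()) {
            sum += value;
        }
        return sum;
    }
}
```

Writing records "a" and "b" and then replaying "b" leaves two records and the same total, so the replay affects the downstream result only once. An append-only destination would instead count "b" twice.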
Do I have access to local storage from my application storage?
How does Amazon Managed Service for Apache Flink automatically back up my application?
Amazon Managed Service for Apache Flink automatically backs up your running application’s state using checkpoints and snapshots. Checkpoints save the current application state and enable Amazon Managed Service for Apache Flink applications to recover the application position to provide the same semantics as a failure-free execution. Checkpoints use running application storage. Checkpoints for Apache Flink applications are provided through Apache Flink’s checkpointing functionality. Snapshots save a point-in-time recovery point for applications and use durable application backups. Snapshots are analogous to Flink savepoints.
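The idea that a checkpoint lets an application "recover the application position" can be illustrated with a toy job that keeps a running sum. The checkpoint captures the operator state together with the source position, so restoring it and replaying from that position gives the same result as a run with no failure. This is a conceptual plain-Java sketch with hypothetical names; Flink's actual checkpointing mechanism is distributed and considerably more involved.

```java
/** Toy running-sum job showing state plus source position in a checkpoint (not Flink API). */
final class CheckpointSketch {

    /** Immutable copy of the job state and the next source offset to read. */
    static final class Checkpoint {
        final int nextOffset;
        final long runningSum;

        Checkpoint(int nextOffset, long runningSum) {
            this.nextOffset = nextOffset;
            this.runningSum = runningSum;
        }
    }

    private int nextOffset = 0;
    private long runningSum = 0;

    /** Consumes source elements from the current offset up to endExclusive. */
    void processUpTo(long[] source, int endExclusive) {
        while (nextOffset < endExclusive && nextOffset < source.length) {
            runningSum += source[nextOffset];
            nextOffset++;
        }
    }

    Checkpoint checkpoint() {
        return new Checkpoint(nextOffset, runningSum);
    }

    /** Rolls state and source position back to the checkpoint; later work is replayed. */
    void restore(Checkpoint checkpoint) {
        nextOffset = checkpoint.nextOffset;
        runningSum = checkpoint.runningSum;
    }

    long runningSum() {
        return runningSum;
    }
}
```

If the job checkpoints after two elements, processes two more, fails, and is restored, it resumes from the third element and ends with the same sum as an uninterrupted run.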
What are application snapshots?
What versions of Apache Flink are supported?
To learn more about supported Apache Flink versions, visit the Amazon Managed Service for Apache Flink Release Notes page. This page also includes the versions of Apache Beam, Java, Scala, Python, and AWS SDKs that Amazon Managed Service for Apache Flink supports.
Can Amazon Managed Service for Apache Flink applications run Apache Beam?
Yes, Amazon Managed Service for Apache Flink supports streaming applications built using Apache Beam. You can build Apache Beam streaming applications in Java and run them in different engines and services including using Apache Flink on Amazon Managed Service for Apache Flink. You can find information regarding supported Apache Flink and Apache Beam versions in the Amazon Managed Service for Apache Flink Developer Guide.
Building Amazon Managed Service for Apache Flink Studio applications in a managed notebook
How do I develop a Studio application?
You can start from the Amazon Managed Service for Apache Flink Studio, Amazon Kinesis Data Streams, or Amazon MSK consoles in a few steps to launch a serverless notebook to immediately query data streams and perform interactive data analytics.
Interactive data analytics: You can write code in the notebook in SQL, Python, or Scala to interact with your streaming data, with query response times in seconds. You can use built-in visualizations to explore the data, view real-time insights on your streaming data from within your notebook, and develop stream processing applications powered by Apache Flink.
Once your code is ready to run as a production application, you can transition with a single step to a stream processing application that processes gigabytes of data per second, without servers.
Stream processing application: Once you are ready to promote your code to production, you can build your code by clicking “Deploy as stream processing application” in the notebook interface or issue a single command in the CLI. Studio takes care of all the infrastructure management necessary for you to run your stream processing application at scale, with auto scaling and durable state enabled, just as in an Amazon Managed Service for Apache Flink application.
What does my application code look like?
What SQL operations are supported?
You can perform SQL operations such as the following:
- Scan and filter (SELECT, WHERE)
- Aggregations (GROUP BY, GROUP BY WINDOW, HAVING)
- Set (UNION, UNION ALL, INTERSECT, IN, EXISTS)
- Order (ORDER BY, LIMIT)
- Joins (INNER, OUTER, Timed Window – BETWEEN, AND, Joining with Temporal Tables – tables that track changes over time)
- Top-N
- Deduplication
- Pattern recognition
Some of these queries, such as GROUP BY, OUTER JOIN, and Top-N, are result-updating queries for streaming data, which means that their results update continuously as the streaming data is processed. DDL statements, such as CREATE, ALTER, and DROP, are also supported. For a complete list of queries and samples, see the Apache Flink Queries documentation.
How are Python and Scala supported?
Apache Flink’s Table API supports Python and Scala through language integration using Python strings and Scala expressions. The supported operations are very similar to the supported SQL operations, including select, order, group, join, filter, and windowing. A full list of operations and samples is included in our developer guide.
What versions of Apache Flink and Apache Zeppelin are supported?
To learn more about supported Apache Flink versions, visit the Amazon Managed Service for Apache Flink Release Notes page. This page also includes the versions of Apache Zeppelin, Apache Beam, Java, Scala, Python, and AWS SDKs that Amazon Managed Service for Apache Flink supports.
What integrations are supported by default in an Amazon Managed Service for Apache Flink Studio application?
- Data sources: Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Kinesis Data Streams, Amazon S3
- Destinations, or sinks: Amazon MSK, Amazon Kinesis Data Streams, and Amazon S3
Are custom integrations supported?
Service Level Agreement
What does the Amazon Managed Service for Apache Flink SLA guarantee?
How do I know if I qualify for an SLA Service Credit?
You are eligible for an SLA Service Credit for Amazon Managed Service for Apache Flink under the Amazon Managed Service for Apache Flink SLA if more than one Availability Zone in which you are running a task, within the same AWS Region, has a Monthly Uptime Percentage of less than 99.9% during any monthly billing cycle. For full details on all the SLA terms and conditions as well as details on how to submit a claim, visit the Amazon Managed Service for Apache Flink SLA details page.