Amazon Data Firehose Documentation
Amazon Data Firehose is a fully managed service designed to load streaming data into data stores and analytics tools. It allows you to capture, transform, and load massive volumes of streaming data from hundreds of thousands of sources into Amazon Simple Storage Service (Amazon S3), Amazon Redshift, Amazon OpenSearch Service, generic HTTP endpoints, and third-party service providers, enabling near-real-time analytics and insights.
Delivery streams
A delivery stream is the underlying entity of Data Firehose. You can use Data Firehose by creating a delivery stream and then sending data to it.
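As a minimal sketch of this workflow, the following Python snippet sends a single record to an existing delivery stream using boto3. The stream name "my-delivery-stream" and the record payload are hypothetical; the snippet assumes the stream has already been created and that AWS credentials are configured.

```python
import boto3

firehose = boto3.client("firehose")

# Send one record to an existing delivery stream.
response = firehose.put_record(
    DeliveryStreamName="my-delivery-stream",  # hypothetical stream name
    Record={"Data": b'{"event": "page_view", "user": "alice"}\n'},
)
print(response["RecordId"])  # identifier Firehose assigns to the record
```

For higher throughput, PutRecordBatch (firehose.put_record_batch) sends up to 500 records in one call.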
Key features
Launch and configuration
From the AWS Management Console, you can launch Amazon Data Firehose and create a delivery stream that loads data into Amazon S3, Amazon Redshift, Amazon OpenSearch Service, HTTP endpoints, or third-party service providers. You send data to the delivery stream by calling the Firehose API or by running the Linux agent we provide on the data source. Data Firehose then loads the data into the specified destinations.
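Delivery streams can also be created programmatically through the same API. Below is a sketch of creating a stream that loads into Amazon S3; the role and bucket ARNs are placeholders, and the IAM role must grant Firehose permission to write to the bucket.

```python
import boto3

firehose = boto3.client("firehose")

# Create a delivery stream that accepts direct PUTs and delivers to S3.
firehose.create_delivery_stream(
    DeliveryStreamName="my-delivery-stream",
    DeliveryStreamType="DirectPut",  # producers call the API directly
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
        "BucketARN": "arn:aws:s3:::my-destination-bucket",
    },
)
```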
Load new data
You can specify a batch size or batch interval to control how quickly data is uploaded to destinations. For example, you can set the batch interval to 60 seconds if you want to receive new data within 60 seconds of sending it to your delivery stream. You can also specify whether data should be compressed; the service supports common compression algorithms such as GZIP, ZIP, and Snappy. Batching and compressing data before upload lets you control how quickly you receive new data at the destinations.
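In the API, batching is expressed as buffering hints on the destination configuration. The fragment below (with illustrative values) would sit inside the ExtendedS3DestinationConfiguration shown earlier; Firehose flushes a batch when either threshold is reached, whichever comes first.

```python
# Buffering and compression settings for an S3 destination.
buffering_and_compression = {
    "BufferingHints": {
        "IntervalInSeconds": 60,  # deliver at least every 60 seconds...
        "SizeInMBs": 5,           # ...or as soon as 5 MB accumulate
    },
    "CompressionFormat": "GZIP",  # also: ZIP, Snappy, HADOOP_SNAPPY
}
```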
Elastic scaling to handle varying data throughput
The service is designed so that, once launched, your delivery streams scale up and down to handle large amounts of input data and maintain data latency at the levels you specify for the stream, within service limits.
Apache Parquet or ORC format conversion
Data Firehose supports columnar data formats such as Apache Parquet and Apache ORC, which can be used for storage and analytics with other AWS or third-party services. Data Firehose can convert the format of incoming data from JSON to Parquet or ORC before storing the data in Amazon S3, which helps you save on storage and analytics costs.
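Format conversion is configured on the S3 destination and requires a schema, typically defined by an AWS Glue table. The fragment below sketches the relevant configuration, passed as the DataFormatConversionConfiguration field of ExtendedS3DestinationConfiguration; the role, database, and table names are placeholders.

```python
# Convert incoming JSON records to Parquet using a Glue table's schema.
format_conversion = {
    "Enabled": True,
    "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
    "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
    "SchemaConfiguration": {
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
        "DatabaseName": "my_glue_database",  # hypothetical Glue database
        "TableName": "my_glue_table",        # hypothetical Glue table
        "Region": "us-east-1",
    },
}
```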
Deliver partitioned data to S3
You can dynamically partition your streaming data before delivery to Amazon S3 using static or dynamically defined keys such as “customer_id” or “transaction_id”. Data Firehose groups data by these keys and delivers it into key-unique Amazon S3 prefixes, helping you perform high-performance, cost-efficient analytics in Amazon S3.
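The sketch below shows how dynamic partitioning on a “customer_id” field might be expressed in the S3 destination configuration, with the key extracted from each JSON record by a JQ query; the prefix layout and error prefix are illustrative.

```python
# Partition delivered objects by customer_id extracted from each record.
dynamic_partitioning = {
    "DynamicPartitioningConfiguration": {"Enabled": True},
    "Prefix": "customers/customer_id=!{partitionKeyFromQuery:customer_id}/",
    "ErrorOutputPrefix": "delivery-errors/",
    "ProcessingConfiguration": {
        "Enabled": True,
        "Processors": [
            {
                "Type": "MetadataExtraction",
                "Parameters": [
                    {
                        "ParameterName": "MetadataExtractionQuery",
                        "ParameterValue": "{customer_id: .customer_id}",
                    },
                    {
                        "ParameterName": "JsonParsingEngine",
                        "ParameterValue": "JQ-1.6",
                    },
                ],
            }
        ],
    },
}
```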
Integrated data transformations
You can configure Amazon Data Firehose to prepare your streaming data before it is loaded to data stores. Select an AWS Lambda function from the Amazon Data Firehose delivery stream configuration tab in the AWS Management Console. Amazon Data Firehose applies that function to every input data record and loads the transformed data to the destinations. Pre-built Lambda blueprints are provided for converting common data sources, such as Apache logs and system logs, to JSON and CSV formats. You can use these blueprints unchanged, customize them further, or write your own custom functions. You can also configure Amazon Data Firehose to automatically retry failed jobs and back up the raw streaming data.
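A transformation function follows a fixed contract: Firehose invokes it with a batch of base64-encoded records, and each record must be returned with the same recordId and a result of Ok, Dropped, or ProcessingFailed. The following is a minimal sketch; the added "processed" field is purely illustrative.

```python
import base64
import json

def lambda_handler(event, context):
    """Sketch of a Firehose record-transformation Lambda."""
    output = []
    for record in event["records"]:
        # Decode the incoming record and apply an illustrative change.
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processed"] = True

        # Return the record re-encoded, keeping its original recordId.
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",  # or "Dropped" / "ProcessingFailed"
            "data": base64.b64encode(
                (json.dumps(payload) + "\n").encode("utf-8")
            ).decode("utf-8"),
        })
    return {"records": output}
```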
Support for multiple data destinations
Amazon Data Firehose currently supports Amazon S3, Amazon Redshift, Amazon OpenSearch Service, generic HTTP endpoints, and certain third-party providers as destinations. You can specify the destination Amazon S3 bucket, Amazon Redshift table, Amazon OpenSearch Service domain, HTTP endpoint URL, or service provider where the data should be loaded.
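For example, a stream targeting a generic HTTP endpoint might be created as sketched below; the URL and ARNs are placeholders, and failed deliveries are backed up to an S3 bucket.

```python
import boto3

firehose = boto3.client("firehose")

# Create a delivery stream that posts records to an HTTP endpoint,
# backing up only failed deliveries to S3.
firehose.create_delivery_stream(
    DeliveryStreamName="my-http-stream",
    DeliveryStreamType="DirectPut",
    HttpEndpointDestinationConfiguration={
        "EndpointConfiguration": {
            "Url": "https://example.com/ingest",  # hypothetical endpoint
            "Name": "example-endpoint",
        },
        "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
        "S3BackupMode": "FailedDataOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::111122223333:role/firehose-delivery-role",
            "BucketARN": "arn:aws:s3:::my-backup-bucket",
        },
    },
)
```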
Optional encryption
Amazon Data Firehose provides you the option to have your data encrypted after it is uploaded to the destination. As part of the delivery stream configuration, you can specify an AWS Key Management Service (AWS KMS) encryption key.
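In the S3 destination configuration, the key is supplied as sketched below; the key ARN is a placeholder for a key you own.

```python
# Encrypt delivered objects with a customer managed KMS key.
encryption = {
    "EncryptionConfiguration": {
        "KMSEncryptionConfig": {
            "AWSKMSKeyARN": (
                "arn:aws:kms:us-east-1:111122223333:"
                "key/1234abcd-12ab-34cd-56ef-1234567890ab"
            )
        }
    }
}
```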
Metrics for monitoring performance
Amazon Data Firehose exposes several metrics through the console and through Amazon CloudWatch, including the volume of data submitted, the volume of data uploaded to the destination, the time from source to destination, the delivery stream limits, the number of throttled records, and the upload success rate. You can use these metrics to monitor the health of your delivery streams, take actions such as modifying destinations, set alarms when you approach the limits, and confirm that the service is ingesting data and loading it to destinations.
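These metrics are published under the AWS/Firehose CloudWatch namespace (for example IncomingBytes, DeliveryToS3.Success, and ThrottledRecords) and can be read with the standard CloudWatch API. A sketch, assuming the stream name used earlier:

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

# Read the stream's incoming-data volume over the last hour,
# in five-minute buckets.
now = datetime.datetime.now(datetime.timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Firehose",
    MetricName="IncomingBytes",
    Dimensions=[{"Name": "DeliveryStreamName", "Value": "my-delivery-stream"}],
    StartTime=now - datetime.timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Sum"],
)
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```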
Additional Information
For additional information about service controls, security features and functionalities, including, as applicable, information about storing, retrieving, modifying, restricting, and deleting data, please see https://docs.aws.amazon.com/index.html. This additional information does not form part of the Documentation for purposes of the AWS Customer Agreement available at http://aws.amazon.com/agreement, or other agreement between you and AWS governing your use of AWS’s services.