Why is Kinesis Data Firehose creating so many small files in S3?
Last updated: 2020-11-10
I'm trying to push data from Amazon Kinesis Data Firehose to Amazon Simple Storage Service (Amazon S3). However, I noticed that Kinesis Data Firehose is creating many small files in my Amazon S3 bucket. Why is this happening?
Kinesis Data Firehose delivers smaller records than specified (in the BufferingHints API) for the following reasons:
- Compression is enabled.
- Kinesis Data Firehose delivery stream has scaled.
- An Amazon Kinesis data stream is listed as the data source.
Compression is enabled
If compression is enabled on your Kinesis Data Firehose delivery stream, both BufferingHints parameters (SizeInMBs and IntervalInSeconds) are applied before compression. Check these parameters to confirm the configured values.
Kinesis Data Firehose first buffers records until one of the hints is reached, and then compresses the batch. As a result, the files created in Amazon S3 are smaller than the configured buffer size.
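The effect of buffering before compression can be illustrated with a minimal local sketch. It does not call Kinesis Data Firehose; the `buffer_then_compress` helper, the 1 MB limit, and the sample JSON records are hypothetical and only mimic the order of operations described above (fill the buffer with raw data, then compress).

```python
import gzip
import json


def buffer_then_compress(records, size_limit_bytes):
    """Fill a buffer with raw records up to size_limit_bytes, then gzip it.

    Returns (raw_size, compressed_size) in bytes.
    """
    buffered = bytearray()
    for record in records:
        # The size hint is checked against the uncompressed data.
        if len(buffered) + len(record) > size_limit_bytes:
            break
        buffered += record
    compressed = gzip.compress(bytes(buffered))
    return len(buffered), len(compressed)


# Hypothetical, repetitive JSON records (about 5 MB in total).
records = [
    json.dumps({"id": i, "status": "ok", "source": "web"}).encode() + b"\n"
    for i in range(100_000)
]

raw, stored = buffer_then_compress(records, 1 * 1024 * 1024)
print(f"buffered {raw} bytes, stored object is {stored} bytes")
```

The stored object is smaller than the buffer limit even though the buffer was filled, which is the same pattern that appears in the S3 bucket.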
Kinesis Data Firehose delivery stream has scaled
A Kinesis Data Firehose delivery stream scales when a limit increase is requested or when Kinesis Data Firehose scales it automatically. By default, Kinesis Data Firehose automatically scales delivery streams up to a certain limit. This automatic scaling behavior reduces the likelihood of throttling without requiring a limit increase.
When a Kinesis Data Firehose delivery stream scales, the buffering hints are affected. The buffer size (SizeInMBs) of each buffer scales in inverse proportion to the capacity of the delivery stream. For example, if the capacity of Kinesis Data Firehose doubles, the buffer size is halved. If Kinesis Data Firehose scales up to four times, the buffer size is reduced to one quarter of the configured buffer size.
The number of parallel buffers within the Kinesis Data Firehose delivery stream also increases in proportion, and data is delivered simultaneously from all of these buffers. For example, an unscaled delivery stream buffers the data and creates a single file based on the buffer size limit. If Kinesis Data Firehose scales to double the capacity, then two separate channels create files within the same time interval. If Kinesis Data Firehose scales up to four times, then four different channels create four files in S3 during the same time interval.
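The relationship between scale factor, buffer size, and number of parallel buffers can be sketched as simple arithmetic. This is a simplified model of the behavior described above, not the service's actual implementation; the `scaled_buffers` function and the 40 MB buffer size are assumptions chosen for illustration.

```python
def scaled_buffers(size_in_mbs, scale_factor):
    """Return (parallel_buffers, per_buffer_mb) for a given scale factor.

    Simplified model: the number of buffers grows with the scale factor,
    and each buffer shrinks in inverse proportion.
    """
    if scale_factor < 1:
        raise ValueError("scale_factor must be at least 1")
    return scale_factor, size_in_mbs / scale_factor


for factor in (1, 2, 4):
    buffers, per_buffer_mb = scaled_buffers(40, factor)
    total_mb = buffers * per_buffer_mb
    print(f"{factor}x: {buffers} file(s) of {per_buffer_mb} MB, {total_mb} MB in total")
```

In this model the total data delivered per interval does not change. Only the number and size of the files change.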
Example: Calculating the data stream limit
Consider an Amazon Kinesis data stream that has an initial throughput (t) and creates a file of size (s) every (x) seconds. After scaling to a throughput of 4t, the Kinesis Data Firehose delivery stream creates files of size s/4 within the same time interval, and four parallel buffers deliver the data. As a result, the total data delivered by Kinesis Data Firehose in each interval remains about the same:
4 * (s/4) = s
For example, consider a Kinesis data stream that has an initial throughput (t) of 5 MB/sec and creates a file (s) of 40 MB every 60 seconds (x). If the Kinesis data stream is scaled up to 20 MB/sec (four times), then the stream creates four different files of approximately 10 MB each in the same interval:
4 * (40MB/4) = 40MB
Therefore, the total data size that's delivered by the Kinesis Data Firehose delivery stream is still approximately 40 MB.
Check whether the Kinesis Data Firehose delivery stream has scaled beyond the default limit. To view the current limit of your Kinesis Data Firehose delivery stream, check the following Amazon CloudWatch metrics:
- BytesPerSecondLimit
- PutRequestsPerSecondLimit
- RecordsPerSecondLimit
If the values of these metrics differ from the default quota limits, then the Kinesis Data Firehose delivery stream has scaled.
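The comparison can also be scripted. The following is a minimal sketch, assuming a boto3 CloudWatch client and the limit metrics that Kinesis Data Firehose publishes in the AWS/Firehose namespace (BytesPerSecondLimit, PutRequestsPerSecondLimit, and RecordsPerSecondLimit). The helper names, the 15-minute lookback window, and the `default_limits` values are hypothetical. Default quotas vary by Region, so supply the values that apply to your account.

```python
from datetime import datetime, timedelta, timezone


def latest_limit(cloudwatch, stream_name, metric_name, lookback_minutes=15):
    """Return the most recent value of a limit metric, or None if no data."""
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=lookback_minutes)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/Firehose",
        MetricName=metric_name,
        Dimensions=[{"Name": "DeliveryStreamName", "Value": stream_name}],
        StartTime=start,
        EndTime=end,
        Period=60,
        Statistics=["Maximum"],
    )
    datapoints = sorted(response["Datapoints"], key=lambda d: d["Timestamp"])
    return datapoints[-1]["Maximum"] if datapoints else None


def scaled_metrics(cloudwatch, stream_name, default_limits):
    """Return the limit metrics whose current value differs from the default."""
    scaled = {}
    for metric_name, default in default_limits.items():
        current = latest_limit(cloudwatch, stream_name, metric_name)
        if current is not None and current != default:
            scaled[metric_name] = current
    return scaled


# Example usage (requires boto3 and AWS credentials; values are placeholders):
# import boto3
# defaults = {"BytesPerSecondLimit": 1_000_000.0}  # replace with your Region's quota
# print(scaled_metrics(boto3.client("cloudwatch"), "my-delivery-stream", defaults))
```

An empty result means that none of the checked limits differ from the defaults you supplied. It can also mean that no datapoints were found in the lookback window.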
Kinesis Data Stream is listed as the data source
When a Kinesis data stream is listed as a data source of Kinesis Data Firehose, Data Firehose scales internally. By default, Kinesis Data Firehose tries to meet the volume capacity of the Kinesis data stream. This scaling causes a change in the buffering size and can lead to the delivery of smaller sized records.
Note: Buffering hint options are treated as hints. As a result, Kinesis Data Firehose might choose to use different values to optimize the buffering.