AWS Architecture Blog
Designing a High-volume Streaming Data Ingestion Platform Natively on AWS
The total global data storage is projected to exceed 200 zettabytes by 2025. This exponential growth of data demands increased vigilance against cybercrimes. Emerging cybersecurity trends include increasing service attacks, ransomware, and critical infrastructure threats. Businesses are changing how they approach cybersecurity and are looking for new ways to tackle these threats. In the past, they have relied on internal IT or engaged a managed security services provider (MSSP) to monitor and prevent unauthorized access and attacks.
An end-to-end analytics solution should ingest and process log data streaming from various computing and IoT devices. It can then make processed data available to analytics systems users in near-real-time. However, the sheer volume of data in the future makes this difficult to address in a reliable and cost-effective manner.
In this blog post, we present three approaches for a high-volume log data ingestion and processing platform natively on Amazon Web Services (AWS). We also compare the pros and cons of each. We’ll discuss factors to consider when evaluating the different options and their associated flexibility, to take full advantage of AWS. We will showcase a fictional use case for a top MSSP who ingests high volumes of logs from security devices to cloud. This MSSP also performs downstream analytics and threat detection modeling.
The options we present here have a log collection platform (LCP) on-premises. It collects logs from security devices and sensors, performs necessary translations and tokenization, and pushes compressed log files to the processing tier on cloud. The collection platform can also be modernized to have the IoT-enabled devices send logs to AWS IoT services. This will push the data to Amazon Kinesis, a managed service for collecting and analyzing streaming data.
Approach 1: Amazon Kinesis for log ingestion and format conversion
Figure 1 illustrates a comprehensive solution that uses managed and serverless services on AWS.
1. LCP will invoke a scalable producer application for Amazon Kinesis Data Streams running on AWS Fargate behind an Application Load Balancer. The producer application will use the Amazon Kinesis Producer Library (KPL). KPL aggregates and batches data records to make ingestion into the data stream more efficient. The application may provide compressed records to the KPL to have it manage object compression.
The application can be set up as an HTTP endpoint that receives log files and processes them using KPL. Customer ID sent as part of an HTTP request header can be used to maintain affinity. The application can run in a Docker container, which is orchestrated by Amazon ECS on AWS Fargate. A target tracking scaling policy can manage the number of parallel running data ingestion containers to manage scalability of the ingestion process.
2. Amazon Kinesis Scaling Utility can be used to scale data streams up or down by a count, or as a percentage of the total fleet. The scaling utility archive file can be imported as a library to AWS Lambda. It will automatically manage the number of shards in the stream based on the observed PUT or GET rate of the stream. The combination of customer ID and security device ID may be used to define the partition key.
3. Records uploaded to the stream by the producer application will be consumed by Lambda. It will perform gateway transformations (required by all downstream consumers) and the normalization of record format. Any additional consumer level transformations may be handled separately, associated with respective consumers.
A combination of batch window and batch size configurations can improve efficiency of function invocations. Batch windows are the maximum amount of time in seconds to gather records before invoking the function. Batch size is the number of records to send to the function in each batch. The Lambda function will throttle sending records to Amazon Kinesis Data Firehose. Error handling will be accomplished via retries with a smaller batch size, with number of retries limited as appropriate. It will discard records that are too old.
An Amazon Simple Queue Service (SQS) queue can be configured as a failed-event destination for further offline analysis. A Lambda function can read from the error SQS queue to do basic checks and determine appropriate follow-up actions. This can be an initiated email for additional investigation or a command to discard the message.
4. Output of transformations by Lambda will be saved to the short term (hot) storage Amazon S3 bucket via Kinesis Data Firehose. This can efficiently handle Parquet format conversion required by downstream analytics applications. Kinesis Data Firehose delivery streams will be created per customer and configured with associated AWS Glue Data Catalog table, to perform parquet format conversion.
5. AWS Glue jobs will be used to consolidate and write larger files to the long term (cold) storage bucket.
6. The data in the cold storage bucket will be accessed by internal SOC analysts for threat detection and mitigation.
7. The data in cold storage buckets will also be accessed by end customers via dashboards in Amazon QuickSight.
8. This architecture also provides additional options to modernize streaming analytics using Amazon Kinesis Data Analytics or AWS Glue streaming jobs as appropriate.
While this architecture proposes a fully managed, end-to-end solution, the sheer volume of log messages may drive up the total cost of the solution. This is especially true for Kinesis Data Streams and Kinesis Data Firehose costs.
Approach 2: Containerized application on AWS Fargate for ingestion and Amazon Kinesis for format conversion
An alternative approach shown in Figure 2 replaces the gateway Kinesis Data Streams and transformations, with a containerized application on Fargate. Conversion to Parquet format and writing to the S3 bucket is still handled by Kinesis Data Firehose.
1. LCP will upload log files to a raw storage bucket in Amazon S3.
2. A Lambda function will process Event Notifications from the raw data storage bucket. It can insert Amazon S3 object pointers to a Kinesis Data Stream partitioned by Customer ID and Device ID.
3. The producer application will retrieve the Event Notifications from the Data Stream and retrieve corresponding log files from S3. It will perform initial aggregations and transformations, and output to Kinesis Data Firehose. The application can run in a Docker container that is orchestrated by Amazon ECS on Fargate. A target tracking scaling policy can manage the number of parallel running data ingestion containers, to manage scalability of the ingestion process. ECS cluster capacity can be scaled up or down based on Amazon CloudWatch alarms.
4. Kinesis Data Firehose converts to Parquet format, zips the data, and persists to a short-term storage bucket in S3. This is backed by Glue Data Catalog.
Steps 5, 6 and 7 perform consolidation and availability of the processed data to downstream consumers, as in the previous approach.
This option uses the built-in capabilities of Kinesis Data Firehose to transform to Parquet format and deliver to S3. Note that higher costs associated with the service may still be cost prohibitive for larger data volumes.
Approach 3: Containerized application on AWS Fargate for ingestion and format conversion
Figure 3 uses a containerized application running on Fargate for both gateway transformations. This app also provides conversion to Parquet format before writing the files to a short term (hot) storage bucket. All the other steps are the same as in option 2.
This option offers the least expensive way to transform, aggregate, and enrich the incoming log records, as well as convert them to Parquet format. But it comes with additional overhead for custom development of format conversion, checkpointing, error handling, and application management. Evaluate based on your business needs and workflow.
In this post, we discussed multiple approaches to design a platform on AWS to ingest and process high-volume security log records. We compared the pros and cons for each option. Amazon Kinesis is a fully managed and scalable service that helps easily collect, process, and analyze video and data streams in real time. A solution primarily based on Kinesis may become cost prohibitive due to large data volumes. Consider alternate approaches that use containerized applications on AWS Fargate. The trade-off would be the ability for custom development versus application management overhead.
To improve your security log analysis solution, explore one of the approaches we illustrate and customize as appropriate to fit your unique needs.