AWS Big Data Blog

Ingest streaming data into Amazon OpenSearch Service within the privacy of your VPC with Amazon Data Firehose

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details.
February 9, 2024: Amazon Kinesis Data Firehose has been renamed to Amazon Data Firehose. Read the AWS What’s New post to learn more.


Today we are adding a new Amazon Data Firehose feature to set up VPC delivery to your Amazon OpenSearch Service domain from the Firehose. If you have been managing a custom application on Amazon Kinesis Data Streams to keep traffic private, you can now use Firehose and load your data into an Amazon OpenSearch Service endpoint in a VPC without having to invest, operate, and scale ingestion and delivery infrastructure. You can start using this new feature from Firehose console, AWS CLI, and API by selecting Amazon OpenSearch Service as the destination, the specific domain with VPC access, and setting the VPC configuration with subnets and the optional security groups.

Before this feature

Amazon OpenSearch Service domains can have public or private endpoints. Public endpoints are backed by IP addresses on the public internet. Private endpoints are backed by IP addresses within the IP space of your VPC.

If you have been using an Amazon OpenSearch Service VPC endpoint, you most likely use Kinesis Data Streams or similar soultion to ingest streaming data. This means running a custom application on the stream that delivers it to the Amazon OpenSearch Service VPC domain. You likely had to perform the following actions:

  • Implement buffering
  • Format conversions
  • Perform compression
  • Apply transformation
  • Manage backup
  • Handle transient delivery failures

Additionally, you have to build, scale, monitor, update, and maintain this custom application.

Firehose delivery to Amazon OpenSearch Service VPC endpoint

Firehose can now deliver data into an Amazon OpenSearch Service VPC endpoint. This provides a secure and easy way to ingest, transform, and deliver streaming data. You don’t need to worry about managing your data ingestion and delivery infrastructure. With this new feature, Firehose enables additional secure communication to Amazon OpenSearch Service VPC endpoints. Amazon OpenSearch Service endpoints that live within a VPC give you an extra layer of security.

How it works

When you create a Firehose delivery stream that delivers data to an Amazon OpenSearch Service VPC endpoint, Firehose creates an Elastic Network Interface (ENI) in each subnet you select. If you only use one Availability Zone, Firehose places an endpoint into only one subnet. Similarly, when you create an Amazon OpenSearch Service VPC endpoint, it creates endpoints in the subnets you chose.  Firehose uses ENI to deliver the data to your Amazon OpenSearch Service ENI, all inside your VPC. The following screenshot outlines the resulting architecture with a single subnet.

For this walkthrough, you have two security groups:

  • kdf-sec-grp for your Firehose endpoint
  • es-sec-grp for your Amazon OpenSearch Service endpoint

To let Firehose access your Amazon OpenSearch Service VPC endpoint, security group es-sec-grp needs to allow the ENI that Firehose created to make HTTPS calls. Firehose scales the ENIs automatically to meet the throughput requirements. As Firehose scales ENIs, the outbound rules of the enclosing security group kdf-sec-grp control the data stream. You should configure the Amazon OpenSearch Service security group (es-sec-grp) to allow HTTPS traffic from the Firehose security group (kdf-sec-grp). The Firehose security group needs to allow outbound HTTPS traffic, and its destination is the Amazon OpenSearch Service security group. With Firehose VPC delivery, you do not need to make the Firehose security group open to outside traffic.

You can also use the same security group for Firehose and Amazon OpenSearch Service endpoints. If you use the same security group for both, make sure the security group inbound rule allows HTTPS traffic.

For your existing delivery streams, you can change the destination endpoint. The new destination must be accessible within the same VPC, subnets, and security groups. Changing either of the VPC, subnets, and security groups requires you to recreate a delivery stream.

All existing Firehose limits apply to this capability. For example, you can increase the default 50 delivery streams per account by submitting a quota increase request. Also, KFirehose creates one or more ENIs per VPC destination subnet per delivery stream. Firehose automatically scales the number of ENIs as needed based on the actual throughput. The default throughput limit per delivery stream is 5 MB/second (dependent on Region). You can request an increase to this limit by submitting a support case.

You need to make sure you have enough ENIs available. By default, VPC has a quota of 5000 ENIs per Region. For more information, see Amazon VPC Quotas.

The advantage of using a managed service like Firehose is that you can focus on the value of your data and not the underlying plumbing. You can configure the frequency of data delivery from your delivery stream to your Amazon OpenSearch Service domain. Firehose buffers incoming data before delivering it to Amazon ES. You can configure the values for Amazon OpenSearch Service buffer size (1 MB–100 MB) or buffer interval (60–900 seconds), and the condition satisfied first triggers data delivery to Amazon OpenSearch Service. In case data delivery fails for an Amazon OpenSearch Service destination, you can specify a retry duration between 0 and 7,200 seconds when you create the delivery stream. If data delivery to your Amazon OpenSearch Service endpoint fails, Firehose retries data delivery for the specified time duration. After the retrial period, Firehose skips the current batch of data and moves on to the next batch. Skipped documents go to your Amazon S3 bucket elasticsearch_failed folder, which you can use for manual backfill.

For more information about sizing, see Get started with Amazon OpenSearch Service: T-shirt-size your domain.

Solution overview

To show you how to use this new feature, this post uses stock demo data available on the Firehose console to deliver to an Amazon OpenSearch Service endpoint in VPC. The following diagram illustrates the workflow.

This use case simulates a producer sending stock ticker data to the delivery stream (A). You use an AWS Lambda function (B) to add a timestamp to the stock records so that you can create Kibana visualization. Firehose streams the stock records to the Amazon OpenSearch Service endpoint (C) in your VPC. Finally, you can visualize the data using Kibana (D).

This post uses the Amazon Management Console to implement this solution, but you can also use AWS CLI.

Creating security groups

Start by creating two security groups: one for the Amazon OpenSearch Service VPC endpoint (es-sec-grp) and another for the delivery stream (kdf-sec-grp). Create security groups without any rules first. After you have created them, set the inbound and outbound rules. The following table summarizes these rules.

Creating an Amazon OpenSearch Service VPC endpoint

To create an Amazon OpenSearch Service endpoint in VPC, complete the following steps:

  1. On the Amazon OpenSearch Service console, choose Create a new Domain.
  2. For Deployment Type and Latest Version, choose Development and Testing.
  3. Choose Next.
  4. Give your Amazon OpenSearch Service endpoint a name.
  5. Select your instance type.

This post uses m5.xlarge.search. For production environments, select the appropriately sized instance type. For this post, leave the number of nodes at 1, though best practice is to set it to 2.

  1. Set EBS storage size per node to 100 GiB.
  2. Leave the rest of the settings at their defaults and choose Next.
  3. Select the VPC and private subnet for your Amazon OpenSearch Service endpoint and the security group for Amazon OpenSearch Service that you created previously (es-sec-grp).
  4. To access Kibana, choose fine-grained access.
  5. Choose Create Master User.

In this post, we are using internal user database enabled with HTTP basic authentication. For production environments, use IAM roles and configure the appropriate fine-grained access. For more information, see Fine-Grained Access Control in Amazon OpenSearch Service.

  1. Choose Allow open access to Domain.

Security groups already enforce IP-based access policies. This step opens access to your Amazon OpenSearch Service endpoint to resources in your VPC, and your Amazon OpenSearch Service endpoint is not accessible to the internet. For an additional layer of security in your Amazon OpenSearch Service endpoint, use access policies that specify IAM users or roles. For more information about controlling access to your domains, see Identity and Access Management in Amazon OpenSearch Service.

  1. Choose Next.
  2. Review your settings and choose Confirm.

The following screenshot shows an example of what your Amazon OpenSearch Service endpoint VPC settings should look like.

Creating a Lambda function for record transformation

Create a Lambda function to add a timestamp to the data feed. Complete the following steps:

  1. On the Lambda console, choose Create Function.
  2. Choose Author from scratch.
  3. Name your function; for example, tmakAddTSToStream.
  4. Choose Python 3.7 as your runtime.
  5. Choose Create.

The following code is for your Lambda function (under the basic settings section, change the timeout from 3 sec to 45 sec):

import base64
import json
from datetime import datetime

def lambda_handler(event, context):
    
    send_back = []
    now = datetime.utcnow().isoformat()

    for record in event['records']:
        stock_rec = json.loads(base64.b64decode(record['data']))
        stock_rec["timestamp"] = now
       
        record_w_ts = {
                'recordId': record['recordId'],
                'result': 'Ok',
                'data': base64.b64encode(json.dumps(stock_rec).encode('utf-8') + b'\n').decode('utf-8')
        }
        send_back.append(record_w_ts)

    return {'records': send_back} 

Creating a Firehose delivery stream

To create your delivery stream, complete the following steps:

  1. On the Firehose console, under Data Firehose, choose Create Delivery Stream.
  2. Enter a name for your stream; for example, tmak-kdf-stock-delivery-stream.
  3. For source, choose Direct PUT or other sources.
  4. Choose Next.
  5. For Data transformation, choose Enabled.
  6. Choose the Lambda function you created.
  7. Choose Next.
  8. Choose Amazon OpenSearch Service as the destination for your delivery stream.
  9. For Index, enter stockdata.

The VPC section populates automatically. Make sure you use the security group you created for Firehose (kdf-sec-grp).

  1. For Backup Mode, choose Failed records only.

You can select an existing S3 bucket or create a new one. The following screenshot shows an example of your delivery stream settings.

  1. Choose Next.
  2. Review the buffering settings and set any tags to identify your stream.

A delivery stream that delivers to VPC destinations needs permissions to manage ENIs, list VPCs, and subnets. The console gives you the option to create a new role based on a template that includes all the needed permissions. You can also use an existing role if you already created one.

  1. Choose Next.
  2. Review the settings and choose Create Stream.

It may take up to a few minutes to see the stream status show as Active. See the following screenshot.

On the Amazon EC2 console, under Network and Security, you can see the endpoints created in your VPC by Firehose and Amazon ES. See the following screenshot.

Configuring Kibana fine-grained access for Firehose

You need to give Firehose permissions to deliver stock data to your Amazon OpenSearch Service endpoint. You can accomplish this via the Kibana console or API. For more information, see API on the Open Distro for Elasticsearch website.

For more information about controlling access to your Amazon OpenSearch Service endpoint, see How to Control Access to Your Amazon OpenSearch Service Domain.

Because your Amazon OpenSearch Service endpoint is in the VPC to access Kibana, you must first connect to the VPC. This process varies by network configuration, but likely involves connecting to a VPN or corporate network. For this post, create a remote desktop EC2 instance public subnet of your VPC. The newly created security group (rdp-sec-grp) protects the instance. You can modify the es-sec-grp security group and allow inbound RDP traffic from rdp-sec-grp so you can access the Kibana URL. The following diagram illustrates this architecture.

Firehose uses the delivery role to sign HTTP (Signature Version 4) requests before sending the data to the Amazon OpenSearch Service endpoint. You manage Amazon OpenSearch Service fine-grained access control permissions using roles, users, and mappings. This section describes how to create roles and set permissions for Firehose.

The roles you create in this section are different from IAM roles. For more information, see Key Concepts.

Complete the following steps:

  1. Navigate to Kibana (you can find the URL on the Amazon OpenSearch Service console).
  2. Enter the master user and password that you set up when you created the Amazon OpenSearch Service endpoint.
  3. Under Security, choose Roles.
  4. Choose Add New Role.
  5. Name your role; for example, firehose-role.
  6. For cluster permissions, add cluster_composite_ops and cluster_monitor.
  7. Under Index permissions, choose Index Patterns and enter stockdata*.
  8. Under Permissions, add three action groups: crud, create_index, and manage.
  9. Choose Save Role Definition.

In the next step, you map the IAM role that Firehose uses to the role you just created.

  1. Under Security, choose Role Mappings.
  2. Choose the role you just created (firehose-role).
  3. For Backend Roles, choose Add Backend Role.
  4. Enter the IAM ARN of the role Firehose uses: arn:aws:iam::123456789012:role/firehose_stream_role_name.

You can find your delivery stream ARN on the Firehose console.

Streaming stock data through Firehose

To stream your stock data, complete the following steps:

  1. On the Firehose console, choose the stream you created.
  2. Choose Test with demo data.
  3. Choose Start sending demo data.

If everything is working, you see message Demo data is being sent to your delivery stream. Wait a few minutes before you choose Stop sending demo data.

Analyzing and visualizing data

To analyze and visualize your data, complete the following steps:

  1. On the Kibana console, choose Management.
  2. Choose Index patterns.
  3. For Index pattern, enter stockdata*.
  4. Choose Next.
  5. For the Time filter field, choose timestamp.
  6. Choose Visualize.
  7. Create a new visualization and choose Line.
  8. For Index pattern, choose stockdata*.
  9. For Y-Axis, choose Aggregation=Average and Field=price.
  10. For X-Axis, choose Aggregation=Data Histogram, Field=timestamp, and Interval=seconds.
  11. Under X-Axis, choose Add Sub-buckets.
  12. Choose Split Series.
  13. Set Sub-Aggregation=Terms and Field=ticker_symbol.keyword.
  14. Choose Apply Changes.

The following screenshot shows an example visualization.

You can see the raw data by choosing Discover on the Kibana dashboard. See the following screenshot.

Summary

This post demonstrated how you can move an Amazon OpenSearch Service endpoint inside your VPC with Firehose. Additionally, you do not need to enable and secure public access to your Amazon OpenSearch Service endpoint. If you have been reluctant to expose your Amazon OpenSearch Service endpoint to the internet but want to stream data, you can now do so with Firehose.


About the Author

Tarik Makota is a Principal Solutions Architect with the Amazon Web Services. He provides technical guidance, design advice and thought leadership to AWS’ customers across US Northeast. He holds an M.S. in Software Development and Management from Rochester Institute of Technology.