What are the major components of an Amazon OpenSearch pipeline?

An Amazon OpenSearch Ingestion pipeline consists of three major components:

Source is the input component of a pipeline. It defines the mechanism through which a pipeline consumes records. The source can consume records either by receiving data over HTTP/S or by reading from external third-party endpoints.
Processors are intermediate processing units that can filter, transform, and enrich records into a desired format before publishing them to the sink. The processor is an optional component of a pipeline. If you don't define a processor, records are published in the format defined in the source. You can have more than one processor. Processors are executed in the order that you define them in the pipeline.
Sink is the output component of a pipeline. It defines one or more destinations to which a pipeline publishes records. A sink can also be another pipeline, which allows you to chain multiple pipelines together.

Amazon OpenSearch Service

Amazon OpenSearch Ingestion

Ingest, transform and route data at scale to Amazon OpenSearch Domains and Serverless collections

Browse our developer guide

Why Amazon OpenSearch Service Ingestion?

Amazon OpenSearch Ingestion is a feature of Amazon OpenSearch Service that allows you to ingest, filter, transform, enrich, and route data to an Amazon OpenSearch domain or Serverless collection. Amazon OpenSearch Ingestion is capable of ingesting data from a wide variety of sources and has a rich ecosystem of built-in processors to take care of your most complex data transformation needs. Amazon OpenSearch Ingestion is serverless in nature and will scale automatically to meet the requirements of your most demanding workloads, helping you focus on your business logic while abstracting away the complexity of managing complex data pipelines for your observability and security use cases.

Benefits of Amazon OpenSearch Service

Realize storage cost reductions by deduplicating, sampling, and routing noisy data to lower cost storage.

Enforce data quality by transforming, filtering, and enriching data with built-in processors and by adopting schemas to accelerate observability and reduce security investigation times.

Protect sensitive data by redacting and obfuscating sensitive information before it gets to a destination.

Route data using conditional logic to maintain compliance with data residency laws.

Key features

AWS is a leading contributor to the OpenSearch project, which many customers use. You’ll get all of the new innovations for OpenSearch Data Prepper within this managed service. Beyond those features, which the community drives and contributes to, Amazon OpenSearch Ingestion also brings these capabilities:

AWS-managed software installation and patching
AWS monitors and repairs the service, 24x7
AWS upgrades versions
Zero downtime for updates and upgrades
Availability SLA: 99.9%
Serverless, with automatic scaling for ingestion workloads

Customers and partners

CyberArk customer review

“At CyberArk EPM (Endpoint Privilege Manager), a cloud-based multi-tenant system, we manage millions of endpoints and collect high-traffic data events using AWS OpenSearch. By leveraging Amazon OpenSearch Ingestion, we replaced our previous self-managed Logstash pipeline with an AWS-managed one, which eliminated the burden of managing our own infrastructure and provided us with a more scalable, cost-effective, reliable, and secure architecture for our data ingestion. This decision was made with the added advantage of CyberArk EPM achieving FedRAMP High In-Process status, while Amazon OpenSearch Ingestion already being FedRAMP compliant, allowing us to keep high level of security in our offering.”

Ori Doolman, Senior Software Architect - CyberArk EPM

Calyptia customer review

“At Calyptia we’ve been working with data ingestion for 12+ years as the creators and maintainers of the Cloud Native Computing Foundation project, Fluentd and Fluent Bit. With the latest versions of these projects we are excited for users to gain more control in their first mile with the combination of the Fluent projects and OpenSearch Ingestion Service. With the ingestion service users can continue to scale agents and processing without having to worry about managing and maintaining infrastructure.”

Anurag Gupta, Co-founder - Calyptia

Confluent customer review

“We are thrilled to partner with the Amazon OpenSearch team as they build their OpenSearch Ingestion service, which will provide a native integration with Apache Kafka and Confluent. This integration will help our joint customers access real-time data via Apache Kafka inside OpenSearch so they can rethink customer experiences, build real-time backend operations, or launch new products and services. As the leading contributor to Apache Kafka, Confluent has 10X’ed Kafka by building a complete and cloud-native data streaming platform that allows you to move data from wherever it is created to where businesses can take action in the multi-SaaS world we all live in. This allows OpenSearch users to benefit from the 100's of data sources that Confluent is integrated with. We are excited to see what our joint customers build as they set data in motion with Confluent and OpenSearch.”

Paul Mac Farland, VP of Partner & Innovation Ecosystem - Confluent

Resources

Blog

Top strategies for high volume tracing with Amazon OpenSearch Ingestion

Read the blog

Documentation

Amazon OpenSearch Ingestion Developer Guide

View the guide

Ingestion FAQs

Amazon OpenSearch Ingestion is a data ingestion tier that enables you to filter, enrich, transform, normalize, and aggregate data for downstream analytics and visualization in Amazon OpenSearch domains and Amazon OpenSearch Serverless collections. Amazon OpenSearch Ingestion allows you to create custom data pipelines to improve the operational view of your applications. The serverless nature of Amazon OpenSearch Ingestion abstracts away the complexity of self-managing data pipelines and ensures that their processing capacity scales automatically with the demands of your workloads. With Amazon OpenSearch Ingestion, you can:

Realize storage cost reductions through data deduplication and sampling, which prevent noisy data from being indexed in Amazon OpenSearch.
Enforce data quality and adopt common schemas by transforming, formatting, and enriching data before it is indexed in Amazon OpenSearch domains, making it easier to troubleshoot issues.
Redact or obfuscate sensitive information before it gets to a destination, enabling compliance with data residency laws.
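
As a rough illustration of the three techniques listed above, the following Python sketch deduplicates, samples, and redacts a handful of log records in memory. It is a conceptual toy, not how Amazon OpenSearch Ingestion or Data Prepper implements these steps: the function names, the email pattern, and the record layout are all assumptions made for this example.

```python
import hashlib
import re

# Hypothetical pattern for this example; real redaction rules are workload specific.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")


def deduplicate(records):
    """Yield each distinct record once, comparing records by a content hash."""
    seen = set()
    for record in records:
        key = hashlib.sha256(repr(sorted(record.items())).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            yield record


def sample_every(records, n):
    """Keep one record out of every n (simple systematic sampling)."""
    for index, record in enumerate(records):
        if index % n == 0:
            yield record


def redact_emails(record):
    """Return a copy of the record with email addresses masked in 'message'."""
    return {**record, "message": EMAIL.sub("[REDACTED]", record["message"])}


events = [
    {"message": "user alice@example.com logged in"},
    {"message": "user alice@example.com logged in"},  # exact duplicate
    {"message": "heartbeat ok"},
    {"message": "user bob@example.com logged out"},
]

unique = list(deduplicate(events))
sampled = list(sample_every(unique, 2))
cleaned = [redact_emails(event) for event in sampled]
print(cleaned)
```

In a managed pipeline these steps would be expressed as processors in the pipeline configuration rather than as application code.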

An Amazon OpenSearch Ingestion pipeline consists of three major components:

Source is the input component of a pipeline. It defines the mechanism through which a pipeline consumes records. The source can consume records either by receiving data over HTTP/S or by reading from external third-party endpoints.
Processors are intermediate processing units that can filter, transform, and enrich records into a desired format before publishing them to the sink. The processor is an optional component of a pipeline. If you don't define a processor, records are published in the format defined in the source. You can have more than one processor. Processors are executed in the order that you define them in the pipeline.
Sink is the output component of a pipeline. It defines one or more destinations to which a pipeline publishes records. A sink can also be another pipeline, which allows you to chain multiple pipelines together.
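
The relationship between the three components can be sketched with a small in-memory model. This is only a conceptual illustration, assuming a record is a plain dictionary: the `Pipeline` class, `as_sink`, and the two example processors are hypothetical names for this sketch and are not part of Data Prepper or any AWS API. It shows processors running in their defined order, a pipeline with several sinks, and one pipeline acting as the sink of another.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List, Optional

Record = dict
# A processor returns a transformed record, or None to drop the record.
Processor = Callable[[Record], Optional[Record]]
# A sink receives the batch of records that survived processing.
Sink = Callable[[List[Record]], None]


@dataclass
class Pipeline:
    name: str
    processors: List[Processor] = field(default_factory=list)  # optional
    sinks: List[Sink] = field(default_factory=list)            # one or more

    def run(self, source: Iterable[Record]) -> None:
        published: List[Record] = []
        for record in source:
            current: Optional[Record] = record
            # Processors are applied in the order they were defined.
            for process in self.processors:
                current = process(current)
                if current is None:
                    break  # record was filtered out
            if current is not None:
                published.append(current)
        for sink in self.sinks:
            sink(published)

    def as_sink(self) -> Sink:
        # Lets another pipeline publish into this one (pipeline chaining).
        return self.run


def drop_debug(record: Record) -> Optional[Record]:
    return None if record.get("level") == "DEBUG" else record


def add_env(record: Record) -> Record:
    return {**record, "env": "prod"}


indexed: List[Record] = []
downstream = Pipeline("enrich", processors=[add_env], sinks=[indexed.extend])
upstream = Pipeline("filter", processors=[drop_debug], sinks=[downstream.as_sink()])

upstream.run([
    {"level": "DEBUG", "msg": "cache miss"},
    {"level": "ERROR", "msg": "timeout"},
])
print(indexed)  # [{'level': 'ERROR', 'msg': 'timeout', 'env': 'prod'}]
```

A pipeline built with no processors passes records through unchanged, which mirrors the statement that records are published in the format defined in the source.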

Amazon OpenSearch Ingestion supports ingesting all types of data that you would normally index in an Amazon OpenSearch domain. This includes, but is not limited to, structured, unstructured, textual, numerical, and geospatial data. OpenSearch Ingestion also supports all three pillars of observability data: logs, metrics, and traces. You can use OpenSearch Ingestion, with its rich ecosystem of data sources, processors, and sinks, to transform your data before storing it in Amazon OpenSearch domains. With OpenSearch Ingestion, you no longer have to write custom AWS Lambda functions or self-manage Logstash and Elasticsearch ingest nodes to ingest data that needs to be indexed in Amazon OpenSearch clusters. Refer to the documentation for the list of sources, processors, and sinks supported by Amazon OpenSearch Ingestion.

Amazon OpenSearch Ingestion is a data ingestion tier that pre-processes data before it is indexed in Amazon OpenSearch Service. OpenSearch Ingestion is built with Data Prepper, which is a component of the OpenSearch project, and supports all data formats, sources, processors, and sinks supported by Data Prepper.

To get started with Amazon OpenSearch Ingestion, you begin by defining a data pipeline. An OpenSearch Ingestion pipeline is the core of your business logic and consists of a source, an optional processor or series of processors, and a sink. You define your pipeline configuration in a YAML file that contains the details of your source, processors, and sinks. OpenSearch Ingestion also lets you set a minimum and maximum capacity, in OpenSearch Compute Units for Ingestion (OCUs), for each pipeline. Finally, you can choose how your data reaches your OpenSearch Ingestion pipelines:

VPC access: For VPC access, we establish an AWS PrivateLink connection from your VPC to the Amazon OpenSearch Ingestion pipeline. This provides private connectivity to your pipelines without exposing your traffic to the public internet.
Public access: In this network configuration, data sent to your OpenSearch Ingestion pipelines travels over the public internet.

You can create a data pipeline through the AWS Management Console or the AWS Command Line Interface (AWS CLI).
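
Pipelines can also be created programmatically. The following is a minimal sketch, assuming the AWS SDK for Python (boto3) and its `osis` client, of how the pieces described above (a YAML configuration, minimum and maximum OCUs, and the network access choice) map onto a `create_pipeline` request. The pipeline name, domain endpoint, IAM role ARN, and subnet ID are placeholders, the helper function names are hypothetical, and the YAML body is an illustrative Data Prepper style configuration rather than a verified one. Omitting `VpcOptions` is assumed here to request public access; confirm that against the API reference before relying on it.

```python
# Illustrative pipeline configuration; the endpoint and role ARN are placeholders.
PIPELINE_YAML = """\
version: "2"
log-pipeline:
  source:
    http:
      path: "/log/ingest"
  sink:
    - opensearch:
        hosts: ["https://search-example-domain.us-east-1.es.amazonaws.com"]
        index: "application_logs"
        aws:
          sts_role_arn: "arn:aws:iam::123456789012:role/example-pipeline-role"
          region: "us-east-1"
"""


def build_create_pipeline_request(name, config_body, min_units, max_units,
                                  subnet_ids=None, security_group_ids=None):
    """Assemble keyword arguments for the osis create_pipeline call."""
    if not 1 <= min_units <= max_units:
        raise ValueError("expected 1 <= min_units <= max_units")
    request = {
        "PipelineName": name,
        "MinUnits": min_units,          # minimum OCUs for the pipeline
        "MaxUnits": max_units,          # maximum OCUs for the pipeline
        "PipelineConfigurationBody": config_body,
    }
    if subnet_ids:
        # VPC access: the pipeline endpoint is reachable only from these subnets.
        vpc_options = {"SubnetIds": list(subnet_ids)}
        if security_group_ids:
            vpc_options["SecurityGroupIds"] = list(security_group_ids)
        request["VpcOptions"] = vpc_options
    return request


def create_pipeline(request, region="us-east-1"):
    """Send the request to AWS. Requires boto3 and valid credentials."""
    import boto3  # imported lazily so the builder above works without boto3

    client = boto3.client("osis", region_name=region)
    return client.create_pipeline(**request)


request = build_create_pipeline_request(
    "log-pipeline", PIPELINE_YAML, min_units=1, max_units=4,
    subnet_ids=["subnet-0123456789abcdef0"],  # placeholder subnet ID
)
print(sorted(request))
```

Separating the request builder from the network call keeps the capacity and access settings easy to review before anything is sent to AWS; call `create_pipeline(request)` only once the placeholders have been replaced with real values.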

Get started with Amazon OpenSearch Service

Features page

Learn more about Amazon OpenSearch Service

Visit the features page

Support

Connect with migration experts

Explore support options
