This Guidance demonstrates how you can improve the observability of your data pipelines running on Apache Spark. While this open-source framework provides tools that collect runtime metrics for visibility into low-level data processing activities, these metrics are relatively raw. By using AWS services for extract, transform, and load (ETL) operations alongside Apache Spark, you can enhance data quality and granularity, enabling better insights into optimization opportunities. Additionally, improved data pipeline observability helps increase efficiency, reduce operational overhead, accelerate troubleshooting, avoid performance bottlenecks, and achieve greater value from your data processing workloads.
Architecture Diagram
Step 1
The observability connector for Amazon OpenSearch Service is packaged into Apache Spark applications that run on Amazon EMR, on AWS Glue, or self-hosted on Amazon Elastic Compute Cloud (Amazon EC2). The connector is a Java Archive (JAR) file that you add to the driver and executor classpaths.
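As a minimal sketch of what placing the connector on the classpaths can look like for a self-managed `spark-submit` launch, the helper below assembles the command-line arguments. The configuration keys `spark.driver.extraClassPath`, `spark.executor.extraClassPath`, and `spark.extraListeners` are standard Apache Spark settings. The JAR path and the listener class name are hypothetical placeholders, not the connector's actual names, and the JAR is assumed to already exist at that path on every node.

```python
from typing import List

# Hypothetical placeholders: take the real values from the connector's documentation.
CONNECTOR_JAR = "/opt/spark/jars/observability-connector.jar"
LISTENER_CLASS = "com.example.observability.CustomSparkListener"


def build_spark_submit_args(app_path: str,
                            jar_path: str = CONNECTOR_JAR,
                            listener_class: str = LISTENER_CLASS) -> List[str]:
    """Return spark-submit arguments that put the connector JAR on the
    driver and executor classpaths and register a SparkListener."""
    return [
        "spark-submit",
        # Standard Spark settings that prepend entries to each JVM's classpath.
        "--conf", f"spark.driver.extraClassPath={jar_path}",
        "--conf", f"spark.executor.extraClassPath={jar_path}",
        # Standard Spark setting that instantiates listeners at startup.
        "--conf", f"spark.extraListeners={listener_class}",
        app_path,
    ]
```

Calling `build_spark_submit_args("my_job.py")` returns a list that can be passed to a process launcher. On Amazon EMR or AWS Glue, the same settings would typically be supplied through the service's own job configuration rather than a hand-built command line.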
Step 2
The observability connector includes a custom log appender (Log4j AsyncAppender) and a custom SparkListener. They collect logs and metrics from the application and push the data out through the OpenSearch Service client.
Step 3
The observability connector pushes the data into an Amazon OpenSearch Ingestion pipeline. The pipeline applies data transformation and also acts as an ingestion buffer into OpenSearch Service.
Step 4
Ingestion-related logs and metrics are stored in OpenSearch Service indexes, one for each data type. The data delivery frequency is defined as part of the OpenSearch Ingestion pipeline configuration. Log and metric data are encrypted using an AWS Key Management Service (AWS KMS) key.
Step 5
Prebuilt OpenSearch Dashboards give authenticated users insights into their data pipelines through aggregated views of performance metrics and logs at various levels of granularity, such as Spark application, job run, stage, and partition. The dashboards also provide performance scores calculated from the collected metrics to make analysis easier.
Well-Architected Pillars
The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.
The architecture diagram above is an example of a Solution created with Well-Architected best practices in mind. To be fully Well-Architected, you should follow as many Well-Architected best practices as possible.
Operational Excellence
This Guidance uses an OpenSearch Service observability connector to automate the collection of Apache Spark logs and metrics, then transforms them through OpenSearch Ingestion pipelines, which are highly configurable and can evolve to your needs. OpenSearch Service provides a powerful search capability, and the built-in OpenSearch Dashboards improve observability, providing visuals that shorten time to insights and aid troubleshooting.
Security
AWS Identity and Access Management (IAM) lets you control access to the pipeline as well as to the OpenSearch indexes. You can use IAM policies to make sure that metric and log collection processes happen within the security boundaries of their current Apache Spark application. You can use dedicated management roles, pipeline roles, and ingestion roles to enforce the principle of least privilege. Additionally, this Guidance uses Amazon Virtual Private Cloud (Amazon VPC) for communication with OpenSearch Service to achieve proper network traffic isolation, and it uses AWS KMS to encrypt the data before it is stored in OpenSearch Service.
Reliability
The OpenSearch Service observability connector collects and sends logs and metrics to OpenSearch Ingestion pipelines, which automatically scale in and out as new logs and metrics are produced. This limits the impact of unexpected activity spikes on OpenSearch Service cluster performance and stability. Additionally, the observability connector’s buffering and sampling features help you further reduce potential ingestion back pressure. To maintain high availability and increase reliability, you can enable multi–Availability Zone (AZ) deployments on the OpenSearch Service cluster, which will then distribute Ingestion OpenSearch Compute Units (Ingestion OCUs) across AZs.
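To illustrate how buffering and sampling can relieve ingestion back pressure, here is a small sketch of a bounded buffer that keeps one of every N records. It is not the connector's implementation: the class name `SamplingBuffer` and its parameters `max_size` and `sample_every` are hypothetical, and deterministic sampling is used only to keep the example easy to reason about.

```python
from collections import deque
from typing import Any, Deque, Dict, List


class SamplingBuffer:
    """Keep one of every `sample_every` records in a bounded in-memory buffer."""

    def __init__(self, max_size: int, sample_every: int = 1) -> None:
        if max_size < 1 or sample_every < 1:
            raise ValueError("max_size and sample_every must be >= 1")
        # A deque with maxlen silently drops the oldest record when full.
        self._records: Deque[Dict[str, Any]] = deque(maxlen=max_size)
        self._sample_every = sample_every
        self._seen = 0

    def offer(self, record: Dict[str, Any]) -> bool:
        """Return True if the record was kept, False if it was sampled out."""
        keep = self._seen % self._sample_every == 0
        self._seen += 1
        if keep:
            self._records.append(record)
        return keep

    def drain(self) -> List[Dict[str, Any]]:
        """Return the buffered records as one batch and empty the buffer."""
        batch = list(self._records)
        self._records.clear()
        return batch
```

With `sample_every=2`, every second record is discarded before it reaches the pipeline, and `max_size` caps memory use on the Spark driver or executor if the downstream pipeline slows down. Dropping the oldest records on overflow is one possible policy. Blocking or dropping the newest records are alternatives with different trade-offs.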
Performance Efficiency
OpenSearch Service lets you specify minimum and maximum Ingestion OCUs for your OpenSearch Ingestion pipeline, and it will automatically scale up and down based on the pipeline's processing requirements and the load generated by your client application. The observability connector integrates with the native Apache Spark low-level plugin interface to collect the data while limiting performance overhead on Apache Spark jobs. Additionally, the built-in custom Apache Spark metric and log collector implements API consumption best practices, such as buffering and exponential backoff, to minimize the impact on the Apache Spark application.
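The exponential backoff practice mentioned above can be sketched as follows. This is an illustrative example, not the collector's code: the function names, the default delays, and the injected `send` callable (which stands in for whatever client call delivers a batch to the pipeline) are all assumptions. Jitter, which is commonly added in practice, is left out to keep the behavior deterministic.

```python
import time
from typing import Any, Callable, List, Sequence


def backoff_delays(base_seconds: float, max_seconds: float, attempts: int) -> List[float]:
    """Delays that double from base_seconds and are capped at max_seconds."""
    return [min(max_seconds, base_seconds * (2 ** i)) for i in range(attempts)]


def send_with_backoff(batch: Sequence[Any],
                      send: Callable[[Sequence[Any]], None],
                      max_retries: int = 5,
                      base_seconds: float = 0.5,
                      max_seconds: float = 30.0,
                      sleep: Callable[[float], None] = time.sleep) -> bool:
    """Send a batch, retrying with exponentially growing pauses.

    Returns True on success and False once all retries are exhausted.
    `send` is a hypothetical callable that raises an exception on failure.
    """
    delays = backoff_delays(base_seconds, max_seconds, max_retries)
    for attempt in range(max_retries + 1):
        try:
            send(batch)
            return True
        except Exception:
            if attempt == max_retries:
                return False
            # Wait longer after each failure so a struggling endpoint can recover.
            sleep(delays[attempt])
    return False
```

Sending whole batches rather than single records reduces the number of API calls, and the growing pauses keep a temporarily throttled endpoint from being flooded with retries from many Spark executors at once.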
Cost Optimization
The OpenSearch Service observability connector pre-aggregates certain metrics to reduce postprocessing and ingestion volumes, optimizing the volume of metrics and logs produced. This reduces the risk of Ingestion OCU overconsumption. The OpenSearch Ingestion pipeline then uses dynamic scaling to make sure that you don’t incur charges for periods of inactivity; instead, you only pay for the pipeline’s effective usage. You can also start and stop pipelines on demand. OpenSearch Service domains can also use UltraWarm nodes to reduce the cost of infrequently accessed indexes. Additionally, this Guidance reduces the need for custom-developed components, which can help further optimize the total cost of ownership.
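To show why pre-aggregation lowers ingestion volume, the sketch below rolls per-task metrics up into one document per stage before anything is sent. The field names (`stage_id`, `run_time_ms`, and the derived totals) are hypothetical and chosen only for illustration. The connector's actual metric schema and aggregation rules may differ.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def aggregate_by_stage(task_metrics: Iterable[Dict[str, int]]) -> List[Dict[str, int]]:
    """Collapse per-task metrics into one summary document per stage."""
    totals: Dict[int, Dict[str, int]] = defaultdict(
        lambda: {"task_count": 0, "run_time_ms": 0, "max_run_time_ms": 0}
    )
    for metric in task_metrics:
        stage = totals[metric["stage_id"]]
        stage["task_count"] += 1
        stage["run_time_ms"] += metric["run_time_ms"]
        stage["max_run_time_ms"] = max(stage["max_run_time_ms"], metric["run_time_ms"])
    # Emit stages in ascending order so the output is stable.
    return [{"stage_id": stage_id, **values} for stage_id, values in sorted(totals.items())]
```

A stage with thousands of tasks then produces a single document instead of thousands, while the total and maximum run times still make slow or skewed stages visible.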
Sustainability
OpenSearch Ingestion pipelines are natively serverless and provide elasticity for data ingestion and transformation, minimizing the environmental impact of backend services. This Guidance also supports both provisioned clusters and serverless collections, and you can use OpenSearch Serverless to further optimize sustainability. Additionally, you can use the insights provided by OpenSearch Dashboards to optimize your Apache Spark workloads and reduce your overall environmental impact.
Implementation Resources
The sample code is a starting point. It is industry validated, prescriptive but not definitive, and a peek under the hood to help you begin.
Disclaimer
The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.
References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.