This Guidance demonstrates how you can improve the observability of your data pipelines running on Apache Spark. While this open-source framework provides tools that collect runtime metrics for visibility into low-level data processing activities, those metrics are relatively raw. By using AWS services for extract, transform, and load (ETL) operations alongside Apache Spark, you can enhance the quality and granularity of this telemetry, enabling better insight into optimization opportunities. Improved data pipeline observability also helps you increase efficiency, reduce operational overhead, accelerate troubleshooting, avoid performance bottlenecks, and get more value from your data processing workloads.

Please note: the Disclaimer at the end of this page applies to this Guidance and its sample code.

Architecture Diagram

[Architecture diagram description]

Download the architecture diagram PDF 

Well-Architected Pillars

The AWS Well-Architected Framework helps you understand the pros and cons of the decisions you make when building systems in the cloud. The six pillars of the Framework allow you to learn architectural best practices for designing and operating reliable, secure, efficient, cost-effective, and sustainable systems. Using the AWS Well-Architected Tool, available at no charge in the AWS Management Console, you can review your workloads against these best practices by answering a set of questions for each pillar.

The architecture diagram above is an example of a solution created with Well-Architected best practices in mind. To be fully Well-Architected, follow as many Well-Architected best practices as possible.

  • This Guidance uses an OpenSearch Service observability connector to automate the collection of Apache Spark logs and metrics, then transforms them through OpenSearch Ingestion pipelines, which are highly configurable and can evolve with your needs. OpenSearch Service provides powerful search capabilities, and the built-in OpenSearch Dashboards improve observability, providing visuals that shorten time to insight and aid troubleshooting. (A sketch of defining such a pipeline with the AWS SDK appears after this list.)

    Read the Operational Excellence whitepaper 
  • AWS Identity and Access Management (IAM) lets you control access to the pipeline as well as to the OpenSearch indexes. You can use IAM policies to make sure that metric and log collection processes happen within the security boundaries of their current Apache Spark application, and you can use dedicated management, pipeline, and ingestion roles to enforce the principle of least privilege. Additionally, this Guidance uses Amazon Virtual Private Cloud (Amazon VPC) to isolate network traffic to OpenSearch Service, and it uses AWS Key Management Service (AWS KMS) to encrypt the data before it is stored. (A sketch of a least-privilege ingestion policy appears after this list.)

    Read the Security whitepaper 
  • The OpenSearch Service observability connector collects and sends logs and metrics to OpenSearch Ingestion pipelines, which automatically scale in and out as new logs and metrics are produced. This limits the impact of unexpected activity spikes on OpenSearch Service cluster performance and stability. Additionally, the observability connector’s buffering and sampling features help you further reduce potential ingestion back pressure. To maintain high availability and increase reliability, you can enable multi–Availability Zone (AZ) deployments on the OpenSearch Service cluster, which will then distribute Ingestion OpenSearch Compute Units (Ingestion OCUs) across AZs. (A sketch of a multi-AZ domain configuration appears after this list.)

    Read the Reliability whitepaper 
  • OpenSearch Service lets you specify minimum and maximum Ingestion OCUs for your OpenSearch Ingestion pipeline, and it will automatically scale up and down based on the pipeline’s processing requirements and the load generated by your client application. The observability connector integrates with the native Apache Spark low-level plugin interface to collect data while limiting performance overhead on Apache Spark jobs. Additionally, the built-in custom Apache Spark metric and log collector implements API consumption best practices, such as buffering and exponential backoff, to minimize the impact on the Apache Spark application. (A generic sketch of this buffering and backoff pattern appears after this list.)

    Read the Performance Efficiency whitepaper 
  • The OpenSearch Service observability connector pre-aggregates certain metrics to reduce postprocessing and ingestion volumes, optimizing the volume of metrics and logs produced and reducing the risk of Ingestion OCU overconsumption. The OpenSearch Ingestion pipeline then uses dynamic scaling so that you don’t incur charges for periods of inactivity; instead, you pay only for the pipeline’s effective usage, and you can start and stop pipelines on demand. OpenSearch Service domains can also use UltraWarm nodes to reduce the cost of infrequently accessed indexes. Finally, this Guidance reduces the need for custom-developed components, which can help further optimize the total cost of ownership. (Sketches of stopping an idle pipeline and enabling UltraWarm appear after this list.)

    Read the Cost Optimization whitepaper 
  • OpenSearch Ingestion pipelines are natively serverless and provide elasticity for data ingestion and transformation, minimizing the environmental impact of backend services. This Guidance also supports both provisioned clusters and serverless collections, so you can use OpenSearch Serverless to further optimize for sustainability. Additionally, you can use the insights provided by OpenSearch Dashboards to optimize your Apache Spark workloads and reduce your overall environmental impact. (A sketch of creating a serverless collection appears after this list.)

    Read the Sustainability whitepaper 
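
As a concrete illustration of the ingestion path described under operational excellence, the following Python sketch uses the AWS SDK (boto3) to define a minimal OpenSearch Ingestion pipeline. The pipeline name, domain endpoint, role ARN, and pipeline body are placeholder assumptions for illustration, not values prescribed by this Guidance.

    import boto3

    # Minimal Data Prepper pipeline body: an HTTP source that a Spark-side
    # sender can post to, and an OpenSearch sink. The domain endpoint and
    # role ARN below are placeholders.
    pipeline_yaml = """
    version: "2"
    spark-observability-pipeline:
      source:
        http:
          path: "/logs/ingest"
      sink:
        - opensearch:
            hosts: ["https://search-my-domain.us-east-1.es.amazonaws.com"]
            index: "spark-logs"
            aws:
              region: "us-east-1"
              sts_role_arn: "arn:aws:iam::111122223333:role/pipeline-role"
    """

    osis = boto3.client("osis", region_name="us-east-1")
    osis.create_pipeline(
        PipelineName="spark-observability",
        MinUnits=1,  # floor for Ingestion OCUs
        MaxUnits=4,  # ceiling; the service scales within this range
        PipelineConfigurationBody=pipeline_yaml,
    )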
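
To make the least-privilege point concrete, here is a hedged sketch of an IAM policy that grants a client only the osis:Ingest action on a single pipeline. The account ID, Region, pipeline name, and policy name are hypothetical.

    import json

    import boto3

    # Least-privilege policy: the Spark-side sender may only ingest into one
    # specific OpenSearch Ingestion pipeline. The ARN is a placeholder.
    ingest_policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "osis:Ingest",
                "Resource": "arn:aws:osis:us-east-1:111122223333:pipeline/spark-observability",
            }
        ],
    }

    iam = boto3.client("iam")
    iam.create_policy(
        PolicyName="spark-observability-ingest-only",
        PolicyDocument=json.dumps(ingest_policy),
    )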
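
For the reliability point, this sketch creates an OpenSearch Service domain with zone awareness enabled across three AZs. The domain name, engine version, and instance sizing are illustrative assumptions.

    import boto3

    opensearch = boto3.client("opensearch", region_name="us-east-1")

    # Three data nodes spread across three Availability Zones so that the
    # cluster tolerates the loss of a zone. Sizing here is illustrative only.
    opensearch.create_domain(
        DomainName="spark-observability",
        EngineVersion="OpenSearch_2.11",
        ClusterConfig={
            "InstanceType": "r6g.large.search",
            "InstanceCount": 3,
            "ZoneAwarenessEnabled": True,
            "ZoneAwarenessConfig": {"AvailabilityZoneCount": 3},
        },
        EBSOptions={"EBSEnabled": True, "VolumeType": "gp3", "VolumeSize": 100},
    )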
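
The Guidance does not publish the connector’s internals, so the following is only a generic Python sketch of the buffering and exponential-backoff pattern it describes, not the connector’s actual implementation. The ingestion URL is hypothetical, and request signing (SigV4), which real OpenSearch Ingestion endpoints require, is omitted for brevity.

    import json
    import random
    import time
    import urllib.request

    # Hypothetical OpenSearch Ingestion endpoint; a placeholder, not a real URL.
    INGEST_URL = "https://spark-observability-abc123.us-east-1.osis.amazonaws.com/logs/ingest"

    class BufferedMetricSender:
        """Buffers Spark metric records and flushes them in batches with backoff."""

        def __init__(self, batch_size=100, max_retries=5):
            self.batch_size = batch_size
            self.max_retries = max_retries
            self.buffer = []

        def add(self, record):
            # Accumulate records so each Spark metric does not cost one API call.
            self.buffer.append(record)
            if len(self.buffer) >= self.batch_size:
                self.flush()

        def flush(self):
            if not self.buffer:
                return
            payload = json.dumps(self.buffer).encode("utf-8")
            for attempt in range(self.max_retries):
                try:
                    req = urllib.request.Request(
                        INGEST_URL,
                        data=payload,
                        headers={"Content-Type": "application/json"},
                    )
                    urllib.request.urlopen(req, timeout=10)
                    self.buffer.clear()
                    return
                except Exception:
                    # Exponential backoff with jitter: ~1s, 2s, 4s, ...
                    time.sleep((2 ** attempt) + random.random())
            raise RuntimeError("metric batch could not be delivered after retries")

    # Usage: buffer a record, then flush any remainder at application shutdown.
    sender = BufferedMetricSender()
    sender.add({"metric": "executor.runTime", "value": 1234, "appId": "app-001"})
    sender.flush()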
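
For cost optimization, these sketches show stopping and starting a pipeline on demand and enabling UltraWarm on a domain; the pipeline and domain names are placeholders, and the warm node counts are illustrative.

    import boto3

    osis = boto3.client("osis", region_name="us-east-1")

    # Stop the pipeline when no Spark jobs are scheduled so idle Ingestion
    # OCUs are not billed, then start it again before the next run.
    osis.stop_pipeline(PipelineName="spark-observability")
    osis.start_pipeline(PipelineName="spark-observability")

    # Move infrequently accessed indexes to lower-cost UltraWarm storage.
    # Note: UltraWarm requires dedicated master nodes on the domain.
    opensearch = boto3.client("opensearch", region_name="us-east-1")
    opensearch.update_domain_config(
        DomainName="spark-observability",
        ClusterConfig={
            "WarmEnabled": True,
            "WarmType": "ultrawarm1.medium.search",
            "WarmCount": 2,
        },
    )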
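
Finally, for the serverless option mentioned under sustainability, this sketch creates an OpenSearch Serverless time-series collection; the collection name is a placeholder, and the account is assumed to already have an encryption security policy covering it.

    import boto3

    aoss = boto3.client("opensearchserverless", region_name="us-east-1")

    # A time-series collection suits append-heavy Spark logs and metrics.
    # An encryption policy matching this collection name must already exist.
    aoss.create_collection(
        name="spark-observability",
        type="TIMESERIES",
        description="Apache Spark logs and metrics for observability dashboards",
    )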

Implementation Resources

The sample code is a starting point. It is industry validated and prescriptive, but not definitive; it is a peek under the hood to help you begin.

Disclaimer

The sample code; software libraries; command line tools; proofs of concept; templates; or other related technology (including any of the foregoing that are provided by our personnel) is provided to you as AWS Content under the AWS Customer Agreement, or the relevant written agreement between you and AWS (whichever applies). You should not use this AWS Content in your production accounts, or on production or other critical data. You are responsible for testing, securing, and optimizing the AWS Content, such as sample code, as appropriate for production grade use based on your specific quality control practices and standards. Deploying AWS Content may incur AWS charges for creating or using AWS chargeable resources, such as running Amazon EC2 instances or using Amazon S3 storage.

References to third-party services or organizations in this Guidance do not imply an endorsement, sponsorship, or affiliation between Amazon or AWS and the third party. Guidance from AWS is a technical starting point, and you can customize your integration with third-party services when you deploy the architecture.
