AWS Big Data Blog

Tag: Spark

Diagram to illustrate soft multi-tenancy

Design considerations for Amazon EMR on EKS in a multi-tenant Amazon EKS environment

Many AWS customers use Amazon Elastic Kubernetes Service (Amazon EKS) in order to take advantage of Kubernetes without the burden of managing the Kubernetes control plane. With Kubernetes, you can centrally manage your workloads and offer administrators a multi-tenant environment where they can create, update, scale, and secure workloads using a single API. Kubernetes also […]

Disaster recovery considerations with Amazon EMR on Amazon EC2 for Spark workloads

Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto. Amazon EMR launches all nodes for a given cluster in the same Amazon Elastic Compute Cloud (Amazon EC2) Availability Zone […]

Solution Architecture for the blog post

Simplify and optimize Python package management for AWS Glue PySpark jobs with AWS CodeArtifact

Data engineers use various Python packages to meet their data processing requirements while building data pipelines with AWS Glue PySpark Jobs. Languages like Python and Scala are commonly used in data pipeline development. Developers can take advantage of their open-source packages or even customize their own to make it easier and faster to perform use […]

EMR Notebooks: A managed analytics environment based on Jupyter notebooks

Notebooks are increasingly becoming the standard tool for interactively developing big data applications. It’s easy to see why. Their flexible architecture allows you to experiment with data in multiple languages, test code interactively, and visualize large datasets. To help scientists and developers easily access notebook tools, we launched Amazon EMR Notebooks, a managed notebook environment […]

Best practices for successfully managing memory for Apache Spark applications on Amazon EMR

May 2022: Post was reviewed for messaging and accuracy. Since this post has been published, Amazon EMR has introduced several new features that make it easier to fully utilize your cluster resources by default. The default Apache Spark configurations were updated in EMR 5.28.0 to reflect real-world workloads based both on the specific instance type […]

Improve Apache Spark write performance on Apache Parquet formats with the EMRFS S3-optimized committer

The EMRFS S3-optimized committer is a new output committer available for use with Apache Spark jobs as of Amazon EMR 5.19.0. This committer improves performance when writing Apache Parquet files to Amazon S3 using the EMR File System (EMRFS). In this post, we run a performance benchmark to compare this new optimized committer with existing committer […]

How SmartNews Built a Lambda Architecture on AWS to Analyze Customer Behavior and Recommend Content

This is a guest post by Takumi Sakamoto, a software engineer at SmartNews. SmartNews in their own words: “SmartNews is a machine learning-based news discovery app that delivers the very best stories on the Web for more than 18 million users worldwide.” Data processing is one of the key technologies for SmartNews. Every team’s workload […]