AWS Big Data Blog

Access Apache Livy using a Network Load Balancer on a Kerberos-enabled Amazon EMR cluster

Amazon EMR is a cloud big data platform for running large-scale distributed data processing jobs, interactive SQL queries, and machine learning (ML) applications using open-source analytics frameworks such as Apache Spark, Apache Hive, and Presto. Amazon EMR supports Kerberos for authentication; you can enable Kerberos on Amazon EMR and put the cluster in a private […]

Secure data movement across Amazon S3 and Amazon Redshift using role chaining and ASSUMEROLE

Data lakes use a ring of purpose-built data services around a central data lake. Data needs to move between these services and data stores easily and securely. The following are some examples of such services: Amazon Simple Storage Service (Amazon S3), which stores structured, unstructured, and semi-structured data Amazon Redshift, a fully managed, petabyte-scale data […]

Amazon EMR on Amazon EKS provides up to 61% lower costs and up to 68% performance improvement for Spark workloads

Amazon EMR on Amazon EKS is a deployment option offered by Amazon EMR that enables you to run Apache Spark applications on Amazon Elastic Kubernetes Service (Amazon EKS) in a cost-effective manner. It uses the EMR runtime for Apache Spark to increase performance so that your jobs run faster and cost less. In our benchmark […]

Build a modern data architecture and data mesh pattern at scale using AWS Lake Formation tag-based access control

January 2023: This post was reviewed and updated to use version 3 of the settings for AWS Lake Formation, which allows for cross-account grants with AWS Resource Access Manager. Customers are exploring building a data mesh on their AWS platform using AWS Lake Formation and sharing their data lakes across the organization. A data mesh […]

Modernize your healthcare clinical quality data repositories with Amazon Redshift Data Vault

With the shift to value-based care, tapping data to its fullest to improve patient satisfaction tops the priority of healthcare executives everywhere. To achieve this, reliance on key technology relevant to their sphere is a must. This is where data lakes can help. A data lake is an architecture that can assist providers in storing, […]

How SailPoint solved scaling issues by migrating legacy big data applications to Amazon EMR on Amazon EKS

This post is co-written with Richard Li from SailPoint. SailPoint Technologies is an identity security company based in Austin, TX. Its software as a service (SaaS) solutions support identity governance operations in regulated industries such as healthcare, government, and higher education. SailPoint distinguishes multiple aspects of identity as individual identity security services, including cloud governance, […]

Author AWS Glue jobs with PyCharm using AWS Glue interactive sessions

Data lakes, business intelligence, operational analytics, and data warehousing share a common core characteristic—the ability to extract, transform, and load (ETL) data for analytics. Since its launch in 2017, AWS Glue has provided serverless data integration service that makes it easy to discover, prepare, and combine data for analytics, machine learning, and application development. AWS […]

Query 10 new data sources with Amazon Athena

When we first launched Amazon Athena, our mission was to make it simple to query data stored in Amazon Simple Storage Service (Amazon S3). Athena customers found it easy to get started and develop analytics on petabyte-scale data lakes, but told us they needed to join their Amazon S3 data with data stored elsewhere. We […]

Build your data pipeline in your AWS modern data platform using AWS Lake Formation, AWS Glue, and dbt Core

dbt has established itself as one of the most popular tools in the modern data stack, and is aiming to bring analytics engineering to everyone. The dbt tool makes it easy to develop and implement complex data processing pipelines, with mostly SQL, and it provides developers with a simple interface to create, test, document, evolve, […]