AWS Big Data Blog

Category: Analytics

Migrate data into Amazon ES using remote reindex

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details. Amazon OpenSearch Service recently launched support for remote reindexing. This feature adds the ability to copy data to an Amazon OpenSearch Service domain from self-managed Elasticsearch running on-premises, self-managed on Amazon Elastic Compute Cloud (Amazon EC2) on AWS, or another […]

Enable private access to Amazon Redshift from your client applications in another VPC

November 2023: This post was reviewed and updated to include configurations and options for Amazon Redshift Serverless. You can now use an Amazon Redshift-managed VPC endpoint (powered by AWS PrivateLink) to connect to your private Amazon Redshift cluster with the RA3-instance type or Amazon Redshift Serverless within your virtual private cloud (VPC). With an Amazon […]

The following diagram shows the overall architecture to address our two challenges.

Extract multidimensional data from Microsoft SQL Server Analysis Services using AWS Glue

AWS Glue is fully managed service that makes it easier for you to extract, transform, and load (ETL) data for analytics. You can easily create ETL jobs to connect to backend data sources. There are several natively supported data sources, but what if you need to extract data from an unsupported data source? What if […]

Migrate terabytes of data quickly from Google Cloud to Amazon S3 with AWS Glue Connector for Google BigQuery

This blog post was last updated July, 2022 to update the new version of the connector and details on how to push down queries to Google BigQuery. The cloud is often seen as advantageous for data lakes because of better security, faster time to deployment, better availability, more frequent feature and functionality updates, more elasticity, […]

Doing data preparation using on-premises PostgreSQL databases with AWS Glue DataBrew

Today, with AWS Glue DataBrew, data analysts and data scientists can easily access and visually explore any amount of data across their organization directly from their Amazon Simple Storage Service (Amazon S3) data lake, Amazon Redshift data warehouse, and Amazon Aurora and Amazon Relational Database Service (Amazon RDS) databases. Customers can choose from over 250 […]

Orchestrate an Amazon EMR on Amazon EKS Spark job with AWS Step Functions

At re:Invent 2020, we announced the general availability of Amazon EMR on Amazon EKS, a new deployment option for Amazon EMR that allows you to automate the provisioning and management of open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS). With Amazon EMR on EKS, you can now run Spark applications alongside other […]

Build a real-time streaming application using Apache Flink Python API with Amazon Kinesis Data Analytics

August 30, 2023: Amazon Kinesis Data Analytics has been renamed to Amazon Managed Service for Apache Flink. Read the announcement in the AWS News Blog and learn more. Amazon Kinesis Data Analytics is now expanding its Apache Flink offering by adding support for Python. This is exciting news for many of our customers who use […]

Introducing Auto-Tune in Amazon ES

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service. See details. Today we announced Auto-Tune in Amazon OpenSearch Service, an innovation undertaken to automatically optimize resources in Elasticsearch clusters to improve its performance and availability. Auto-Tune gives us a unique opportunity of applying our learnings from operating clusters at cloud scale […]

Build a serverless tracking pixel solution in AWS

August 30, 2023: Amazon Kinesis Data Analytics has been renamed to Amazon Managed Service for Apache Flink. Read the announcement in the AWS News Blog and learn more. Let’s describe the typical use case where a tracking pixel solution, also known as a web beacon, might help you: Analyzing web traffic is critical to understanding […]

The following graph shows that the minimum throughput achieved with the persistent HFile

Amazon EMR 6.2.0 adds persistent HFile tracking to improve performance with HBase on Amazon S3

Apache HBase is an open-source, NoSQL database that you can use to achieve low latency random access to billions of rows. Starting with Amazon EMR 5.2.0, you can enable HBase on Amazon Simple Storage Service (Amazon S3). With HBase on Amazon S3, the HBase data files (HFiles) are written to Amazon S3, enabling data lake […]