AWS Big Data Blog

Simplify data integration pipeline development using AWS Glue custom blueprints

Update August 18, 2021 – AWS Glue custom blueprints are now generally available. Please visit https://docs.aws.amazon.com/glue/latest/dg/blueprints-overview.html to learn more.

Organizations spend significant time developing and maintaining data integration pipelines that hydrate data warehouses, data lakes, and lake houses. As data volume increases, data engineering teams struggle to keep up with new requests from business teams. Although these […]


Simplify Snowflake data loading and processing with AWS Glue DataBrew

Historically, inserting and retrieving data within a single database platform has been easier than performing the same operations across a multi-platform architecture. To simplify bringing in data from multiple database platforms, AWS Glue DataBrew supports multiple data sources via the AWS Glue Data Catalog. However, this requires you to have […]


Doing data preparation using on-premises PostgreSQL databases with AWS Glue DataBrew

Today, with AWS Glue DataBrew, data analysts and data scientists can easily access and visually explore any amount of data across their organization directly from their Amazon Simple Storage Service (Amazon S3) data lake, Amazon Redshift data warehouse, and Amazon Aurora and Amazon Relational Database Service (Amazon RDS) databases. Customers can choose from over 250 […]


Migrate terabytes of data quickly from Google Cloud to Amazon S3 with AWS Glue Connector for Google BigQuery

The cloud is often seen as advantageous for data lakes because of better security, faster time to deployment, better availability, more frequent feature and functionality updates, more elasticity, more geographic coverage, and costs linked to actual utilization. However, recent studies from Gartner and Harvard Business Review show multi-cloud and intercloud architectures are something leaders need […]


Orchestrate an Amazon EMR on Amazon EKS Spark job with AWS Step Functions

At re:Invent 2020, we announced the general availability of Amazon EMR on Amazon EKS, a new deployment option for Amazon EMR that allows you to automate the provisioning and management of open-source big data frameworks on Amazon Elastic Kubernetes Service (Amazon EKS). With Amazon EMR on EKS, you can now run Spark applications alongside other […]


Build a real-time streaming application using Apache Flink Python API with Amazon Kinesis Data Analytics

Amazon Kinesis Data Analytics is now expanding its Apache Flink offering by adding support for Python. This is exciting news for many of our customers who use Python as their primary language for application development. This new feature enables developers to build Apache Flink applications in Python using serverless Kinesis Data Analytics. With Kinesis Data […]


Introducing Auto-Tune in Amazon ES

September 8, 2021: Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service.

Today we announced Auto-Tune in Amazon OpenSearch Service (successor to Amazon Elasticsearch Service), an innovation that automatically optimizes resources in Elasticsearch clusters to improve their performance and availability. Auto-Tune gives us a unique opportunity to apply our learnings from […]


Build a serverless tracking pixel solution in AWS

Let’s describe the typical use case where a tracking pixel solution, also known as a web beacon, might help you: Analyzing web traffic is critical to understanding user behavior in order to improve their experience. Let’s think about a company—Example Company Hotels—that embeds a piece of HTML into a high-traffic, third-party website (example.HighTrafficWebsite.com) to have […]


Extract multidimensional data from Microsoft SQL Server Analysis Services using AWS Glue

AWS Glue is a fully managed service that makes it easier for you to extract, transform, and load (ETL) data for analytics. You can easily create ETL jobs to connect to backend data sources. There are several natively supported data sources, but what if you need to extract data from an unsupported data source? What if […]


Effective data lakes using AWS Lake Formation, Part 2: Creating a governed table for streaming data sources

We announced the general availability of AWS Lake Formation transactions, row-level security, and acceleration at AWS re:Invent 2021. In Part 1 of this series, we explained how to set up a governed table and add objects to it. In this post, we expand on this example, and demonstrate how to ingest streaming data into governed tables […]
