AWS Big Data Blog

Automate data lineage on Amazon MWAA with OpenLineage

In modern data architectures, datasets are combined across an organization using a variety of purpose-built services to unlock insights. As a result, data governance becomes a key component for data consumers and producers to know that their data-driven decisions are based on trusted and accurate datasets. One aspect of data governance is data lineage, which […]

Enable cross-account sharing with direct IAM principals using AWS Lake Formation Tags

With AWS Lake Formation, you can build data lakes with multiple AWS accounts in a variety of ways. For example, you could build a data mesh, implementing a centralized data governance model and decoupling data producers from the central governance. Such data lakes enable the data as an asset paradigm and unleash new possibilities with […]

Build a serverless streaming pipeline with Amazon MSK Serverless, Amazon MSK Connect, and MongoDB Atlas

This post was cowritten with Babu Srinivasan and Robert Walters from MongoDB. Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed, highly available Apache Kafka service. Amazon MSK makes it easy to ingest and process streaming data in real time and use that data easily within the AWS ecosystem. With Amazon MSK […]

How NETSCOUT built a global DDoS awareness platform with Amazon OpenSearch Service

How NETSCOUT built a global DDoS awareness platform with Amazon OpenSearch Service

This post was co-written with Hardik Modi, AVP, Threat and Migitation Products at NETSCOUT. NETSCOUT Omnis Threat Horizon is a global cybersecurity awareness platform providing users with highly contextualized visibility into “over the horizon” threat activity on the global DDoS (Distributed Denial of Service) landscape—threats that could be impacting their industry, their customers, or their […]

Build highly available streams with Amazon Kinesis Data Streams

Many use cases are moving towards a real-time data strategy due to demand for real-time insights, low-latency response times, and the ability to adapt to the changing needs of end-users. For this type of workload, you can use Amazon Kinesis Data Streams to seamlessly provision, store, write, and read data in a streaming fashion. With […]

Build near real-time logistics dashboards using Amazon Redshift and Amazon Managed Grafana for better operational intelligence

Amazon Redshift is a fully managed data warehousing service that is currently helping tens of thousands of customers manage analytics at scale. It continues to lead price-performance benchmarks, and separates compute and storage so each can be scaled independently and you only pay for what you need. It also eliminates data silos by simplifying access […]

Amazon QuickSight AWS re:Invent recap 2022

AWS re:Invent is a learning conference hosted by AWS for the global cloud computing community. Re:Invent was held at the end of 2022 in Las Vegas, Nevada, from November 28 to December 2. Amazon QuickSight powers data-driven organizations with unified business intelligence (BI) at hyperscale. This post walks you through a full recap of QuickSight […]

How BookMyShow saved 80% in costs by migrating to an AWS modern data architecture

This is a guest post co-authored by Mahesh Vandi Chalil, Chief Technology Officer of BookMyShow. BookMyShow (BMS), a leading entertainment company in India, provides an online ticketing platform for movies, plays, concerts, and sporting events. Selling up to 200 million tickets on an annual run rate basis (pre-COVID) to customers in India, Sri Lanka, Singapore, […]

Accelerate orchestration of an ELT process using AWS Step Functions and Amazon Redshift Data API

Extract, Load, and Transform (ELT) is a modern design strategy where raw data is first loaded into the data warehouse and then transformed with familiar Structured Query Language (SQL) semantics leveraging the power of massively parallel processing (MPP) architecture of the data warehouse. When you use an ELT pattern, you can also use your existing […]

Add your own libraries and application dependencies to Spark and Hive on Amazon EMR Serverless with custom images

Amazon EMR Serverless allows you to run open-source big data frameworks such as Apache Spark and Apache Hive without managing clusters and servers. Many customers who run Spark and Hive applications want to add their own libraries and dependencies to the application runtime. For example, you may want to add popular open-source extensions to Spark, […]