AWS Big Data Blog

Category: Intermediate (200)

How BookMyShow saved 80% in costs by migrating to an AWS modern data architecture

This is a guest post co-authored by Mahesh Vandi Chalil, Chief Technology Officer of BookMyShow. BookMyShow (BMS), a leading entertainment company in India, provides an online ticketing platform for movies, plays, concerts, and sporting events. Selling up to 200 million tickets on an annual run rate basis (pre-COVID) to customers in India, Sri Lanka, Singapore, […]

Code conversion from Greenplum to Amazon Redshift: Handling arrays, dates, and regular expressions

Amazon Redshift is a fully managed service for data lakes, data analytics, and data warehouses for startups, medium enterprises, and large enterprises. Amazon Redshift is used by tens of thousands of businesses around the globe for modernizing their data analytics platform. Greenplum is an open-source, massively parallel database used for analytics, mostly for on-premises infrastructure. […]

Near-real-time fraud detection using Amazon Redshift Streaming Ingestion with Amazon Kinesis Data Streams and Amazon Redshift ML

The importance of data warehouses and analytics performed on data warehouse platforms has been increasing steadily over the years, with many businesses coming to rely on these systems as mission-critical for both short-term operational decision-making and long-term strategic planning. Traditionally, data warehouses are refreshed in batch cycles, for example, monthly, weekly, or daily, so that […]

Monitor AWS workloads without a single line of code with Logz.io and Kinesis Firehose

February 9, 2024: Amazon Kinesis Data Firehose has been renamed to Amazon Data Firehose. Read the AWS What’s New post to learn more. Observability data provides near real-time insights into the health and performance of AWS workloads, so that engineers can quickly address production issues and troubleshoot them before widespread customer impact. As AWS workloads […]

Introducing native Delta Lake table support with AWS Glue crawlers

June 2023: This post was reviewed and updated for accuracy. Delta Lake is an open-source project that helps implement modern data lake architectures commonly built on Amazon S3 or other cloud storages. With Delta Lake, you can achieve ACID transactions, time travel queries, CDC, and other common use cases on the cloud. Delta Lake is […]

Getting started with AWS Glue Data Quality for ETL Pipelines

June 2023: This post was reviewed and updated with the latest release from AWS Glue Data Catalog. Today, hundreds of thousands of customers use data lakes for analytics and machine learning. However, data engineers have to cleanse and prepare this data before it can be used. The underlying data has to be accurate and recent […]

Amazon EMR Serverless cost estimator

Amazon EMR Serverless is a serverless option in Amazon EMR that makes it easy for data analysts and engineers to run applications using open-source big data analytics frameworks such as Apache Spark and Hive without configuring, managing, and scaling clusters or servers. You get all the features of the latest open-source frameworks with the performance-optimized […]

Analyze real-time streaming data in Amazon MSK with Amazon Athena

Recent advances in ease of use and scalability have made streaming data easier to generate and use for real-time decision-making. Coupled with market forces that have forced businesses to react more quickly to industry changes, more and more organizations today are turning to streaming data to fuel innovation and agility. Amazon Managed Streaming for Apache […]

Create, Train and Deploy Multi Layer Perceptron (MLP) models using Amazon Redshift ML

Amazon Redshift is a fully managed and petabyte-scale cloud data warehouse which is being used by tens of thousands of customers to process exabytes of data every day to power their analytics workloads. Amazon Redshift comes with a feature called Amazon Redshift ML which puts the power of machine learning in the hands of every […]

How dynamic data masking support in Amazon Redshift helps achieve data privacy and compliance

Amazon Redshift is a fast, petabyte-scale cloud data warehouse delivering the best price–performance. It makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. Today, tens of thousands of customers run business-critical workloads on Amazon Redshift. Dynamic data masking (DDM) support in Amazon Redshift […]