AWS Big Data Blog

Category: Analytics

Power your Kafka Streams application with Amazon MSK and AWS Fargate

Today, companies of all sizes across all verticals design and build event-driven architectures centered around real-time streaming and stream processing. Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that makes it easy for you to build and run applications that use Apache Kafka to process streaming and event data. Apache […]

How Magellan Rx Management used Amazon Redshift ML to predict drug therapeutic conditions

This post is co-written with Karim Prasla and Deepti Bhanti from Magellan Rx Management as the lead authors. Amazon Redshift ML makes it easy for data scientists, data analysts, and database developers to create, train, and use machine learning (ML) models using familiar SQL commands in Amazon Redshift data warehouses. The ML feature can be […]

Easily manage your data lake at scale using AWS Lake Formation Tag-based access control

Thousands of customers are building petabyte-scale data lakes on AWS. Many of these customers use AWS Lake Formation to easily build and share their data lakes across the organization. As the number of tables and users increase, data stewards and administrators are looking for ways to manage permissions on data lakes easily at scale. Customers […]

How Comcast uses AWS to rapidly store and analyze large-scale telemetry data

This blog post is co-written by Russell Harlin from Comcast Corporation. Comcast Corporation creates incredible technology and entertainment that connects millions of people to the moments and experiences that matter most. At the core of this is Comcast’s high-speed data network, providing tens of millions of customers across the country with reliable internet connectivity. This […]

Use ML predictions over Amazon DynamoDB data with Amazon Athena ML

Today’s modern applications use multiple purpose-built database engines, including relational, key-value, document, and in-memory databases. This purpose-built approach improves the way applications use data by providing better performance and reducing cost. However, the approach raises some challenges for data teams that need to provide a holistic view on top of these database engines, and especially […]

Secure connectivity patterns to access Amazon MSK across AWS Regions

August 2023: Amazon MSK now offers a managed feature called multi-VPC private connectivity to simplify connectivity of your Kafka clients to your brokers. Refer this blog to learn more. AWS customers often segment their workloads across accounts and Amazon Virtual Private Cloud (Amazon VPC) to streamline access management while being able to expand their footprint. […]

Effective data lakes using AWS Lake Formation, Part 2: Securing data lakes with row-level access control

Apr 2023: This post was updated with the latest dataset and the updated CloudFormation template. July 2023: This post was reviewed for accuracy. Increasingly, customers are looking at data lakes as a core part of their strategy to democratize data access across the organization. Data lakes enable you to handle petabytes and exabytes of data […]

Benchmark the performance of the new Auto WLM with adaptive concurrency in Amazon Redshift

With Amazon Redshift, you can run a complex mix of workloads on your data warehouse clusters. For example, frequent data loads run alongside business-critical dashboard queries and complex transformation jobs. We also see more and more data science and machine learning (ML) workloads. Each workload type has different resource needs and different service level agreements. […]

How GE Healthcare modernized their data platform using a Lake House Architecture

GE Healthcare (GEHC) operates as a subsidiary of General Electric. The company is headquartered in the US and serves customers in over 160 countries. As a leading global medical technology, diagnostics, and digital solutions innovator, GE Healthcare enables clinicians to make faster, more informed decisions through intelligent devices, data analytics, applications, and services, supported by […]

Visualize data using Apache Spark running on Amazon EMR with Amazon QuickSight

Organizations often need to process large volumes of data before serving to business stakeholders. In this blog, we will learn how to leverage Amazon EMR to process data using Apache Spark, the go-to platform for in-memory analytics of large data volume, and connect business intelligence (BI) tool Amazon QuickSight to serve data to end-users. QuickSight […]