Choosing an AWS analytics service

Taking the first step

Introduction

Data is foundational to modern business. People and applications need to securely access and analyze data, which comes from new and diverse sources. The volume of data is also constantly increasing, which can cause organizations to struggle with capturing, storing, and analyzing all the necessary data.

Meeting these challenges means building a modern data architecture that breaks down all of your data silos, including third-party data, for analytics and insights, and makes that data accessible to everyone in the organization, in one place, with end-to-end governance. It's also increasingly important to connect your analytics and machine learning (ML) systems to enable predictive analytics.

This decision guide helps you ask the right questions to build your modern data architecture on AWS services. It explains how to break down your data silos (by connecting your data lake and data warehouses), your system silos (by connecting ML and analytics), and your people silos (by putting data in the hands of everyone in your organization).

This eight-minute excerpt is from a one-hour presentation by Sirish Chandrasekaran and Rick Sears at re:Invent 2024. It provides an overview of how a fictional company, Maxdome, uses the AI and analytics capabilities of Amazon SageMaker Unified Studio to unlock data insights.

Understand AWS analytics services

A modern data strategy is built with a set of technology building blocks that help you manage, access, analyze, and act on data. It also gives you multiple options to connect to data sources. A modern data strategy should empower your teams to:

  • Use your preferred tools or techniques

  • Use artificial intelligence (AI) to assist with finding answers to specific questions about your data

  • Manage who has access to data with the proper security and data governance controls

  • Break down data silos to give you the best of both data lakes and purpose-built data stores

  • Store any amount of data, at low cost, in open, standards-based data formats

  • Connect your data lake, data warehouse, and other purpose-built services into a coherent whole

AWS offers a variety of services to help you achieve a modern data strategy. The following diagram depicts the AWS services for analytics that this guide covers. The sections that follow provide additional details.

Diagram showing the AWS analytics services
Unified analytics and AI

Amazon SageMaker Unified Studio (preview) provides a unified experience for using data, analytics, and AI capabilities within a single, governed environment. You can streamline access to familiar tools and functionality from purpose-built AWS services like Amazon EMR, AWS Glue, Athena, and Amazon Redshift.

Data processing
  • Amazon Athena helps you analyze unstructured, semi-structured, and structured data stored in Amazon S3. Examples include CSV, JSON, or columnar data formats such as Apache Parquet and Apache ORC. You can use Athena to run ad-hoc queries using ANSI SQL, without the need to aggregate or load the data into Athena. Athena integrates with Amazon QuickSight, AWS Glue Data Catalog, and other AWS services.

  • Amazon EMR is a managed cluster platform that simplifies running big data frameworks, such as Apache Hadoop and Apache Spark, on AWS to process and analyze vast amounts of data. Using these frameworks and related open-source projects, you can process data for analytics purposes and business intelligence workloads. Amazon EMR also lets you transform and move large amounts of data into and out of other AWS data stores and databases, such as Amazon S3.

  • With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. You can visually create, run, and monitor ETL pipelines to load data into your data lakes. Also, you can immediately search and query cataloged data using Athena, Amazon EMR, and Amazon Redshift Spectrum.
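
To make this concrete, here is a minimal sketch, using Python and the AWS SDK for Python (boto3), of the ad-hoc query pattern described above: running ANSI SQL with Athena against a table registered in the AWS Glue Data Catalog. The database, table, and results-bucket names are placeholders rather than values from this guide.

```python
# Minimal sketch: run an ad-hoc ANSI SQL query with Athena against a table
# that is registered in the AWS Glue Data Catalog. All resource names below
# are placeholders for illustration.
import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

response = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS requests FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "analytics_db"},  # Glue Data Catalog database
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = response["QueryExecutionId"]

# Poll until the query finishes, then read the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```

Athena writes the result set to the S3 location you specify, so the client only needs to poll the query state and then fetch the rows.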

Data streaming
  • With Amazon Managed Streaming for Apache Kafka (Amazon MSK), you can build and run applications that use Apache Kafka to process streaming data. Amazon MSK provides the control-plane operations, such as those for creating, updating, and deleting clusters. It lets you use Apache Kafka data-plane operations, such as those for producing and consuming data.

  • With Amazon Kinesis Data Streams, you can collect and process large streams of data records in real time. Typical data includes IT infrastructure log data, application logs, social media, market data feeds, and web clickstream data. A minimal producer sketch follows this list.

  • Amazon Data Firehose is a fully managed service for delivering real-time streaming data to destinations such as Amazon S3, Amazon Redshift, Amazon OpenSearch Service, Splunk, and Apache Iceberg Tables. You can also send data to any custom HTTP endpoint or HTTP endpoints owned by supported third-party service providers, including Datadog, Dynatrace, LogicMonitor, MongoDB, New Relic, Coralogix, and Elastic.

  • With Amazon Managed Service for Apache Flink, you can use Java, Scala, Python, or SQL to process and analyze streaming data. You can author and run code against streaming sources and static sources to perform time-series analytics, feed real-time dashboards, and create metrics.
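
As referenced above, here is a minimal producer sketch, using Python and boto3, that writes clickstream-style records to a Kinesis data stream. The stream name and event fields are placeholders for illustration.

```python
# Minimal sketch: write clickstream-style records to a Kinesis data stream.
# The stream name and record contents are placeholders.
import json

import boto3

kinesis = boto3.client("kinesis", region_name="us-east-1")

events = [
    {"user_id": "u-123", "page": "/checkout", "ts": "2025-01-01T12:00:00Z"},
    {"user_id": "u-456", "page": "/search", "ts": "2025-01-01T12:00:01Z"},
]

for event in events:
    kinesis.put_record(
        StreamName="example-clickstream",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["user_id"],  # records with the same key land on the same shard
    )
```

For higher throughput, producers typically batch records with the put_records call instead of sending them one at a time.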

Business intelligence

Amazon QuickSight gives decision-makers the opportunity to explore and interpret information in an interactive visual environment. In a single data dashboard, Amazon QuickSight can include AWS data, third-party data, big data, spreadsheet data, SaaS data, B2B data, and more. With Amazon QuickSight Q, you can use natural language to ask questions about your data and receive a response. For example, "What are the top-selling categories in California?"

Search analytics

Amazon OpenSearch Service provisions all the resources for your OpenSearch cluster and launches it. It also automatically detects and replaces failed OpenSearch Service nodes, reducing the overhead associated with self-managed infrastructures. You can use OpenSearch Service direct query to analyze data in Amazon S3 and other AWS services.

Data governance

With Amazon DataZone, you can manage and govern access to data by using fine-grained controls. These controls help ensure access with the right level of privileges and context. Amazon DataZone simplifies your architecture by integrating data management services, including Amazon Redshift, Athena, Amazon QuickSight, AWS Glue, on-premises sources, and third-party sources.

Data collaboration

AWS Clean Rooms is a secure collaboration workspace where you can analyze collective datasets without providing access to the raw data. You can collaborate with other companies by choosing the partners with whom you want to collaborate, selecting their datasets, and configuring privacy-enhancing controls for those partners. When you run queries, AWS Clean Rooms reads data from its original location and applies built-in analysis rules to help you maintain control over that data.

Data lake and warehouse
  • Amazon SageMaker Lakehouse unifies your data across Amazon S3 data lakes and Amazon Redshift data warehouses, helping you build powerful analytics, ML, and generative AI applications on a single copy of data. SageMaker Lakehouse provides integrated access controls and supports the open-source Apache Iceberg format for data interoperability and collaboration.

  • Amazon Simple Storage Service (Amazon S3) can store and protect virtually any amount and kind of data, which you can use for your data lake foundation. Amazon S3 provides management features so that you can optimize, organize, and configure access to your data to meet your specific business, organizational, and compliance requirements. Amazon S3 Tables provide S3 storage that’s optimized for analytics workloads. Using standard SQL statements, you can query your tables with query engines that support Iceberg, such as Athena, Amazon Redshift, and Apache Spark.

  • Amazon Redshift is a fully managed, petabyte-scale data warehouse service. Amazon Redshift integrates with SageMaker Lakehouse, allowing you to use its powerful SQL analytic capabilities on your unified data across Amazon Redshift data warehouses and Amazon S3 data lakes. You can also use Amazon Q in Amazon Redshift, which simplifies SQL authoring through natural language.
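
As a minimal sketch of the SQL access described above, the following Python/boto3 snippet uses the Amazon Redshift Data API to run a query against a Redshift Serverless workgroup. The workgroup, database, and table names are placeholders.

```python
# Minimal sketch: run SQL on Amazon Redshift Serverless through the Redshift
# Data API. Workgroup, database, and table names are placeholders.
import time

import boto3

redshift_data = boto3.client("redshift-data", region_name="us-east-1")

stmt = redshift_data.execute_statement(
    WorkgroupName="example-workgroup",  # use ClusterIdentifier=... for a provisioned cluster
    Database="dev",
    Sql="SELECT region, SUM(amount) AS revenue FROM sales GROUP BY region;",
)

# The Data API is asynchronous: poll until the statement settles, then fetch rows.
status = "SUBMITTED"
while status not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)
    status = redshift_data.describe_statement(Id=stmt["Id"])["Status"]

if status == "FINISHED":
    for record in redshift_data.get_statement_result(Id=stmt["Id"])["Records"]:
        print(record)
```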

Consider criteria for AWS analytics services

There are many reasons for building data analytics on AWS. You might need to support a greenfield or pilot project as a first step in your cloud migration journey. Alternatively, you might be migrating an existing workload with as little disruption as possible. Whatever your goal, the following considerations can be useful in making your choice.

Assess data sources and data types

Analyze available data sources and data types to gain a comprehensive understanding of data diversity, frequency, and quality. Understand any potential challenges in processing and analyzing the data. This analysis is crucial because:

  • Data sources are diverse and come from various systems, applications, devices, and external platforms.

  • Data sources have unique structure, format, and frequency of data updates. Analyzing these sources helps in identifying suitable data collection methods and technologies.

  • Analyzing data types, such as structured, semi-structured, and unstructured data, determines the appropriate data processing and storage approaches.

  • Analyzing data sources and types facilitates data quality assessment and helps you anticipate potential data quality issues, such as missing values, inconsistencies, or inaccuracies.

Data processing requirements

Determine data processing requirements for how data is ingested, transformed, cleansed, and prepared for analysis. Key considerations include:

  • Data transformation: Determine the specific transformations needed to make the raw data suitable for analysis. This involves tasks like data aggregation, normalization, filtering, and enrichment (a minimal sketch follows this list).

  • Data cleansing: Assess data quality and define processes to handle missing, inaccurate, or inconsistent data. Implement data cleansing techniques to ensure high-quality data for reliable insights.

  • Processing frequency: Determine whether real-time, near real-time, or batch processing is required based on the analytical needs. Real-time processing enables immediate insights, while batch processing may be sufficient for periodic analyses.

  • Scalability and throughput: Evaluate the scalability requirements for handling data volumes, processing speed, and the number of concurrent data requests. Ensure that the chosen processing approach can accommodate future growth.

  • Latency: Consider the acceptable latency for data processing and the time it takes from data ingestion to analysis results. This is particularly important for real-time or time-sensitive analytics.
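
As referenced in the transformation bullet, these steps are mechanical enough to sketch. The following example uses pandas on a small in-memory dataset; the column names and rules are invented for illustration, and in practice the same logic typically runs in AWS Glue, Amazon EMR, or SQL.

```python
# Minimal sketch of the transformation and cleansing steps described above,
# using pandas on an illustrative dataset.
import pandas as pd

raw = pd.DataFrame(
    {
        "order_id": [1, 2, 2, 3],
        "amount": ["10.5", "not_a_number", "not_a_number", "7.0"],
        "country": ["us", "US", "US", None],
    }
)

# Cleansing: drop duplicates, coerce bad numeric values, normalize and fill missing fields.
clean = raw.drop_duplicates(subset="order_id").copy()
clean["amount"] = pd.to_numeric(clean["amount"], errors="coerce").fillna(0.0)
clean["country"] = clean["country"].str.upper().fillna("UNKNOWN")

# Transformation: aggregate to the shape the analysis needs.
summary = clean.groupby("country")["amount"].sum().reset_index()
print(summary)
```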

Storage requirements

Determine storage needs by evaluating how and where data is stored throughout the analytics pipeline. Important considerations include:

  • Data volume: Assess the amount of data being generated and collected, and estimate future data growth to plan for sufficient storage capacity.

  • Data retention: Define the duration for which data should be retained for historical analysis or compliance purposes. Determine the appropriate data retention policies.

  • Data access patterns: Understand how data will be accessed and queried to choose the most suitable storage solution. Consider read and write operations, data access frequency, and data locality.

  • Data security: Prioritize data security by evaluating encryption options, access controls, and data protection mechanisms to safeguard sensitive information.

  • Cost optimization: Optimize storage costs by selecting the most cost-effective storage solutions based on data access patterns and usage.

  • Integration with analytics services: Ensure seamless integration between the chosen storage solution and the data processing and analytics tools in the pipeline.

Types of data

When deciding on analytics services for the collection and ingestion of data, consider the various types of data that are relevant to your organization's needs and objectives. Common types of data that you might need to consider include:

  • Transactional data: Includes information about individual interactions or transactions, such as customer purchases, financial transactions, online orders, and user activity logs.

  • File-based data: Refers to structured or unstructured data that is stored in files, such as log files, spreadsheets, documents, images, audio files, and video files. Analytics services should support the ingestion of different file formats.

  • Event data: Captures significant occurrences or incidents, such as user actions, system events, machine events, or business events. Events can include any high-velocity data that is captured for on-stream or downstream processing.

Operational considerations

Operational responsibility is shared between you and AWS, with the division of responsibility varying across different levels of modernization. You have the option of self-managing your analytics infrastructure on AWS or leveraging the numerous serverless analytics services to lessen your infrastructure management burden.

Self-managed options grant users greater control over the infrastructure and configurations, but they require more operational effort.

Serverless options abstract away much of the operational burden. They provide automatic scalability, high availability, and robust security features, so users can focus on building analytical solutions and driving insights rather than managing infrastructure and operational tasks. Consider these benefits of serverless analytics solutions:

  • Infrastructure abstraction: Serverless services abstract infrastructure management, relieving users from provisioning, scaling, and maintenance tasks. AWS handles these operational aspects, reducing management overhead.

  • Auto-scaling and performance: Serverless services automatically scale resources based on workload demands, ensuring optimal performance without manual intervention.

  • High availability and disaster recovery: AWS provides high availability for serverless services. AWS manages data redundancy, replication, and disaster recovery to enhance data availability and reliability.

  • Security and compliance: AWS manages security measures, data encryption, and compliance for serverless services, adhering to industry standards and best practices.

  • Monitoring and logging: AWS offers built-in monitoring, logging, and alerting capabilities for serverless services. Users can access detailed metrics and logs through Amazon CloudWatch.
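
As one example of the built-in monitoring mentioned above, the following Python/boto3 sketch reads a metric that Kinesis Data Streams publishes to CloudWatch automatically. The stream name is a placeholder; each serverless service publishes its own namespace and metric names.

```python
# Minimal sketch: read one hour of the IncomingRecords metric for a Kinesis
# data stream from CloudWatch, in 5-minute buckets.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")
now = datetime.now(timezone.utc)

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Kinesis",
    MetricName="IncomingRecords",
    Dimensions=[{"Name": "StreamName", "Value": "example-clickstream"}],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,
    Statistics=["Sum"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])
```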

Type of workload

When building a modern analytics pipeline, deciding which types of workload to support is crucial to meeting different analytical needs effectively. Key decision points to consider for each type of workload include:

Batch workload

  • Data volume and frequency: Batch processing is suitable for large volumes of data with periodic updates.

  • Data latency: Batch processing might introduce some delay in delivering insights compared to real-time processing.

Interactive analysis

  • Data query complexity: Interactive analysis requires low-latency responses for quick feedback.

  • Data visualization: Evaluate the need for interactive data visualization tools to enable business users to explore data visually.

Streaming workloads

  • Data velocity and volume: Streaming workloads require real-time processing to handle high-velocity data.

  • Data windowing: Define data windowing and time-based aggregations for streaming data to extract relevant insights.
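
Windowing itself is normally handled by the stream processor (for example, Amazon Managed Service for Apache Flink), but the underlying idea is simple enough to sketch in plain Python: assign each event to a fixed, tumbling time window and aggregate per window. The event shape and window size below are illustrative only.

```python
# Minimal sketch of tumbling-window aggregation: bucket events into fixed
# 60-second windows keyed by window start time, then count events per window.
from collections import defaultdict

WINDOW_SECONDS = 60

events = [
    {"ts": 1_700_000_005, "page": "/home"},
    {"ts": 1_700_000_042, "page": "/search"},
    {"ts": 1_700_000_065, "page": "/checkout"},
]

counts = defaultdict(int)
for event in events:
    window_start = event["ts"] - (event["ts"] % WINDOW_SECONDS)
    counts[window_start] += 1

for window_start in sorted(counts):
    print(f"window starting at {window_start}: {counts[window_start]} events")
```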

Type of analysis needed

Clearly define the business objectives and the insights you aim to derive from the analytics. Different types of analytics serve different purposes. For example:

  • Descriptive analytics is ideal for gaining a historical overview

  • Diagnostic analytics helps understand the reasons behind past events

  • Predictive analytics forecasts future outcomes

  • Prescriptive analytics provides recommendations for optimal actions

Match your business goals with the relevant types of analytics. Here are some key decision criteria to help you choose the right types of analytics:

  • Data availability and quality: Descriptive and diagnostic analytics rely on historical data, while predictive and prescriptive analytics require sufficient historical data and high-quality data to build accurate models.

  • Data volume and complexity: Predictive and prescriptive analytics require substantial data processing and computational resources. Ensure that your infrastructure and tools can handle the data volume and complexity.

  • Decision complexity: If decisions involve multiple variables, constraints, and objectives, prescriptive analytics may be more suitable to guide optimal actions.

  • Risk tolerance: Prescriptive analytics can provide recommendations, but those recommendations come with associated uncertainties. Ensure that decision-makers understand the risks associated with the analytics outputs.

Evaluate scalability and performance

Assess the scalability and performance needs of the architecture. The design must handle increasing data volumes, user demands, and analytical workloads. Key decision factors to consider include:

  • Data volume and growth: Assess the current data volume and anticipate future growth.

  • Data velocity and real-time requirements: Determine if the data needs to be processed and analyzed in real-time or near real-time.

  • Data processing complexity: Analyze the complexity of your data processing and analysis tasks. For computationally intensive tasks, services such as Amazon EMR provide a scalable and managed environment for big data processing.

  • Concurrency and user load: Consider the number of concurrent users and the level of user load on the system.

  • Auto-scaling capabilities: Consider services that offer auto-scaling capabilities, allowing resources to automatically scale up or down based on demand. This ensures efficient resource utilization and cost optimization.

  • Geographic distribution: Consider services with global replication and low-latency data access if your data architecture needs to be distributed across multiple regions or locations.

  • Cost-performance trade-off: Balance the performance needs with cost considerations. Services with high performance may come at a higher cost.

  • Service-level agreements (SLAs): Check the SLAs provided by AWS services to ensure they meet your scalability and performance expectations.

Data governance

Data governance is the set of processes, policies, and controls you need to implement to ensure effective management, quality, security, and compliance of your data assets. Key decision points to consider include:

  • Data retention policies: Define data retention policies based on regulatory requirements and business needs and establish processes for secure data disposal when it is no longer needed.

  • Audit trail and logging: Decide on the logging and auditing mechanisms to monitor data access and usage. Implement comprehensive audit trails to track data changes, access attempts, and user activities for compliance and security monitoring (a minimal sketch follows this list).

  • Compliance requirements: Understand the industry-specific and geographic data compliance regulations that apply to your organization. Ensure that the data architecture aligns with these regulations and guidelines.

  • Data classification: Classify data based on its sensitivity and define appropriate security controls for each data class.

  • Disaster recovery and business continuity: Plan for disaster recovery and business continuity to ensure data availability and resilience in case of unexpected events or system failures.

  • Third-party data sharing: If sharing data with third-party entities, implement secure data sharing protocols and agreements to protect data confidentiality and prevent data misuse.
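
As referenced in the audit trail bullet, the following Python/boto3 sketch uses AWS CloudTrail to review recent management events for an analytics service. The event source filter is only an example; object-level auditing (for example, S3 data access) requires enabling the corresponding CloudTrail data events.

```python
# Minimal sketch: list the last day of Athena-related management events
# recorded by CloudTrail.
from datetime import datetime, timedelta, timezone

import boto3

cloudtrail = boto3.client("cloudtrail", region_name="us-east-1")
now = datetime.now(timezone.utc)

response = cloudtrail.lookup_events(
    LookupAttributes=[{"AttributeKey": "EventSource", "AttributeValue": "athena.amazonaws.com"}],
    StartTime=now - timedelta(days=1),
    EndTime=now,
    MaxResults=50,
)

for event in response["Events"]:
    print(event["EventTime"], event["EventName"], event.get("Username", "unknown"))
```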

Security

The security of data in the analytics pipeline involves protecting data at every stage of the pipeline to ensure its confidentiality, integrity, and availability. Key decision points to consider include:

  • Access control and authorization: Implement robust authentication and authorization protocols to ensure that only authorized users can access specific data resources.

  • Data encryption: Choose appropriate encryption methods for data stored in databases and data lakes, and for data moving between different components of the architecture (an encryption sketch follows this list).

  • Data masking and anonymization: Consider the need for data masking or anonymization to protect sensitive data, such as PII or sensitive business data, while allowing certain analytical processes to continue.

  • Secure data integration: Establish secure data integration practices to ensure that data flows securely between different components of the architecture, avoiding data leaks or unauthorized access during data movement.

  • Network isolation: Consider services that support Amazon VPC Endpoints to avoid exposing resources to the public internet.
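
As referenced in the data encryption bullet, here is a minimal Python/boto3 sketch that writes an object to Amazon S3 with server-side encryption using an AWS KMS key. The bucket name, object key, and KMS key alias are placeholders.

```python
# Minimal sketch: write an object to S3 encrypted with a customer managed
# KMS key. Bucket, key, and alias names are placeholders.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.put_object(
    Bucket="example-analytics-lake",
    Key="raw/orders/2025/01/01/orders.json",
    Body=b'{"order_id": 1, "amount": 10.5}',
    ServerSideEncryption="aws:kms",
    SSEKMSKeyId="alias/example-data-lake-key",
)
```

Many teams also configure default encryption on the bucket itself so that objects are encrypted even when a writer omits these parameters.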

Plan for integration and data flows

Define the integration points and data flows between various components of the analytics pipeline to ensure seamless data flow and interoperability. Key decision points to consider include:

  • Data source integration: Identify the data sources from which data will be collected, such as databases, applications, files, or external APIs. Decide on the data ingestion methods (batch, real-time, event-based) to bring data into the pipeline efficiently and with minimal latency.

  • Data transformation: Determine the transformations required to prepare data for analysis. Decide on the tools and processes to clean, aggregate, normalize, or enrich the data as it moves through the pipeline.

  • Data movement architecture: Choose the appropriate architecture for data movement between pipeline components. Consider batch processing, stream processing, or a combination of both based on the real-time requirements and data volume.

  • Data replication and sync: Decide on data replication and synchronization mechanisms to keep data up-to-date across all components. Consider real-time replication solutions or periodic data syncs depending on data freshness requirements.

  • Data quality and validation: Implement data quality checks and validation steps to ensure the integrity of data as it moves through the pipeline. Decide on the actions to be taken when data fails validation, such as alerting or error handling (a minimal validation sketch follows this list).

  • Data security and encryption: Determine how data will be secured during transit and at rest. Decide on the encryption methods to protect sensitive data throughout the pipeline, considering the level of security required based on data sensitivity.

  • Scalability and resilience: Ensure that the data flow design allows for horizontal scalability and can handle increased data volumes and traffic.
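
As referenced in the data quality bullet, the following sketch shows the shape of an in-pipeline validation step in plain Python: each record is checked against simple rules, and failures are routed for alerting or reprocessing. The record fields and rules are invented for illustration.

```python
# Minimal sketch: validate records against simple rules and separate failures
# so they can be alerted on or sent to a dead-letter location.
REQUIRED_FIELDS = {"order_id", "amount", "country"}


def validate(record: dict) -> list:
    """Return a list of validation errors for one record (empty means valid)."""
    errors = []
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        errors.append(f"missing fields: {sorted(missing)}")
    if "amount" in record and not isinstance(record["amount"], (int, float)):
        errors.append("amount is not numeric")
    return errors


records = [
    {"order_id": 1, "amount": 10.5, "country": "US"},
    {"order_id": 2, "amount": "oops"},
]

valid, rejected = [], []
for record in records:
    errors = validate(record)
    if errors:
        rejected.append((record, errors))  # alert or route to a dead-letter queue here
    else:
        valid.append(record)

print(f"{len(valid)} valid, {len(rejected)} rejected")
```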

Architect for cost optimization

Building your analytics pipeline on AWS provides various cost optimization opportunities. To ensure cost efficiency, consider the following strategies:

  • Resource sizing and selection: Right-size your resources based on actual workload requirements. Choose AWS services and instance types that match the workload's performance needs while avoiding overprovisioning.

  • Auto-scaling: Implement auto-scaling for services that experience varying workloads. Auto-scaling dynamically adjusts the number of instances based on demand, reducing costs during low-traffic periods.

  • Spot Instances: Use Amazon EC2 Spot Instances for non-critical and fault-tolerant workloads. Spot Instances can significantly reduce costs compared to on-demand instances.

  • Reserved instances: Consider purchasing AWS Reserved Instances to achieve significant cost savings over on-demand pricing for stable workloads with predictable usage.

  • Data storage tiering: Optimize data storage costs by using different storage classes based on data access frequency.

  • Data lifecycle policies: Establish data lifecycle policies to automatically move or delete data based on its age and usage patterns. This helps manage storage costs and keeps data storage aligned with its value.
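
Storage tiering and retention can be expressed directly as Amazon S3 bucket configuration. Here is a minimal Python/boto3 sketch; the bucket name, prefix, and day thresholds are placeholders and should reflect your own access patterns and retention requirements.

```python
# Minimal sketch: tier objects under a prefix to cheaper storage classes as
# they age, and expire them after a retention period.
import boto3

s3 = boto3.client("s3", region_name="us-east-1")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-lake",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-raw-data",
                "Status": "Enabled",
                "Filter": {"Prefix": "raw/"},
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
                "Expiration": {"Days": 365},
            }
        ]
    },
)
```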

Choose AWS analytics services

Now that you know the criteria to evaluate your analytics needs, you are ready to choose which AWS analytics services are right for your organizational needs. The following list aligns sets of services with common capabilities and business goals, grouped by category, and describes what each service is optimized for.

Unified analytics and AI

  • Analytics and AI development: Optimized for using a single environment to access data, analytics, and AI capabilities. Service: Amazon SageMaker Unified Studio (preview)

Data processing

  • Interactive analytics: Optimized for performing real-time data analysis and exploration, which allows users to interactively query and visualize data. Service: Amazon Athena

  • Big data processing: Optimized for processing, moving, and transforming large amounts of data. Service: Amazon EMR

  • Data catalog: Optimized for providing detailed information about the available data, its structure, characteristics, and relationships. Service: AWS Glue

Data streaming

  • Apache Kafka processing of streaming data: Optimized for using Apache Kafka data-plane operations and running open source versions of Apache Kafka. Service: Amazon MSK

  • Real-time processing: Optimized for rapid and continuous data intake and aggregation, including IT infrastructure log data, application logs, social media, market data feeds, and web clickstream data. Service: Amazon Kinesis Data Streams

  • Real-time streaming data delivery: Optimized for delivering real-time streaming data to destinations such as Amazon S3, Amazon Redshift, OpenSearch Service, Splunk, Apache Iceberg Tables, and any custom HTTP endpoint or HTTP endpoints owned by supported third-party service providers. Service: Amazon Data Firehose

  • Building Apache Flink applications: Optimized for using Java, Scala, Python, or SQL to process and analyze streaming data. Service: Amazon Managed Service for Apache Flink

Business intelligence

  • Dashboards and visualizations: Optimized for visually representing complex datasets and providing natural language query of your data. Service: Amazon QuickSight

Search analytics

  • Managed OpenSearch clusters: Optimized for log analytics, real-time application monitoring, and clickstream analysis. Service: Amazon OpenSearch Service

Data governance

  • Managing data access: Optimized for setting up the proper management, availability, usability, integrity, and security of data throughout its lifecycle. Service: Amazon DataZone

Data collaboration

  • Secure data clean rooms: Optimized for collaborating with other companies without sharing raw underlying data. Service: AWS Clean Rooms

Data lake and warehouse

  • Integrated data lake and data warehouse access: Optimized for unifying your data across Amazon S3 data lakes and Amazon Redshift data warehouses. Service: Amazon SageMaker Lakehouse

  • Object storage for data lakes: Optimized for providing a data lake foundation with virtually unlimited scalability and high durability. Service: Amazon S3

  • Data warehousing: Optimized for centrally storing, organizing, and retrieving large volumes of structured and sometimes semi-structured data from various sources within an organization. Service: Amazon Redshift

Use AWS analytics services

You should now have a clear understanding of your business objectives and of the volume and velocity of the data you will ingest and analyze as you begin building your data pipelines.

To help you explore and learn more about each of the available services, we have provided a pathway into how each service works. The following sections provide links to in-depth documentation, hands-on tutorials, and other resources that take you from basic usage to more advanced deep dives.

Amazon Athena
  • Getting started with Amazon Athena

    Learn how to use Amazon Athena to query data and create a table based on sample data stored in Amazon S3, query the table, and check the results of the query.

    Get started with the tutorial

  • Get started with Apache Spark on Athena

    Use the simplified notebook experience in the Athena console to develop Apache Spark applications using Python or Athena notebook APIs.

    Get started with the tutorial

  • Catalog and govern Athena federated queries with SageMaker Lakehouse

    Learn how to connect to, govern, and run federated queries on data stored in Amazon Redshift, DynamoDB (Preview), and Snowflake (Preview).

    Read the blog

  • Analyzing data in Amazon S3 using Athena

    Explore how to use Athena on logs from Elastic Load Balancers, generated as text files in a pre-defined format. We show you how to create a table, partition the data in a format used by Athena, convert it to Parquet, and compare query performance.

    Read the blog post

AWS Clean Rooms
  • Setting up AWS Clean Rooms

    Learn how to set up AWS Clean Rooms in your AWS account.

    Read the guide

  • Unlock data insights across multi-party datasets using AWS Entity Resolution on AWS Clean Rooms without sharing underlying data

    Learn how to use preparation and matching to help improve data matching with collaborators.

    Read the blog post

  • How differential privacy helps unlock insights without revealing data at the individual level

    Learn how AWS Clean Rooms Differential Privacy simplifies applying differential privacy and helps protect the privacy of your users.

    Read the blog

Amazon Data Firehose
  • Tutorial: Create a Firehose stream from console

    Learn how to use the AWS Management Console or an AWS SDK to create a Firehose stream to your chosen destination.

    Read the guide

  • Send data to a Firehose stream

    Learn how to use different data sources to send data to your Firehose stream.

    Read the guide

  • Transform source data in Firehose

    Learn how to invoke your Lambda function to transform incoming source data and deliver the transformed data to destinations.

    Read the guide

Amazon DataZone
  • Getting started with Amazon DataZone

    Learn how to create the Amazon DataZone root domain, obtain the data portal URL, and walk through the basic Amazon DataZone workflows for data producers and data consumers.

    Get started with the tutorial

  • Announcing the general availability of data lineage in the next generation of Amazon SageMaker and Amazon DataZone

    Learn how Amazon DataZone uses automated lineage capture to collect and map lineage information from AWS Glue and Amazon Redshift.

    Get started with the tutorial

Amazon EMR
  • Getting started with Amazon EMR

    Learn how to launch a sample cluster using Spark, and how to run a simple PySpark script stored in an Amazon S3 bucket.

    Get started with the tutorial

  • Getting started with Amazon EMR on Amazon EKS

    We show you how to get started using Amazon EMR on Amazon EKS by deploying a Spark application on a virtual cluster.

    Explore the guide

  • Get started with EMR Serverless

    Explore how Amazon EMR Serverless provides a serverless runtime environment that simplifies the operation of analytics applications that use the latest open source frameworks.

    Get started with the tutorial

AWS Glue
  • Getting started with AWS Glue DataBrew

    Learn how to create your first DataBrew project. You load a sample dataset, run transformations on that dataset, build a recipe to capture those transformations, and run a job to write the transformed data to Amazon S3.

    Get started with the tutorial

  • Transform data with AWS Glue DataBrew

    Learn about AWS Glue DataBrew, a visual data preparation tool that makes it easy for data analysts and data scientists to clean and normalize data to prepare it for analytics and machine learning. Learn how to construct an ETL process using AWS Glue DataBrew.

    Get started with the lab

  • AWS Glue DataBrew immersion day

    Explore how to use AWS Glue DataBrew to clean and normalize data for analytics and machine learning.

    Get started with the workshop

  • Getting started with the AWS Glue Data Catalog

    Learn how to create your first AWS Glue Data Catalog, which uses an Amazon S3 bucket as your data source.

    Get started with the tutorial

  • Data catalog and crawlers in AWS Glue

    Discover how you can use the information in the Data Catalog to create and monitor your ETL jobs.

    Explore the guide

Kinesis Data Streams
  • Getting started tutorials for Amazon Kinesis Data Streams

    Learn how to process and analyze real-time stock data.

    Get started with the tutorials

  • Architectural patterns for real-time analytics using Amazon Kinesis Data Streams, part 1

    Learn about common architectural patterns for two use cases: time-series data analysis and event-driven microservices.

    Read the blog

  • Architectural Patterns for real-time analytics using Amazon Kinesis Data Streams, part 2

    Learn about AI applications with Kinesis Data Streams in three scenarios: real-time generative business intelligence, real-time recommendation systems, and Internet of Things data streaming and inferencing.

    Read the blog

Managed Service for Apache Flink
  • What is Amazon Managed Service for Apache Flink?

    Understand the fundamental concepts of Amazon Managed Service for Apache Flink.

    Explore the guide

  • Amazon Managed Service for Apache Flink Workshop

    In this workshop, you will learn how to deploy, operate, and scale a Flink application with Amazon Managed Service for Apache Flink.

    Attend the virtual workshop

Amazon MSK
  • Getting Started with Amazon MSK

    Learn how to create an Amazon MSK cluster, produce and consume data, and monitor the health of your cluster using metrics.

    Get started with the guide

  • Amazon MSK Workshop

    Go deep with this hands-on Amazon MSK workshop.

    Get started with the workshop

OpenSearch Service
  • Getting started with OpenSearch Service

    Learn how to use Amazon OpenSearch Service to create and configure a test domain.

    Get started with the tutorial

  • Visualizing customer support calls with OpenSearch Service and OpenSearch Dashboards

    Discover a full walkthrough of the following situation: a business receives some number of customer support calls and wants to analyze them. What is the subject of each call? How many were positive? How many were negative? How can managers search or review the transcripts of these calls?

    Get started with the tutorial

  • Getting started with Amazon OpenSearch Serverless workshop

    Learn how to set up a new Amazon OpenSearch Serverless domain in the AWS console. Explore the different types of search queries available, design eye-catching visualizations, and learn how you can secure your domain and documents based on assigned user privileges.

    Get started with the workshop

  • Cost Optimized Vector Database: Introduction to Amazon OpenSearch Service quantization techniques

    Learn how OpenSearch Service supports scalar and product quantization techniques to optimize memory usage and reduce operational costs.

    Read the blog post

Amazon QuickSight
  • Getting started with Amazon QuickSight data analysis

    Learn how to create your first analysis. Use sample data to create either a simple or a more advanced analysis. Or you can connect to your own data to create an analysis.

    Explore the guide

  • Visualizing with QuickSight

    Discover the technical side of business intelligence (BI) and data visualization with AWS. Learn how to embed dashboards into applications and websites, and securely manage access and permissions.

    Get started with the course

  • QuickSight workshops

    Get a head start on your QuickSight journey with workshops.

    Get started with the workshops

Amazon Redshift
  • Getting started with Amazon Redshift Serverless

    Understand the basic flow of Amazon Redshift Serverless to create serverless resources, connect to Amazon Redshift Serverless, load sample data, and then run queries on the data.

    Explore the guide

  • Deploy a data warehouse on AWS

    Learn how to create and configure an Amazon Redshift data warehouse, load sample data, and analyze it using a SQL client.

    Get started with the tutorial

  • Amazon Redshift deep dive workshop

    Explore a series of exercises which help users get started using the Amazon Redshift platform.

    Get started with the workshop

Amazon S3
  • Getting started with Amazon S3

    Learn how to get started with Amazon S3 by creating a bucket, uploading an object, and then downloading and organizing your data.

    Get started with the guide

  • Central storage - Amazon S3 as the data lake storage platform

    Discover how Amazon S3 is an optimal foundation for a data lake because of its virtually unlimited scalability and high durability.

    Read the whitepaper

SageMaker Lakehouse
  • Getting started with SageMaker Lakehouse

    Learn how to create a project and to browse, upload, and query data.

    Read the guide

  • Simplify data access for your enterprise using SageMaker Lakehouse

    Learn how to use preferred analytics, machine learning, and business intelligence engines through an open, Apache Iceberg REST API to help ensure secure access to data with consistent, fine-grained access controls.

    Read the blog

  • Catalog and govern Athena federated queries with SageMaker Lakehouse

    Learn how to connect to, govern, and run federated queries on data stored in Amazon Redshift, DynamoDB, and Snowflake.

    Read the blog

SageMaker Unified Studio
  • Getting started with SageMaker Unified Studio

    Learn how to create a project, add members, and use the sample JupyterLab notebook to begin building.

    Read the guide

  • Introducing the next generation of Amazon SageMaker: The center for all your data, analytics, and AI

    Learn how to get started with data processing, model development, and generative AI app development.

    Read the blog

  • What is Amazon SageMaker Unified Studio?

    Learn about the capabilities of SageMaker Unified Studio and how to access them.

    Read the blog

Explore ways to use AWS analytics services

Editable architecture diagrams

Reference architecture diagrams

Explore architecture diagrams to help you develop, scale, and test your analytics solutions on AWS.

Explore analytics reference architectures

Ready-to-use code

Featured solution

Scalable Analytics Using Apache Druid on AWS

Deploy AWS-built code to help you set up, operate, and manage Apache Druid on AWS in a cost-effective, highly available, resilient, and fault-tolerant hosting environment.

Explore this solution

AWS Solutions

Explore pre-configured, deployable solutions and their implementation guides, built by AWS.

Explore all AWS analytics solutions

Documentation

Analytics whitepapers

Explore whitepapers for further insights and best practices on choosing, implementing, and using the analytics services that best fit your organization.

Explore analytics whitepapers

AWS Big Data Blog

Explore blog posts that address specific big data use cases.

Explore the AWS Big Data blog
