Introduction

Data needs to be securely accessed and analyzed by applications and people. Data is arriving from new and diverse sources, and its volume is increasing at an unprecedented rate. Organizations need to extract value from this data, but struggle to capture, store, and analyze all the data generated by today’s modern businesses.

Meeting these challenges means building a modern data architecture that breaks down all of your data silos for analytics and insights - including third-party data - and puts it in the hands of everyone in the organization, with end-to-end governance. It is also increasingly important to connect your analytics and machine learning (ML) systems to enable predictive analytics. 

This decision guide helps you ask the right questions to build your modern data architecture on AWS services. It explains how to break down your data silos (by connecting your data lake and data warehouses), your system silos (by connecting ML and analytics), and your people silos (by putting data in the hands of everyone in your organization).

This six-minute excerpt is from a one-hour re:Invent 2022 presentation by G2 Krishnamoorthy, VP of AWS Analytics. It provides an overview of AWS analytics services. The full presentation covers the current state of analytics on AWS as well as the latest service innovations around data, and highlights customer successes with AWS analytics.

Understand

A modern data strategy is enabled by a set of technology building blocks that help you manage, access, analyze, and act on data. It also gives you multiple options to connect to data sources. A modern data strategy should empower your teams to:

  • Run analytics or ML using your preferred tools or techniques
  • Manage who has access to data with the proper security and data governance controls
  • Break down data silos to give you the best of both data lakes and purpose-built data stores
  • Store any amount of data, at low cost, and in open, standards-based data formats

The AWS modern data architecture connects your data lake, data warehouse, and other purpose-built services into a coherent whole.

Implementing a modern data strategy on AWS is based on the following five pillars:

Scalable data lakes

To make decisions quickly, you will want to store any amount of data in open formats and be able to break down disconnected data silos. You might also need to empower people in your organization to run analytics or ML (using their preferred tools or techniques), as well as manage who can access specific pieces of data with the proper security and data governance controls.

A modern data architecture starts with the data lake. A data lake lets you store all of your data (relational, non-relational, structured, and unstructured) cost effectively. With AWS, you can move any amount of data from various silos into an Amazon S3 data lake. Amazon S3 then stores the data in standards-based open formats.

Purpose-built for performance and cost

On-premises data pipelines are often retrofitted to the tools you are currently using, providing a sub-optimal experience. AWS provides a broad and deep set of purpose-built data services allowing you to choose the right tool for the right job so you don’t have to compromise on functionality, performance, scale, or cost.

Serverless and easy to use

For many types of analytics needs, AWS provides serverless options designed to enable you to focus on your application, without having to touch any infrastructure.

The process of getting raw data into a state that can be used to derive business insights, which is performed in the extract, transform, and load (ETL) stage of the data pipeline, can be challenging. AWS is moving towards a zero-ETL approach, one that eliminates the need for traditional ETL processes and lets you analyze data where it sits. Features within AWS services that support this approach include:

  • Amazon Aurora zero-ETL integration with Amazon Redshift
  • Amazon Redshift streaming ingestion directly from Amazon Kinesis Data Streams and Amazon MSK
  • Federated query in Amazon Redshift and Amazon Athena
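As an example of the streaming ingestion pattern, the following minimal sketch uses the Amazon Redshift Data API (through the AWS SDK for Python) to map a Kinesis data stream into Redshift and materialize it for querying. The workgroup, database, stream, and IAM role names are placeholders for resources in your own account.

```python
import boto3

# Minimal sketch: Redshift streaming ingestion from Kinesis, issued through the
# Redshift Data API. All names below are placeholders; substitute your own.
redshift_data = boto3.client("redshift-data")

sql_statements = [
    # Map a Kinesis Data Streams source into Redshift as an external schema.
    """
    CREATE EXTERNAL SCHEMA kinesis_schema
    FROM KINESIS
    IAM_ROLE 'arn:aws:iam::111122223333:role/RedshiftStreamingRole';
    """,
    # Materialize the stream; Redshift keeps the view refreshed automatically.
    """
    CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
    SELECT approximate_arrival_timestamp,
           JSON_PARSE(kinesis_data) AS payload
    FROM kinesis_schema."clickstream-events";
    """,
]

for sql in sql_statements:
    redshift_data.execute_statement(
        WorkgroupName="analytics-wg",   # Redshift Serverless workgroup (placeholder)
        Database="dev",
        Sql=sql,
    )
```

Once the materialized view exists, the streaming data can be queried with plain SQL alongside the rest of your warehouse, with no separate ETL pipeline to operate.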

Unified data access, security, and governance

Once you have a centralized data lake and a collection of purpose-built analytics services, you need the ability to access that data wherever it lives, secure it, and apply governance policies that comply with relevant regulations and security best practices.

Governance starts with AWS Lake Formation. This service allows you to access your data wherever it lives, whether it’s in a database, data warehouse, purpose-built data store, or a data lake, and then keep your data secure no matter where you store it.

For data governance, AWS automatically discovers, tags, and catalogs your data and keeps it in sync, and you can centrally define and manage security, governance, and auditing policies to satisfy regulations specific to your industry and geography.
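As a minimal sketch of what centrally managed permissions look like in practice, the following example uses the AWS SDK for Python to grant an analyst role SELECT access on a cataloged table through Lake Formation. The account ID, role, database, and table names are placeholders.

```python
import boto3

# Minimal sketch: grant table-level access through Lake Formation so that the
# policy is enforced consistently across integrated analytics services.
lakeformation = boto3.client("lakeformation")

lakeformation.grant_permissions(
    Principal={
        "DataLakePrincipalIdentifier": "arn:aws:iam::111122223333:role/AnalystRole"
    },
    Resource={
        "Table": {
            "DatabaseName": "sales_db",   # placeholder Data Catalog database
            "Name": "orders",             # placeholder table
        }
    },
    Permissions=["SELECT"],
    PermissionsWithGrantOption=[],
)
```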

Built-in machine learning

AWS offers built-in ML integration as part of our purpose-built analytics services. You can build, train, and deploy ML models using familiar SQL commands, without any prior ML experience.

It is not uncommon to use different types of data stores (relational, non-relational, data warehouses, and analytics services) for different use cases. AWS provides a range of integrations to give you options for training models on your data - or adding inference results right from your data store - without having to export and process your data.

Consider

There are many reasons for building an analytics pipeline on AWS. You may need to support a greenfield or pilot project as a first step in your cloud migration journey. Alternatively, you may be migrating an existing workload with as little disruption as possible. Whatever your goal, the following considerations may be useful in making your choice.

  • Analyze available data sources and data types to gain a comprehensive understanding of data diversity, frequency, and quality. Understand any potential challenges in processing and analyzing the data. This analysis is crucial because:

    • Data sources are diverse and come from various systems, applications, devices, and external platforms.
    • Data sources have unique structure, format, and frequency of data updates. Analyzing these sources helps in identifying suitable data collection methods and technologies.
    • Analyzing data types, such as structured, semi-structured, and unstructured data, determines the appropriate data processing and storage approaches.
    • Analyzing data sources and types facilitates data quality assessment and helps you anticipate potential data quality issues, such as missing values, inconsistencies, or inaccuracies.
  • Determine data processing requirements for how data is ingested, transformed, cleansed, and prepared for analysis. Key considerations include:

    • Data transformation: Determine the specific transformations needed to make the raw data suitable for analysis. This involves tasks like data aggregation, normalization, filtering, and enrichment.
    • Data cleansing: Assess data quality and define processes to handle missing, inaccurate, or inconsistent data. Implement data cleansing techniques to ensure high-quality data for reliable insights.
    • Processing frequency: Determine whether real-time, near real-time, or batch processing is required based on the analytical needs. Real-time processing enables immediate insights, while batch processing may be sufficient for periodic analyses.
    • Scalability and throughput: Evaluate the scalability requirements for handling data volumes, processing speed, and the number of concurrent data requests. Ensure that the chosen processing approach can accommodate future growth.
    • Latency: Consider the acceptable latency for data processing and the time it takes from data ingestion to analysis results. This is particularly important for real-time or time-sensitive analytics.
  • Determine storage needs by assessing how and where data is stored throughout the analytics pipeline. Important considerations include:

    • Data volume: Assess the amount of data being generated and collected, and estimate future data growth to plan for sufficient storage capacity.
    • Data retention: Define the duration for which data should be retained for historical analysis or compliance purposes. Determine the appropriate data retention policies.
    • Data access patterns: Understand how data will be accessed and queried to choose the most suitable storage solution. Consider read and write operations, data access frequency, and data locality.
    • Data security: Prioritize data security by evaluating encryption options, access controls, and data protection mechanisms to safeguard sensitive information.
    • Cost optimization: Optimize storage costs by selecting the most cost-effective storage solutions based on data access patterns and usage.
    • Integration with analytics services: Ensure seamless integration between the chosen storage solution and the data processing and analytics tools in the pipeline.
  • When deciding on analytics services for the collection and ingestion of data, consider the various types of data that are relevant to your organization's needs and objectives. Common types of data that you might need to consider include:

    • Transactional data: Includes information about individual interactions or transactions, such as customer purchases, financial transactions, online orders, and user activity logs.
    • File-based data: Refers to structured or unstructured data that is stored in files, such as log files, spreadsheets, documents, images, audio files, and video files. Analytics services should support the ingestion of different file formats.
    • Event data: Captures significant occurrences or incidents, such as user actions, system events, machine events, or business events. Events can include any high-velocity data that is captured for in-stream or downstream processing.
  • Operational responsibility is shared between you and AWS, with the division of responsibility varying across different levels of modernization. You have the option of self-managing your analytics infrastructure on AWS or using the numerous serverless analytics services to lessen your infrastructure management burden.

    Self-managed options grant users greater control over the infrastructure and configurations, but they require more operational effort.

    Serverless options abstract away much of the operational burden, providing automatic scalability, high availability, and robust security features, allowing users to focus more on building analytical solutions and driving insights rather than managing infrastructure and operational tasks. Consider these benefits of serverless analytics solutions:

    • Infrastructure abstraction: Serverless services abstract infrastructure management, relieving users from provisioning, scaling, and maintenance tasks. AWS handles these operational aspects, reducing management overhead.
    • Auto-Scaling and performance: Serverless services automatically scale resources based on workload demands, ensuring optimal performance without manual intervention.
    • High availability and disaster recovery: AWS provides high availability for serverless services. AWS manages data redundancy, replication, and disaster recovery to enhance data availability and reliability.
    • Security and compliance: AWS manages security measures, data encryption, and compliance for serverless services, adhering to industry standards and best practices.
    • Monitoring and logging: AWS offers built-in monitoring, logging, and alerting capabilities for serverless services. Users can access detailed metrics and logs through Amazon CloudWatch.
  • When building a modern analytics pipeline, deciding on the types of workload to support is crucial to meet different analytical needs effectively. Key decision points to consider for each type of workload include:

    Batch workload

    • Data volume and frequency: Batch processing is suitable for large volumes of data with periodic updates.
    • Data latency: Batch processing might introduce some delay in delivering insights compared to real-time processing.

    Interactive analysis

    • Data query complexity: Interactive analysis requires low-latency responses for quick feedback.
    • Data visualization: Evaluate the need for interactive data visualization tools to enable business users to explore data visually.

    Streaming workloads

    • Data velocity and volume: Streaming workloads require real-time processing to handle high-velocity data.
    • Data windowing: Define data windowing and time-based aggregations for streaming data to extract relevant insights.
  • Clearly define the business objectives and the insights you aim to derive from the analytics. Different types of analytics serve different purposes. For example:

    • Descriptive analytics is ideal for gaining a historical overview
    • Diagnostic analytics helps understand the reasons behind past events
    • Predictive analytics forecasts future outcomes
    • Prescriptive analytics provides recommendations for optimal actions

    Match your business goals with the relevant types of analytics. Here are some key decision criteria to help you choose the right types of analytics:

    • Data availability and quality: Descriptive and diagnostic analytics rely on historical data, while predictive and prescriptive analytics require sufficient historical data and high-quality data to build accurate models.
    • Data volume and complexity: Predictive and prescriptive analytics require substantial data processing and computational resources. Ensure that your infrastructure and tools can handle the data volume and complexity.
    • Decision complexity: If decisions involve multiple variables, constraints, and objectives, prescriptive analytics may be more suitable to guide optimal actions.
    • Risk tolerance: Prescriptive analytics may provide recommendations, but come with associated uncertainties. Ensure that decision-makers understand the risks associated with the analytics outputs.
  • Assess the scalability and performance needs of the architecture. The design must handle increasing data volumes, user demands, and analytical workloads. Key decision factors to consider include:

    • Data volume and growth: Assess the current data volume and anticipate future growth. 
    • Data velocity and real-time requirements: Determine if the data needs to be processed and analyzed in real-time or near real-time.
    • Data processing complexity: Analyze the complexity of your data processing and analysis tasks. For computationally intensive tasks, services such as Amazon EMR provide a scalable and managed environment for big data processing.
    • Concurrency and user load: Consider the number of concurrent users and the level of user load on the system. 
    • Auto-scaling capabilities: Consider services that offer auto-scaling capabilities, allowing resources to automatically scale up or down based on demand. This ensures efficient resource utilization and cost optimization.
    • Geographic distribution: Consider services with global replication and low-latency data access if your data architecture needs to be distributed across multiple regions or locations.
    • Cost-performance trade-off: Balance the performance needs with cost considerations. Services with high performance may come at a higher cost.
    • Service-level agreements (SLAs): Check the SLAs provided by AWS services to ensure they meet your scalability and performance expectations.
  • Data governance is the set of processes, policies, and controls you need to implement to ensure effective management, quality, security, and compliance of your data assets. Key decision points to consider include:

    • Data retention policies: Define data retention policies based on regulatory requirements and business needs and establish processes for secure data disposal when it is no longer needed.
    • Audit trail and logging: Decide on the logging and auditing mechanisms to monitor data access and usage. Implement comprehensive audit trails to track data changes, access attempts, and user activities for compliance and security monitoring.
    • Compliance requirements: Understand the industry-specific and geographic data compliance regulations that apply to your organization. Ensure that the data architecture aligns with these regulations and guidelines.
    • Data classification: Classify data based on its sensitivity and define appropriate security controls for each data class. 
    • Disaster recovery and business continuity: Plan for disaster recovery and business continuity to ensure data availability and resilience in case of unexpected events or system failures.
    • Third-party data sharing: If sharing data with third-party entities, implement secure data sharing protocols and agreements to protect data confidentiality and prevent data misuse.
  • The security of data in the analytics pipeline involves protecting data at every stage of the pipeline to ensure its confidentiality, integrity, and availability. Key decision points to consider include:

    • Access control and authorization: Implement robust authentication and authorization protocols to ensure that only authorized users can access specific data resources.
    • Data encryption: Choose appropriate encryption methods for data stored in databases, data lakes, and during data movement between different components of the architecture.
    • Data masking and anonymization: Consider the need for data masking or anonymization to protect sensitive data, such as PII or sensitive business data, while allowing certain analytical processes to continue.
    • Secure data integration: Establish secure data integration practices to ensure that data flows securely between different components of the architecture, avoiding data leaks or unauthorized access during data movement.
    • Network isolation: Consider services that support VPC endpoints to avoid exposing resources to the public internet.
  • Define the integration points and data flows between various components of the analytics pipeline to ensure seamless data flow and interoperability. Key decision points to consider include:

    • Data source integration: Identify the data sources from which data will be collected, such as databases, applications, files, or external APIs. Decide on the data ingestion methods (batch, real-time, event-based) to bring data into the pipeline efficiently and with minimal latency.
    • Data transformation: Determine the transformations required to prepare data for analysis. Decide on the tools and processes to clean, aggregate, normalize, or enrich the data as it moves through the pipeline.
    • Data movement architecture: Choose the appropriate architecture for data movement between pipeline components. Consider batch processing, stream processing, or a combination of both based on the real-time requirements and data volume.
    • Data replication and sync: Decide on data replication and synchronization mechanisms to keep data up-to-date across all components. Consider real-time replication solutions or periodic data syncs depending on data freshness requirements.
    • Data quality and validation: Implement data quality checks and validation steps to ensure the integrity of data as it moves through the pipeline. Decide on the actions to be taken when data fails validation, such as alerting or error handling.
    • Data security and encryption: Determine how data will be secured during transit and at rest. Decide on the encryption methods to protect sensitive data throughout the pipeline, considering the level of security required based on data sensitivity.
    • Scalability and resilience: Ensure that the data flow design allows for horizontal scalability and can handle increased data volumes and traffic. 
  • Building your analytics pipeline on AWS provides various cost optimization opportunities. To ensure cost efficiency, consider the following strategies:

    • Resource sizing and selection: Right-size your resources based on actual workload requirements. Choose AWS services and instance types that match the workload's performance needs while avoiding overprovisioning.
    • Auto-scaling: Implement auto-scaling for services that experience varying workloads. Auto-scaling dynamically adjusts the number of instances based on demand, reducing costs during low-traffic periods.
    • Spot instances: Use Amazon EC2 Spot Instances for non-critical and fault-tolerant workloads. Spot Instances can significantly reduce costs compared to On-Demand Instances.
    • Reserved instances: Consider purchasing Reserved Instances to achieve significant cost savings over On-Demand pricing for stable workloads with predictable usage.
    • Data storage tiering: Optimize data storage costs by using different storage classes based on data access frequency.
    • Data lifecycle policies: Set up data lifecycle policies to automatically move or delete data based on its age and usage patterns. This helps manage storage costs and keeps data storage aligned with its value (see the sketch after this list).
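As a minimal sketch of the storage tiering and data lifecycle points above, the following example applies an S3 lifecycle configuration that transitions aging objects to lower-cost storage classes and eventually expires them. The bucket name, prefix, and day thresholds are placeholders to adapt to your own retention policies.

```python
import boto3

# Minimal sketch: tier and expire raw analytics data as it ages so storage
# costs stay aligned with the data's value. Bucket and prefix are placeholders.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="example-analytics-raw-data",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-and-expire-raw-data",
                "Filter": {"Prefix": "raw/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},  # infrequent access
                    {"Days": 180, "StorageClass": "GLACIER"},     # archive
                ],
                "Expiration": {"Days": 730},  # delete after two years
            }
        ]
    },
)
```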

Choose

Now that you know the criteria for evaluating your analytics needs, you are ready to choose which AWS analytics services are right for your organization. The following categories align sets of services with what you will need to accomplish for your business goals, such as conducting advanced analytics, performing data management, or predictive analytics and ML.


AWS provides a broad and cost-effective set of analytics services to help you gain insights faster from your data.

Interactive analytics
Optimized for performing real-time data analysis and exploration, which allows users to interactively query and visualize data to gain insights and make data-driven decisions quickly.


Amazon Athena

Amazon Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data where it lives. Analyze data or build applications from an Amazon S3 data lake and 30 data sources, including on-premises data sources or other cloud systems using SQL or Python. Athena is built on open-source Trino and Presto engines and Apache Spark frameworks, with no provisioning or configuration effort required.
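The following minimal sketch shows what querying an S3 data lake with Athena can look like from the AWS SDK for Python; the database, table, and results location are placeholders for resources in your own account.

```python
import time
import boto3

# Minimal sketch: run a SQL query against a cataloged data lake table with
# Athena, wait for completion, and print the result rows.
athena = boto3.client("athena")

query = athena.start_query_execution(
    QueryString="SELECT status, COUNT(*) AS requests FROM web_logs GROUP BY status",
    QueryExecutionContext={"Database": "analytics_db"},
    ResultConfiguration={"OutputLocation": "s3://example-athena-results/"},
)
query_id = query["QueryExecutionId"]

# Poll until the query finishes, then fetch the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    for row in athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]:
        print([col.get("VarCharValue") for col in row["Data"]])
```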

Big data processing
Big data is characterized by three dimensions: volume, velocity, and variety. Big data processing solutions aim to overcome the challenges posed by the sheer scale and complexity of big data.


Amazon EMR

Amazon EMR is the industry-leading cloud big data solution for petabyte-scale data processing, interactive analytics, and machine learning using open-source frameworks such as Apache Spark, Apache Hive, and Presto.
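As an illustration, the following minimal sketch launches a transient EMR cluster that runs a single Spark step and then terminates. The release label, script location, log bucket, and role names are placeholders, and it assumes the default EMR service roles already exist in your account.

```python
import boto3

# Minimal sketch: run one Spark job on a transient EMR cluster that shuts
# itself down when the step completes.
emr = boto3.client("emr")

response = emr.run_job_flow(
    Name="nightly-spark-aggregation",
    ReleaseLabel="emr-6.15.0",
    Applications=[{"Name": "Spark"}],
    LogUri="s3://example-emr-logs/",
    Instances={
        "InstanceGroups": [
            {"InstanceRole": "MASTER", "InstanceType": "m5.xlarge", "InstanceCount": 1},
            {"InstanceRole": "CORE", "InstanceType": "m5.xlarge", "InstanceCount": 2},
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # terminate when the step finishes
    },
    Steps=[
        {
            "Name": "aggregate-events",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://example-code/aggregate_events.py"],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print("Cluster started:", response["JobFlowId"])
```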

Data warehousing
The centralized storage, organization, and retrieval of large volumes of structured and sometimes semi-structured data from various sources within an organization.


Amazon Redshift

Amazon Redshift uses SQL to analyze structured and semi-structured data across data warehouses, operational databases, and data lakes, using AWS-designed hardware and machine learning to deliver the best price performance at any scale.

Real-time analytics
The process of analyzing and processing data as it is generated, received, or ingested, without any significant delay.


Amazon Managed Service for Apache Flink

With Amazon Managed Service for Apache Flink, you can more easily transform and analyze streaming data in real time using Apache Flink.

Operational analytics
The use of real-time data analysis and insights to optimize and improve ongoing operational processes and activities within an organization.


OpenSearch Service

OpenSearch is a distributed, community-driven, Apache 2.0-licensed, 100% open-source search and analytics suite used for a broad set of use cases like real-time application monitoring, log analytics, and website search. OpenSearch provides a highly scalable system for providing fast access and response to large volumes of data with an integrated visualization tool, OpenSearch Dashboards, that makes it easy for users to explore their data.
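The following minimal sketch indexes and searches a log document with the opensearch-py client against an OpenSearch Service domain. The endpoint and credentials are placeholders, and production code would typically authenticate with SigV4 or fine-grained access control rather than basic auth.

```python
from opensearchpy import OpenSearch

# Minimal sketch: index an application log event, then run a full-text search.
# Endpoint and credentials are placeholders.
client = OpenSearch(
    hosts=[{"host": "search-example-domain.us-east-1.es.amazonaws.com", "port": 443}],
    http_auth=("admin", "example-password"),
    use_ssl=True,
)

client.index(index="app-logs", body={"level": "ERROR", "message": "payment service timeout"})
client.indices.refresh(index="app-logs")

hits = client.search(index="app-logs", body={"query": {"match": {"message": "timeout"}}})
for hit in hits["hits"]["hits"]:
    print(hit["_source"])
```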

Dashboards and visualizations
Dashboards and visualizations provide a visual representation of complex data sets, making it easier for users to grasp patterns, trends, and insights at a glance. They simplify the understanding of data, even for non-technical users, by presenting information in a visually appealing and intuitive manner.


Amazon QuickSight

Amazon QuickSight powers data-driven organizations with unified business intelligence (BI) at hyperscale. With QuickSight, all users can meet varying analytic needs from the same source of truth through modern interactive dashboards, paginated reports, embedded analytics, and natural language queries.

Visual Data Preparation
Using visual tools and interfaces to explore, clean, transform, and manipulate data in a visual and intuitive manner.


AWS Glue DataBrew

AWS Glue DataBrew is a visual data preparation tool that makes it easy for data analysts and data scientists to clean and normalize data to prepare it for analytics and machine learning. You can choose from over 250 pre-built transformations to automate data preparation tasks, all without the need to write any code. 


These services make it easy for you to combine, move, and replicate data across multiple data stores and your data lake.


Real-time data movement
Real-time data movement involves minimal delay in transferring data, typically within seconds or milliseconds after it becomes available.


Amazon MSK

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is a fully managed service that enables you to build and run applications that use Apache Kafka to process streaming data. Amazon MSK provides the control-plane operations, such as those for creating, updating, and deleting clusters.
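Because Amazon MSK exposes the standard Apache Kafka protocol, existing Kafka clients work unchanged. The following minimal sketch produces and consumes a record with the kafka-python client; the bootstrap broker address and topic are placeholders, and it assumes TLS-encrypted brokers rather than IAM authentication.

```python
from kafka import KafkaProducer, KafkaConsumer

# Minimal sketch: write to and read from an MSK topic over the standard Kafka
# protocol. Broker address and topic name are placeholders.
BOOTSTRAP = "b-1.example.kafka.us-east-1.amazonaws.com:9094"

producer = KafkaProducer(bootstrap_servers=BOOTSTRAP, security_protocol="SSL")
producer.send("orders", b'{"order_id": "1234", "total": 42.50}')
producer.flush()

consumer = KafkaConsumer(
    "orders",
    bootstrap_servers=BOOTSTRAP,
    security_protocol="SSL",
    auto_offset_reset="earliest",
    consumer_timeout_ms=5000,  # stop iterating after 5 seconds of no messages
)
for message in consumer:
    print(message.value)
```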


Amazon Kinesis Data Streams

Amazon Kinesis Data Streams is a serverless streaming data service that makes it easy to capture, process, and store data streams at any scale.
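The following minimal sketch captures an application event in a Kinesis data stream using the AWS SDK for Python; the stream name and event payload are placeholders.

```python
import json
import boto3

# Minimal sketch: put one event onto a Kinesis data stream. The partition key
# controls how records are distributed across shards.
kinesis = boto3.client("kinesis")

event = {"user_id": "u-789", "action": "add_to_cart", "sku": "B0012345"}
kinesis.put_record(
    StreamName="clickstream-events",
    Data=json.dumps(event).encode("utf-8"),
    PartitionKey=event["user_id"],
)
```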


Amazon Kinesis Data Firehose

Amazon Kinesis Data Firehose is an extract, transform, and load (ETL) service that reliably captures, transforms, and delivers streaming data to data lakes, data stores, and analytics services.


Amazon Kinesis Video Streams

Amazon Kinesis Video Streams makes it easy to securely stream video from connected devices to AWS for analytics, ML, playback, and other processing. Kinesis Video Streams automatically provisions and elastically scales all the infrastructure needed to ingest streaming video data from millions of devices. It durably stores, encrypts, and indexes video data in your streams, and allows you to access your data through easy-to-use APIs.


AWS Glue

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, ML, and application development.
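As a minimal sketch of a Glue ETL job, the following script (which runs inside the Glue Spark runtime, where the awsglue library is available) reads a cataloged table, drops a field, and writes Parquet to S3. The database, table, and output path are placeholders.

```python
# Minimal sketch of an AWS Glue ETL job script. It is intended to run as a Glue
# job, not locally; database, table, and output path are placeholders.
from pyspark.context import SparkContext
from awsglue.context import GlueContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read a cataloged source table, drop an unused field, and write Parquet to S3.
orders = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="raw_orders"
)
cleaned = orders.drop_fields(["internal_notes"])

glue_context.write_dynamic_frame.from_options(
    frame=cleaned,
    connection_type="s3",
    connection_options={"path": "s3://example-curated-bucket/orders/"},
    format="parquet",
)
```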

Data governance
A set of processes, policies, and guidelines that ensure the proper management, availability, usability, integrity, and security of data throughout its lifecycle.


Amazon DataZone

Use Amazon DataZone to share, search, and discover data at scale across organizational boundaries. Collaborate on data projects through a unified data analytics portal that gives you a personalized view of all your data while enforcing your governance and compliance policies.


AWS Lake Formation

AWS Lake Formation is a fully managed service that makes it easy to build, secure, and manage data lakes. Lake Formation simplifies and automates many of the complex manual steps that are usually required to create data lakes. These steps include collecting, cleansing, moving, and cataloging data, and securely making that data available for analytics and machine learning.

Object storage for data lakes
A data lake built on AWS uses Amazon S3 as its primary storage platform. Amazon S3 provides an optimal foundation for a data lake because of its virtually unlimited scalability and high durability. 


Amazon S3

Amazon Simple Storage Service (Amazon S3) is an object storage service that offers scalability, data availability, security, and performance. Amazon S3 provides management features so that you can optimize, organize, and configure access to your data to meet your specific business, organizational, and compliance requirements.


AWS Lake Formation

AWS Lake Formation is a fully managed service that makes it easy to build, secure, and manage data lakes. Lake Formation simplifies and automates many of the complex manual steps that are usually required to create data lakes. These steps include collecting, cleansing, moving, and cataloging data, and securely making that data available for analytics and machine learning.

Backup and archive for data lakes
Data lakes, powered by Amazon S3, provide organizations with the availability, agility, and flexibility required for modern analytics approaches to gain deeper insights. Protecting sensitive or business-critical information stored in these S3 buckets is a high priority for organizations.


Amazon S3 Glacier

The Amazon S3 Glacier storage classes are purpose-built for data archiving, providing you with the highest performance, most retrieval flexibility, and the lowest cost archive storage in the cloud. All S3 Glacier storage classes provide virtually unlimited scalability and are designed for 99.999999999% (11 nines) of data durability. 


AWS Backup

AWS Backup is a cost-effective, fully managed, policy-based service that simplifies data protection at scale.

Data catalog
A metadata management tool, providing detailed information about the available data, its structure, characteristics, and relationships.


AWS Glue

AWS Glue is a serverless data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources for analytics, machine learning (ML), and application development.

Third-party data
Third-party data and Software-as-a-Service (SaaS) data are becoming increasingly important to business operations in the modern data-driven landscape.


AWS Data Exchange

AWS Data Exchange is a service that makes it easy for AWS customers to find, subscribe to, and use third-party data in the AWS Cloud.


Amazon AppFlow

With Amazon AppFlow, automate bi-directional data flows between SaaS applications and AWS services in just a few clicks. Run the data flows at the frequency you choose, whether on a schedule, in response to a business event, or on demand.


For predictive analytics use cases, AWS provides a broad set of machine learning services and tools that run on your data lake on AWS.


Frameworks and interfaces
AWS ML infrastructure supports all of the leading ML frameworks. 


AWS Deep Learning AMI

AWS Deep Learning AMIs (DLAMI) provides ML practitioners and researchers with a curated and secure set of frameworks, dependencies, and tools to accelerate deep learning in the cloud. Built for Amazon Linux and Ubuntu, Amazon Machine Images (AMIs) come preconfigured with TensorFlow, PyTorch, Apache MXNet, Chainer, Microsoft Cognitive Toolkit (CNTK), Gluon, Horovod, and Keras, allowing you to quickly deploy and run these frameworks and tools at scale.

Platform services
Fully-managed infrastructure for building, training, and deploying machine learning models. 


Amazon SageMaker

Build, train, and deploy machine learning (ML) models for any use case with fully managed infrastructure, tools, and workflows.
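As a minimal sketch of the SageMaker workflow, the following example uses the SageMaker Python SDK to train the built-in XGBoost algorithm on data in S3 and deploy a real-time endpoint. The execution role, bucket paths, and instance types are placeholders.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

# Minimal sketch: train the built-in XGBoost algorithm and deploy an endpoint.
# Role ARN and S3 paths are placeholders.
session = sagemaker.Session()
role = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"

xgb = Estimator(
    image_uri=sagemaker.image_uris.retrieve("xgboost", session.boto_region_name, version="1.7-1"),
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://example-ml-bucket/models/",
    sagemaker_session=session,
)
xgb.set_hyperparameters(objective="binary:logistic", num_round=100)

# Train on CSV data in S3, then stand up a real-time inference endpoint.
xgb.fit({"train": TrainingInput("s3://example-ml-bucket/train/", content_type="text/csv")})
predictor = xgb.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```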

Direct Data Integrations
Create, train, and deploy ML models using familiar SQL commands.


Amazon Athena ML

Athena ML allows you to build and deploy ML models in Amazon SageMaker and use SQL functions in Amazon Athena to generate predictions from your SageMaker models. 

This enables analytics teams to make model-driven insights available to business users and analysts without the need for specialized tools and infrastructure.


Amazon QuickSight ML

QuickSight ML Insights leverages AWS’ proven ML and natural language capabilities to help you gain deeper insights from your data. These powerful, out-of-the-box features make it easy for anyone to discover hidden trends and outliers, identify key business drivers, and perform powerful what-if analysis and forecasting with no technical expertise or ML experience needed.


Amazon Redshift ML

Amazon Redshift ML makes it easy for data analysts and database developers to create, train, and apply machine learning models using familiar SQL commands in Amazon Redshift data warehouses. With Redshift ML, you can take advantage of Amazon SageMaker, a fully managed machine learning service, without learning new tools or languages. Simply use SQL statements to create and train Amazon SageMaker ML models using your Redshift data and then use these models to make predictions.
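As a minimal sketch of that workflow, the following example uses the Redshift Data API to create a Redshift ML model and shows the SQL that calls the generated prediction function. The table, columns, S3 bucket, and workgroup names are placeholders, and the prediction query is only usable after model training (which runs asynchronously in SageMaker) completes.

```python
import boto3

# Minimal sketch: create a Redshift ML model with SQL, then use its generated
# function for in-database predictions. All names are placeholders.
redshift_data = boto3.client("redshift-data")

create_model_sql = """
CREATE MODEL customer_churn
FROM (SELECT age, tenure_months, monthly_spend, churned FROM customer_activity)
TARGET churned
FUNCTION predict_churn
IAM_ROLE default
SETTINGS (S3_BUCKET 'example-redshift-ml-bucket');
"""

redshift_data.execute_statement(WorkgroupName="analytics-wg", Database="dev", Sql=create_model_sql)

# Once SHOW MODEL reports the model is READY, the generated function can be
# called directly in SQL, for example:
predict_sql = """
SELECT customer_id, predict_churn(age, tenure_months, monthly_spend) AS churn_risk
FROM customer_activity;
"""
```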

Use

You should now have a clear understanding of your business objectives and of the volume and velocity of data you will be ingesting and analyzing, and you are ready to begin building your data pipelines.

To help you learn more about each of the available services, we have provided a pathway to explore how each of them works. The following sections provide links to in-depth documentation, hands-on tutorials, and resources to get you started, from basic usage to more advanced deep dives.

  • Amazon AppFlow

    Getting started with Amazon AppFlow

    Learn how to get started with Amazon AppFlow by setting up your first flow to transfer data between a SaaS application and AWS services.

    Read the guide »

    Tutorial: Transfer data between applications with Amazon AppFlow

    In this tutorial, you learn to transfer data between applications. Specifically, you transfer data both from Amazon S3 to Salesforce, and from Salesforce to Amazon S3.

    Get started with the tutorial »

    Hands On Workshop: Amazon AppFlow Workshop
    You will learn about Amazon AppFlow and how to easily transfer data between popular SaaS services and AWS. The workshop is divided into multiple modules, each targeting a specific SaaS application integration. 

    Get started with the workshop »

  • Amazon Athena

    Getting started with Amazon Athena

    Learn how to use Amazon Athena to query data and create a table based on sample data stored in Amazon S3, query the table, and check the results of the query.

    Get started with the tutorial »

    Amazon Athena

    Get started with Apache Spark on Amazon Athena

    Use the simplified notebook experience in the Amazon Athena console to develop Apache Spark applications using Python or Athena notebook APIs.

    Get started with the tutorial »

    Amazon Athena

    AWS re:Invent 2022 - What's new in Amazon Athena

    Learn how you can bring Athena to your data, applying it to all of your data spanning data lakes, external sources, and more.


    Watch the session »

    Amazon Athena

    Analyzing data in S3 using Amazon Athena
     
    Explore how to use Athena on logs from Elastic Load Balancers, generated as text files in a pre-defined format. We show you how to create a table, partition the data in a format used by Athena, convert it to Parquet, and compare query performance.

    Read the blog post »

  • AWS Data Exchange

    Getting started as an AWS Data Exchange subscriber

    Understand the complete process of becoming a data product subscriber on AWS Data Exchange using the AWS Data Exchange console.

    Explore the guide »

    AWS Data Exchange

    Getting started as an AWS Data Exchange provider

    Understand the complete process of becoming a data product provider on AWS Data Exchange using the AWS Data Exchange console.

    Explore the guide »

    AWS Data Exchange

    AWS Data Exchange workshop

    Explore self-service labs that you can use to understand and learn how AWS services can be used in conjunction with third-party data to add insights to your data analytics projects.  

    Get started with the workshop »

  • Amazon DataZone

    Getting started with Amazon DataZone
     

    Learn how to create the Amazon DataZone root domain, obtain the data portal URL, and walk through the basic Amazon DataZone workflows for data producers and data consumers.

    Explore the guide »

  • Amazon EMR

    Getting started with Amazon EMR

    Learn how to launch a sample cluster using Spark, and how to run a simple PySpark script stored in an Amazon S3 bucket.
     


    Get started with the tutorial »


    Amazon EMR

    Getting started with Amazon EMR on EKS
     

    We show you how to get started using Amazon EMR on EKS by deploying a Spark application on a virtual cluster.

    Get started with the tutorial »

    Amazon EMR

    Get started with EMR Serverless
     

    Explore how EMR Serverless provides a serverless runtime environment that simplifies the operation of analytics applications that use the latest open source frameworks.

    Get started with the tutorial »

    Amazon EMR

    What's new with Amazon EMR  

    Learn about the latest Amazon EMR developments, including Amazon EMR Serverless, Amazon EMR Studio, and more.

    Watch the session »

  • AWS Glue
  • AWS Glue DataBrew

    Getting started with AWS Glue DataBrew

    Learn how to create your first DataBrew project. You load a sample dataset, run transformations on that dataset, build a recipe to capture those transformations, and run a job to write the transformed data to Amazon S3.

    Get started with the tutorial »

    AWS Glue DataBrew

    Transform data with AWS Glue DataBrew

    Learn about AWS Glue DataBrew, a visual data preparation tool that makes it easy for data analysts and data scientists to clean and normalize data to prepare it for analytics and machine learning. Learn how to construct an ETL process using AWS Glue DataBrew.

    Get started with the lab »

    AWS Glue DataBrew

    AWS Glue DataBrew immersion day

    Explore how to use AWS Glue DataBrew to clean and normalize data for analytics and machine learning. 





    Get started with the workshop »

    AWS Glue

    Getting started with the AWS Glue Data Catalog

    Learn how to create your first AWS Glue Data Catalog, which uses an Amazon S3 bucket as your data source.

    Get started with the tutorial »

    AWS Glue

    Data catalog and crawlers in AWS Glue

    Discover how you can use the information in the Data Catalog to create and monitor your ETL jobs.

    Explore the guide »

  • Amazon Managed Service for Apache Flink

    Getting started with Amazon Managed Service for Apache Flink
     

    Understand the fundamental concepts of Amazon Managed Service for Apache Flink.

    Explore the guide »

    Amazon Managed Service for Apache Flink

    Streaming analytics workshop 

    Learn how to build an end-to-end streaming architecture to ingest, analyze, and visualize streaming data in near real-time.

    Get started with the workshop »

    Amazon Managed Service for Apache Flink

    Amazon Managed Service for Apache Flink Workshop
     

    In this workshop, you will learn how to deploy, operate, and scale a Flink application with Amazon Managed Service for Apache Flink.

    Attend the virtual workshop »

  • AWS Lake Formation

    Getting started with AWS Lake Formation

    Learn how to set up AWS Lake Formation.

    Get started with the guide »

    AWS Lake Formation

    Tutorial: Creating a data lake from a JDBC source in Lake Formation

    Learn how to use Lake Formation to build a data lake from a JDBC data source, such as a relational database, ingesting the data into Amazon S3 and securing access to it.

    Get started with the tutorial »

    AWS Lake Formation

    AWS Lake Formation workshop
    In this workshop, you follow a series of labs as four different personas who can benefit from Lake Formation: Data Admin, Data Engineer, Data Analyst, and Data Scientist.

    Get started with the workshop »

  • Amazon Managed Streaming for Apache Kafka (MSK)
    Getting started with Amazon Managed Streaming for Apache Kafka (MSK)

    Learn how to create an MSK cluster, produce and consume data, and monitor the health of your cluster using metrics.

    Get started with the guide »

    Amazon MSK Workshop
    Go deep with this hands-on MSK workshop. 

    Get started with the workshop »

  • Amazon OpenSearch Service

    Getting started with OpenSearch Service

    Learn how to use Amazon OpenSearch Service to create and configure a test domain.

    Get started with the tutorial »

    Amazon OpenSearch Service

    Visualizing customer support calls with OpenSearch Service and OpenSearch Dashboards

    Discover a full walkthrough of the following situation: a business receives some number of customer support calls and wants to analyze them. What is the subject of each call? How many were positive? How many were negative? How can managers search or review the transcripts of these calls?

    Get started with the tutorial »

    Amazon OpenSearch Service

    Getting started with OpenSearch Serverless workshop

    Learn how to set up a new Amazon OpenSearch Serverless domain in the AWS console. Explore the different types of search queries available, design eye-catching visualizations, and learn how you can secure your domain and documents based on assigned user privileges.


    Get started with the workshop »

    Amazon OpenSearch Service

    Building a log analytics solution with OpenSearch Service

    Learn how to size an OpenSearch cluster for a log analytics workload.  

    Read the blog post »

  • Amazon QuickSight

    Getting started with Amazon QuickSight data analysis

    Learn how to create your first analysis. Use sample data to create either a simple or a more advanced analysis. Or you can connect to your own data to create an analysis.


    Explore the guide »

    Amazon QuickSight

    Visualizing with QuickSight


    Discover the technical side of business intelligence (BI) and data visualization with AWS. Learn how to embed dashboards into applications and websites, and securely manage access and permissions.

    Get started with the course »

    Amazon QuickSight

    QuickSight workshops


    Get a head start on your QuickSight journey with workshops.

     



    Get started with the workshops »

  • Amazon Redshift

    Getting started with Amazon Redshift

    Understand the basic flow of Amazon Redshift Serverless to create serverless resources, connect to Amazon Redshift Serverless, load sample data, and then run queries on the data.

    Explore the guide »

    Amazon Redshift

    Modernize your data warehouse


    Explore how you can use the new capabilities of Amazon Redshift to modernize your data warehouse by gaining access to all your data.



    Watch the video »

    Amazon Redshift

    Deploy a data warehouse on AWS


    Learn how to create and configure an Amazon Redshift data warehouse, load sample data, and analyze it using a SQL client.


    Get started with the tutorial »

    Amazon Redshift

    Amazon Redshift deep dive workshop

    Explore a series of exercises which help users get started using the Redshift platform.

    Get started with the workshop »

  • Amazon S3

    Getting started with Amazon S3
    Learn how to get started with Amazon S3 by creating a bucket, uploading an object, and then viewing, downloading, and organizing your data.

    Get started with the guide »

    Amazon S3

    Central storage - Amazon S3 as the data lake storage platform

    Discover how Amazon S3 is an optimal foundation for a data lake because of its virtually unlimited scalability and high durability.


    Read the whitepaper »

  • Amazon SageMaker

    How Amazon SageMaker works


    Explore an overview of machine learning and how Amazon SageMaker works.



    Explore the guide »

    Amazon SageMaker

    Getting started with Amazon SageMaker

    Explore how to set up Amazon SageMaker and start building, training, and deploying your first machine learning model.

    Explore the guide »

    Amazon SageMaker

    Generate machine learning predictions without writing code
     

    Learn how to use Amazon SageMaker Canvas to build ML models and generate accurate predictions without writing a single line of code.

    Get started with the tutorial »

Explore

Architecture diagrams

Explore architecture diagrams to help you develop, scale, and test your analytics solutions on AWS.

Explore architecture diagrams »

 

Whitepapers

Explore whitepapers to help you get started, learn best practices, and understand your analytics options.

Explore whitepapers »

 

AWS Solutions

Explore vetted solutions and architectural guidance for common use cases for analytics services.

Explore solutions »

 
