Get Started with Amazon Redshift

A data warehouse is a central repository of information that can be analyzed to make better informed decisions. Data flows into a data warehouse from transactional systems, relational databases, and other sources, typically on a regular cadence. Business analysts, data scientists, and decision makers access the data through business intelligence (BI) tools, SQL clients, and other analytics applications.

Data and analytics have become indispensable for businesses that want to stay competitive. Businesses use reports, dashboards, and analytics tools to extract insights from their data, monitor business performance, and support decision making. These reports, dashboards, and analytics tools are powered by data warehouses, which store data efficiently to minimize I/O and deliver query results quickly to hundreds or thousands of concurrent users.


A data warehouse architecture consists of three tiers. The bottom tier of the architecture is the database server, where data is loaded and stored. The middle tier consists of the analytics engine that is used to access and analyze the data. The top tier is the front-end client that presents results through reporting, analysis, and data mining tools.

A data warehouse works by organizing data into a schema that describes the layout and type of each data element, such as integer, date field, or string. When data is ingested, it is stored in the tables described by the schema. Query tools use the schema to determine which data tables to access and analyze.
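The role of the schema can be sketched with a minimal example. This is an illustration only, using SQLite in place of a real warehouse engine; the table and column names (a `sales` fact table and a `product` dimension table, a common denormalized "star schema" layout) are hypothetical:

```python
import sqlite3

# Illustrative only: a tiny star schema -- one fact table (sales) and
# one dimension table (product). Names and types are hypothetical.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE product (            -- dimension table
        product_id INTEGER PRIMARY KEY,
        name       TEXT,
        category   TEXT
    );
    CREATE TABLE sales (              -- fact table
        sale_id    INTEGER PRIMARY KEY,
        product_id INTEGER REFERENCES product(product_id),
        sale_date  TEXT,              -- date field
        amount     REAL               -- numeric measure
    );
""")
con.executemany("INSERT INTO product VALUES (?, ?, ?)",
                [(1, "Widget", "Hardware"), (2, "Gadget", "Hardware")])
con.executemany("INSERT INTO sales VALUES (?, ?, ?, ?)",
                [(1, 1, "2024-01-05", 20.0),
                 (2, 1, "2024-01-06", 19.5),
                 (3, 2, "2024-01-06", 24.5)])

# A query tool consults the schema to know which tables to join:
# fact rows are matched to their dimension, then aggregated.
rows = con.execute("""
    SELECT p.name, SUM(s.amount)
    FROM sales s JOIN product p ON p.product_id = s.product_id
    GROUP BY p.name ORDER BY p.name
""").fetchall()
print(rows)  # [('Gadget', 24.5), ('Widget', 39.5)]
```

Because the schema fixes each column's type up front, the query engine knows how to join and aggregate without inspecting the raw data.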

The benefits of a data warehouse include:

  • Better-informed decision making
  • Consolidation of data from many sources
  • Data quality, consistency, and accuracy
  • Historical intelligence
  • Separation of analytics processing from transactional databases, which improves the performance of both systems


A data warehouse is specially designed for data analytics, which involves reading large amounts of data to understand relationships and trends across the data. A database is used to capture and store data, such as recording details of a transaction.

Characteristics | Data Warehouse | Transactional Database
Suitable workloads | Analytics, reporting, big data | Transaction processing
Data source | Data collected and normalized from many sources | Data captured as-is from a single source, such as a transactional system
Data capture | Bulk write operations, typically on a predetermined batch schedule | Optimized for continuous write operations as new data becomes available, to maximize transaction throughput
Data normalization | Denormalized schemas, such as the star schema or snowflake schema | Highly normalized, static schemas
Data storage | Optimized for simplicity of access and high-speed query performance using columnar storage | Optimized for high-throughput write operations to a single row-oriented physical block
Data access | Optimized to minimize I/O and maximize data throughput | High volumes of small read operations
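The storage row above — columnar versus row-oriented layout — can be illustrated with a small sketch. This is a conceptual toy in plain Python, not how any real engine stores data, and the field names are hypothetical:

```python
# Illustrative only: the same three records in row-oriented and
# column-oriented layouts (field names are hypothetical).
rows = [
    {"order_id": 1, "customer": "A", "amount": 10.0},
    {"order_id": 2, "customer": "B", "amount": 15.0},
    {"order_id": 3, "customer": "A", "amount": 20.0},
]

# Columnar layout: one contiguous list per column.
columns = {
    "order_id": [1, 2, 3],
    "customer": ["A", "B", "A"],
    "amount":   [10.0, 15.0, 20.0],
}

# An analytic aggregate over the columnar layout reads only the
# "amount" column -- this is the I/O saving a warehouse exploits.
total_columnar = sum(columns["amount"])

# The same aggregate over row storage must touch every record in full,
# which suits many small single-record reads and writes instead.
total_rows = sum(r["amount"] for r in rows)

assert total_columnar == total_rows == 45.0
```

Both layouts hold identical data; they differ only in which access pattern is cheap.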

Unlike a data warehouse, a data lake is a centralized repository for all data, both structured and unstructured. A data warehouse uses a pre-defined schema optimized for analytics; in a data lake, the schema is not defined up front, which enables additional types of analytics such as big data analytics, full-text search, real-time analytics, and machine learning.

Characteristics | Data Warehouse | Data Lake
Data | Relational data from transactional systems, operational databases, and line-of-business applications | Non-relational and relational data from IoT devices, websites, mobile apps, social media, and corporate applications
Schema | Designed prior to the data warehouse implementation (schema-on-write) | Applied at the time of analysis (schema-on-read)
Price/Performance | Fastest query results using higher-cost storage | Query results getting faster using low-cost storage
Data quality | Highly curated data that serves as the central version of the truth | Any data, which may or may not be curated (that is, raw data)
Users | Business analysts, data scientists, and data developers | Data scientists, data developers, and business analysts (using curated data)
Analytics | Batch reporting, BI, and visualizations | Machine learning, predictive analytics, data discovery, and profiling
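The schema-on-write versus schema-on-read distinction above can be sketched concretely. This is an illustration only, again using SQLite and raw JSON strings as stand-ins; the record fields are hypothetical:

```python
import json
import sqlite3

# Schema-on-write (warehouse style): column types are fixed before any
# data is loaded, and every row must conform at ingest time.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE events (device_id INTEGER, temperature REAL)")
con.execute("INSERT INTO events VALUES (?, ?)", (7, 21.5))

# Schema-on-read (lake style): raw records are stored as-is -- here,
# JSON strings -- and structure is imposed only when a query reads them.
raw_lines = [
    '{"device_id": 7, "temperature": 21.5}',
    '{"device_id": 9, "humidity": 40}',   # a differently shaped record is fine
]
readings = [json.loads(line).get("temperature") for line in raw_lines]
print(readings)  # [21.5, None]
```

The warehouse rejects data that does not match its schema; the lake accepts anything and defers interpretation to read time, which is what makes exploratory and machine-learning workloads possible over raw data.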

A data mart is a data warehouse that serves the needs of a specific team or business unit, like finance, marketing, or sales. It is smaller, more focused, and may contain summaries of data that best serve its community of users.

Characteristics | Data Warehouse | Data Mart
Scope | Centralized; multiple subject areas integrated together | Decentralized; a specific subject area
Users | Organization-wide | A single community or department
Data source | Many sources | A single source or a few sources, or a portion of data already collected in a data warehouse
Size | Large; can range from hundreds of gigabytes to petabytes | Small; generally up to tens of gigabytes
Design | Top-down | Bottom-up
Data detail | Complete, detailed data | May hold summarized data
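One common way to carve a data mart out of a warehouse is as a filtered, summarized view over the warehouse's detailed tables. The sketch below is illustrative only (SQLite in place of a warehouse engine; the `warehouse_sales` table and `finance_mart` view names are hypothetical):

```python
import sqlite3

# Illustrative only: a finance data mart defined as a view over a wider
# warehouse table. Table, view, and column names are hypothetical.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE warehouse_sales (dept TEXT, region TEXT, amount REAL)")
con.executemany("INSERT INTO warehouse_sales VALUES (?, ?, ?)", [
    ("finance",   "EU", 100.0),
    ("finance",   "US", 250.0),
    ("marketing", "EU",  80.0),
])

# The mart exposes only the finance subject area, pre-summarized by
# region -- smaller and more focused than the full warehouse.
con.execute("""
    CREATE VIEW finance_mart AS
    SELECT region, SUM(amount) AS total
    FROM warehouse_sales
    WHERE dept = 'finance'
    GROUP BY region
""")
mart = con.execute(
    "SELECT region, total FROM finance_mart ORDER BY region"
).fetchall()
print(mart)  # [('EU', 100.0), ('US', 250.0)]
```

In practice a mart may also be a physically separate, smaller store; the view form shown here is just the simplest way to see the scope and summarization contrast.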

AWS lets you take advantage of the core benefits of on-demand computing: access to virtually unlimited storage and compute capacity, the ability to scale your system as the amount of data you collect, store, and query grows, and paying only for the resources you provision. AWS also offers a broad set of managed services that integrate seamlessly with one another, so you can quickly deploy an end-to-end analytics and data warehousing solution.

The following illustration shows the key steps of an end-to-end analytics process chain and the managed services available on AWS for each step:

Analytics Pipeline on AWS

Amazon Redshift is a fast, fully managed, and cost-effective data warehouse that gives you petabyte scale data warehousing and exabyte scale data lake analytics together in one service.

Amazon Redshift is up to ten times faster than traditional on-premises data warehouses. Get unique insights by querying across petabytes of data in Redshift and exabytes of structured data or open file formats in Amazon S3, without the need to move or transform your data.

Redshift costs about one-tenth as much as traditional on-premises data warehouse solutions. You can start small for just $0.25 per hour with no commitments, scale out to petabytes of data for $250 to $333 per uncompressed terabyte per year, and extend analytics to your Amazon S3 data lake for as little as $0.05 for every 10 gigabytes of data scanned.
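As a back-of-the-envelope check, the figures quoted above can be combined for a rough monthly or yearly estimate. This arithmetic uses only the numbers in this page; actual Redshift pricing varies by region, node type, and commitment, so treat it as an illustration:

```python
# Illustrative arithmetic using only the figures quoted above.
# Real pricing varies by region and configuration.
ON_DEMAND_PER_HOUR = 0.25                       # smallest on-demand entry point
PER_TB_YEAR_LOW, PER_TB_YEAR_HIGH = 250, 333    # per uncompressed TB per year
S3_SCAN_PER_10_GB = 0.05                        # per 10 GB scanned in S3

# A small always-on cluster for a ~730-hour month.
hours_per_month = 730
dev_cluster_month = ON_DEMAND_PER_HOUR * hours_per_month       # 182.5

# Storing 100 uncompressed terabytes for a year.
tb_stored = 100
storage_year_low = tb_stored * PER_TB_YEAR_LOW                 # 25000
storage_year_high = tb_stored * PER_TB_YEAR_HIGH               # 33300

# Scanning 500 GB of S3 data-lake files.
gb_scanned = 500
scan_cost = (gb_scanned / 10) * S3_SCAN_PER_10_GB              # 2.5

print(dev_cluster_month, storage_year_low, storage_year_high, scan_cost)
```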