Amazon Redshift Documentation
Integration with your data lake and AWS services
Amazon Redshift lets you work with your data in open formats, and integrates with and connects to other AWS services.
Query and export data to and from your data lake
You can query open file formats such as Parquet, ORC, JSON, Avro, CSV, and more directly in S3 using familiar ANSI SQL. To export data to your data lake you use the Redshift UNLOAD command in your SQL code and specify Parquet as the file format and Redshift takes care of data formatting and data movement into S3. This is designed to give you the flexibility to store highly structured, frequently accessed data in a Redshift data warehouse, while also keeping structured, semi-structured, and unstructured data in S3. Exporting data from Redshift back to your data lake helps you to analyze the data further with AWS services like Amazon Athena, Amazon EMR, and Amazon SageMaker.
With the new federated query capability in Redshift, you can reach into your operational, relational database. Query live data across one or more Amazon RDS and Aurora PostgreSQL to get visibility into the end-to-end business operations without requiring data movement. You can join data from your Redshift data warehouse, data in your data lake, and now data in your operational stores which can help you make better data-driven decisions. Redshift is designed to provide optimizations to reduce data moved over the network and complements it with its parallel data processing for high-performance queries.
Amazon Redshift ML is designed to enable customers to use SQL statements to create and train Amazon SageMaker models on their data in Amazon Redshift and then use those models for predictions such as churn detection and risk scoring directly in their queries and reports.
Native integration with the AWS analytics service set makes it easier to handle end-to-end analytics workflows. For example, AWS Lake Formation is a service that helps set up a secure data lake in days. AWS Glue can extract, transform, and load (ETL) data into Redshift. Amazon Kinesis Data Firehose can help you to capture, transform, and load streaming data into Redshift for analytics. Amazon EMR is designed to process data using Hadoop/Spark and load the output into Amazon Redshift for BI and analytics. Amazon QuickSight is the BI service that you can use to create reports, visualizations, and dashboards on Redshift data. You can use Redshift to prepare your data to run machine learning workloads with Amazon SageMaker. To accelerate migrations to Amazon Redshift, you can use the AWS Schema Conversion tool and the AWS Database Migration Service (DMS). Amazon Redshift is also integrated with Amazon Key Management Service (KMS) and Amazon CloudWatch for security, monitoring, and compliance. You can also use Lambda UDFs to invoke a Lambda function from your SQL queries as if you are invoking a User Defined Function in Redshift.
RA3 instances are designed to improve speed for performance-intensive workloads that require large amounts of compute capacity, with the flexibility to pay separately for compute independently of storage by specifying the number of instances you need.
AQUA (Advanced Query Accelerator) for Amazon Redshift
AQUA is a new distributed and hardware-accelerated cache that enables Redshift to run faster by boosting certain types of queries. AQUA uses solid state storage, field-programmable gate arrays (FPGAs) and AWS Nitro to speed queries that scan, filter, and aggregate large data sets. AQUA is included with the Redshift RA3 instance type.
Storage and query processing
Amazon Redshift is designed to provide fast query performance on datasets of varying sizes. Columnar storage, data compression, and zone maps are designed to reduce the amount of I/O needed to perform queries. Along with the encodings such as LZO and Zstandard, Amazon Redshift also offers compression encoding, AZ64, for numeric and date/time types which can help you achieve both storage savings and optimized query performance.
Amazon Redshift materialized views is designed to help you to achieve faster query performance for iterative or predictable analytical workloads such as dashboarding, and queries from Business Intelligence (BI) tools, and Extract, Load, Transform (ELT) data processing jobs. You can use materialized views to store and manage pre-computed results of a SELECT statement that may reference one or more tables, including external tables. Subsequent queries referencing the materialized views can run faster by reusing the pre-computed results. Amazon Redshift is designed to maintain the materialized views incrementally to continue to provide the low latency performance benefits.
Machine learning to enhance throughput and performance
Advanced machine learning capabilities in Amazon Redshift can help deliver high throughput and performance, even with varying workloads or concurrent user activity. Amazon Redshift utilizes algorithms to predict and classify incoming queries based on their run times and resource requirements to dynamically manage performance and concurrency. Short query acceleration (SQA) sends short queries from applications such as dashboards to an express queue for processing rather than being starved behind large queries. Automatic workload management (WLM) uses machine learning to help dynamically manage memory and concurrency, helping improve query throughput. In addition, you can now set the priority of your most important queries. Amazon Redshift is also designed to be a self-learning system that observes the user workload continuously, determining the opportunities to improve performance as the usage grows, applying optimizations seamlessly, and making recommendations via Redshift Advisor when an explicit user action is needed to further enhance Amazon Redshift performance.
Amazon Redshift uses result caching to deliver fast response times for repeat queries. Dashboard, visualization, and business intelligence tools that execute repeat queries can experience a significant performance boost. When a query executes, Amazon Redshift searches the cache to see if there is a cached result from a prior run. If a cached result is found and the data has not changed, the cached result is returned quickly instead of re-running the query.
Data warehousing at scale
Amazon Redshift is designed to be simple and quickly scale as your needs change. Through the console or with a simple API call, you can change the number or type of nodes in your data warehouse, and scale up or down as your needs change.
Data lake analytics at scale
You can run queries against large amounts of data in Amazon S3 without having to load or transform any data with the Redshift Spectrum feature. You can use S3 as a highly available, secure, and effective data lake to store data in open data formats. Amazon Redshift Spectrum is designed to execute queries across thousands of parallelized nodes to help deliver fast results.
Amazon Redshift is designed to provide consistently fast performance, even with thousands of concurrent queries, whether they query data in your Amazon Redshift data warehouse, or directly in your Amazon S3 data lake. Amazon Redshift Concurrency Scaling supports many concurrent users and concurrent queries with availability by adding transient capacity in as concurrency increases.
Amazon Redshift data sharing can help you scale by sharing live data across Redshift clusters. Data Sharing is designed to improve the agility of organizations by giving fast, granular and high-performance access to data inside any Redshift cluster without the need to copy or move it.
With Amazon Redshift you can deploy a new data warehouse through the AWS console, and Amazon Redshift provisions the infrastructure for you. Most administrative tasks are taken care of, such as backups and replication. When you want control, there are options to help you make adjustments tuned to your specific workloads. New capabilities are released transparently.
Data in Amazon Redshift is backed up to Amazon S3, and Amazon Redshift is designed to replicate your snapshots to S3 in another region for disaster recovery. You can use any system or user snapshot to restore your cluster using the AWS Management Console or the Redshift APIs. Your cluster is available after the system metadata has been restored, and you can start running queries while user data is spooled down in the background.
Amazon Redshift is designed to continuously monitor user workloads and uses sophisticated algorithms to help you find ways to improve the physical layout of data to optimize query speeds. Automatic Table Optimization is designed to select the best sort and distribution keys to optimize performance for the cluster’s workload. If Amazon Redshift determines that applying a key will improve cluster performance, tables will be altered without requiring administrator intervention. Additional features Automatic Vacuum Delete, Automatic Table Sort, and Automatic Analyze are designed to eliminate the need for manual maintenance and tuning of Redshift clusters to get the best performance for new clusters and production workloads.
There are multiple features that can enhance the reliability of your data warehouse cluster. For example, Amazon Redshift is designed to continuously monitor the health of the cluster, and re-replicates data from failed drives and replaces nodes as necessary for fault tolerance. Clusters can also be relocated to alternative Availability Zones (AZ’s).
Amazon Redshift gives you the flexibility to execute queries within the console or connect SQL client tools, libraries, or Business Intelligence tools. The Query Editor on the AWS console provides an interface for executing SQL queries on Amazon Redshift clusters and viewing the query results and query execution plan (for queries executed on compute nodes) adjacent to your queries.
API to interact with Amazon Redshift
Amazon Redshift is designed to enable you to access data with many types of traditional, cloud-native, and containerized, serverless web services-based applications and event-driven applications. The Amazon Redshift Data API can help simplify data access, ingest, and egress from programming languages and platforms supported by the AWS SDK such as Python, Go, Java, Node.js, PHP, Ruby, and C++. The Data API helps eliminate the need for configuring drivers and managing database connections. Instead, you can run SQL commands to an Amazon Redshift cluster by calling a secured API endpoint provided by the Data API. The Data API takes care of managing database connections and buffering data. The Data API is asynchronous, so you can retrieve your results later. Your query results are stored for 24 hours.
Native support for advanced analytics
Redshift supports standard scalar data types such as NUMBER, VARCHAR, and DATETIME and provides native support for the following advanced analytics processing:
Spatial data processing
Amazon Redshift provides a polymorphic data type, GEOMETRY, which supports multiple geometric shapes such as Point, Linestring, Polygon etc. Redshift also provides spatial SQL functions to construct geometric shapes, import, export, access and process the spatial data. You can add GEOMETRY columns to Redshift tables and write SQL queries spanning across spatial and non-spatial data. This capability enables you to store, retrieve, and process spatial data and can enhance your business insights by integrating spatial data into your analytical queries. With Redshift’s ability to query data lakes, you can also extend spatial processing to data lakes by integrating external tables in spatial queries.
HyperLogLog is a novel algorithm that estimates the approximate number of distinct values in a data set. HLL sketch is a construct that encapsulates the information about the distinct values in the data set. Redshift provides datatype HLLSKETCH and associated SQL functions to generate, persist, and combine HyperLogLog sketches. The Amazon Redshift's HyperLogLog capability uses bias correction techniques and is designed to provide high accuracy with low memory footprint.
DATE & TIME data types
Amazon Redshift is designed to provide multiple data types DATE, TIME, TIMETZ, TIMESTAMP and TIMESTAMPTZ to natively store and process data/time data. TIME and TIMESTAMP types store the time data without time zone information, whereas TIMETZ and TIMESTAMPTZ types store the time data including the timezone information. You can use various date/time SQL functions to process the date and time values in Redshift queries.
Semi-structured data processing
The Amazon Redshift SUPER data type is designed to natively stores JSON and other semi-structured data in Redshift tables, and uses the PartiQL query language to process the semi-structured data. The SUPER data type is schemaless in nature and allows storage of nested values that may contain Redshift scalar values, nested arrays and nested structures. PartiQL is an extension of SQL and provides querying capabilities such as object and array navigation, unnesting of arrays, dynamic typing, and schemaless semantics. This can help you to achieve advanced analytics that combine the classic structured SQL data with the semi-structured SUPER data.
Security and compliance
With the implementation of parameter settings, you can set up Amazon Redshift to use SSL to secure data in transit, and hardware-accelerated AES-256 encryption for data at rest. Amazon Redshift takes care of key management by default.
Amazon Redshift enables you to configure firewall rules to control network access to your data warehouse cluster. You can run Redshift inside Amazon Virtual Private Cloud (VPC) to isolate your data warehouse cluster in your own virtual network and connect it to your existing IT infrastructure using encrypted IPsec VPN.
Audit and compliance
Amazon Redshift integrates with AWS CloudTrail to enable you to audit your Redshift API calls. Redshift logs all SQL operations, including connection attempts, queries, and changes to your data warehouse. You can access these logs using SQL queries against system tables, or choose to save the logs to a secure location in Amazon S3.
Amazon Lambda user-defined functions (UDFs) enable you to use an AWS Lambda function as a UDF in Amazon Redshift and invoke it from Redshift SQL queries. This functionality can help you to write custom extensions for your SQL query to achieve tighter integration with other services or third-party products. You can write Lambda UDFs to enable external tokenization, data masking, identification or de-identification of data by integrating with vendors like Protegrity.
Granular access controls
Granular row and column level security controls are designed so that users see only the data they should have access to. Amazon Redshift is integrated with AWS Lake Formation so that Lake Formation’s column level access controls are also enforced for Redshift queries on the data in the data lake.
AQUA (Advanced Query Accelerator) for Amazon Redshift
AQUA (Advanced Query Accelerator) is a new distributed and hardware-accelerated cache that enables Amazon Redshift to run faster than many other enterprise cloud data warehouses by boosting certain types of queries. AQUA is included with the with the Redshift RA3 ra3.4xl and ra3.16xl node types.
A new approach to data warehousing
AQUA takes a new approach to cloud data warehousing. AQUA brings compute to storage by doing a share of data processing in place on the innovative cache. It uses AWS-designed processors and a scale-out architecture to help accelerate data processing.
Removes networking bandwidth limitations
AQUA is designed to accelerate Redshift queries by running data intensive tasks such as scans, filtering, and aggregation closer to the storage layer. This can help avoid networking bandwidth limitations by eliminating unnecessary data movement between data storage and compute clusters.
Powered by AWS-designed processors
AQUA uses AWS-designed processors with AWS Nitro chips designed to speed up data encryption and compression, and custom analytics processors, implemented in FPGAs, designed to accelerate operations such as scans, filtering, and aggregation.
AQUA is designed to process large amounts of data in parallel across multiple nodes, and scales out to add more capacity as your storage needs grow over time.
Amazon Redshift Concurrency Scaling
Analytics workloads can be highly unpredictable resulting in slower query performance and users competing for resources.
The Concurrency Scaling feature is designed to support thousands of concurrent users and concurrent queries, with consistently fast query performance. As concurrency increases, Amazon Redshift adds query processing power to process queries. Once the workload demand subsides, this extra processing power is removed.
Concurrency Scaling is designed to help you:
- Get consistently fast performance for thousands of concurrent queries and users.
- Allocate the clusters to specific user groups and workloads, and control the number of clusters that can be used.
- Continue to use your existing applications and Business Intelligence tools.
Amazon Redshift Data Sharing
Amazon Redshift data sharing is designed to extend the benefits of Amazon Redshift to multi-cluster deployments while being able to share data. Data sharing enables granular and fast data access across Amazon Redshift clusters without the need to copy or move it. Data sharing is designed to provide live access to data so that your users can see the most up-to-date and consistent information as it’s updated in the data warehouse. You can share live data with Amazon Redshift clusters in the same or different AWS accounts.
Amazon Redshift data sharing is designed to provide:
- A simple and direct way to share data across Amazon Redshift data warehouses
- Fast, granular, and high performance access without data copies and data movement.
- Live and transactionally consistent views of data across all consumers.
- Secure and governed collaboration within and across organizations and external parties.
Data sharing builds on Amazon Redshift RA3 managed storage, which is designed to decouple storage and compute, allowing either of them to scale independently. With data sharing, workloads accessing shared data are isolated from each other. Queries accessing shared data run on the consumer cluster and read data from the Amazon Redshift managed storage layer directly without impacting the performance of the producer cluster. Workloads accessing shared data can be provisioned with flexible compute resources that meet their workload-specific requirements and be scaled independently as needed in a self-service fashion.
Amazon Redshift RA3 instances with managed storage
With Amazon Redshift RA3 instances with managed storage, you can choose the number of nodes based on your performance requirements. Built on the AWS Nitro System, RA3 instances with managed storage use high performance SSDs for your hot data and Amazon S3 for your cold data.
The new RA3 instances with managed storage are designed to:
- Allow you to pay per hour for the compute and separately scale data warehouse storage capacity without adding any additional compute resources and paying only for what you use.
- Include AQUA, the new distributed and hardware accelerated cache that enables Redshift to run faster than many other cloud data warehouses by boosting certain types of queries.
- Use fine-grained data eviction and intelligent data pre-fetching to deliver the fast performance, while scaling storage to S3.
- Feature high bandwidth networking that can help reduce the time for data to be offloaded to and retrieved from Amazon S3.
Amazon Redshift ML
Amazon Redshift ML can help data analysts and database developers to create, train, and apply machine learning models using familiar SQL commands in Amazon Redshift data warehouses. With Redshift ML, you can take advantage of Amazon SageMaker, a managed machine learning service, without learning new tools or languages. Simply use SQL statements to create and train Amazon SageMaker machine learning models using your Redshift data and then use these models to make predictions.
Because Redshift ML allows you to use standard SQL, this can help you to be productive with new use cases for your analytics data. Redshift ML provides integration between Redshift and Amazon SageMaker and enables inference within the Redshift cluster, so you can use predictions generated by ML-based models in queries and applications. There is no need to manage a separate inference model end point, and the training data is secured end-to-end with encryption.
Use ML on your Redshift data using standard SQL
To get started, use the CREATE MODEL SQL command in Redshift and specify training data either as a table or SELECT statement. Redshift ML is designed to then compiles and imports the trained model inside the Redshift data warehouse and prepare a SQL inference function that can be immediately used in SQL queries. Redshift ML handles all the steps needed to train and deploy a model.
Predictive analytics with Amazon Redshift
With Redshift ML, you can embed predictions like fraud detection, risk scoring, and churn prediction directly in queries and reports. Use the SQL function to apply the ML model to your data in queries, reports, and dashboards.
Bring your own model (BYOM)
Redshift ML supports using BYOM for local or remote inference. You can use a model trained outside of Redshift with Amazon SageMaker for in-database inference local in Amazon Redshift. You can import SageMaker Autopilot and direct Amazon SageMaker trained models for local inference. Alternatively, you can invoke remote custom ML models deployed in remote SageMaker endpoints. You can use any SageMaker ML model that accepts and returns text or CSV for remote inference.
For additional information about service controls, security features and functionalities, including, as applicable, information about storing, retrieving, modifying, restricting, and deleting data, please see https://docs.aws.amazon.com/index.html. This additional information does not form part of the Documentation for purposes of the AWS Customer Agreement available at http://aws.amazon.com/agreement, or other agreement between you and AWS governing your use of AWS’s services.