Q: What is Amazon Redshift?
Amazon Redshift is a fully managed, scalable cloud data warehouse that accelerates your time to insights with fast, easy, and secure analytics at scale. Thousands of customers rely on Amazon Redshift to analyze data from terabytes to petabytes and run complex analytical queries. You can get real-time insights and predictive analytics on all your data across your operational databases, data lake, data warehouse, and third-party datasets. Amazon Redshift delivers all this at a price performance that’s up to three times better than other cloud data warehouses out of the box, helping you keep your costs predictable.
Amazon Redshift Serverless makes it easy for you to run petabyte-scale analytics in seconds to get rapid insights without having to configure and manage your data warehouse clusters. Amazon Redshift Serverless automatically provisions and scales the data warehouse capacity to deliver high performance for demanding and unpredictable workloads, and you pay only for the resources you use.
Q: What are the top reasons customers choose Amazon Redshift?
Thousands of customers choose Amazon Redshift to accelerate their time to insights because it’s easy to use, it delivers performance at any scale, and it lets you analyze all your data. Amazon Redshift is a fully managed service and offers both provisioned and serverless options, making it easy for you to run and scale analytics without having to manage your data warehouse. You can choose the provisioned option for predictable workloads, or the Amazon Redshift Serverless option to automatically provision and scale data warehouse capacity for demanding and unpredictable workloads. It delivers performance at any scale, with up to 3 times better price performance than other cloud data warehouses out of the box, helping you keep your costs predictable. Amazon Redshift lets you get insights from running real-time and predictive analytics on all your data across your operational databases, data lake, data warehouse, and thousands of third-party datasets. Amazon Redshift keeps your data secure at rest and in transit, helping you meet internal and external compliance requirements, and is compliant with SOC 1, SOC 2, SOC 3, and PCI DSS Level 1. All Redshift security and compliance features are included at no additional cost.
Q: How does Amazon Redshift simplify data warehouse management?
Amazon Redshift is fully managed by AWS so you no longer need to worry about data warehouse management tasks such as hardware provisioning, software patching, setup, configuration, monitoring nodes and drives to recover from failures, or backups. AWS manages the work needed to set up, operate, and scale a data warehouse on your behalf, freeing you to focus on building your applications. Amazon Redshift also has automatic tuning capabilities, and surfaces recommendations for managing your warehouse in Redshift Advisor. For Redshift Spectrum, Amazon Redshift manages all the computing infrastructure, load balancing, planning, scheduling, and execution of your queries on data stored in Amazon S3. The serverless option automatically provisions and scales the data warehouse capacity to deliver high performance for demanding and unpredictable workloads, and you pay only for the resources you use.
Q: How does the performance of Amazon Redshift compare to that of other data warehouses?
TPC-DS benchmark results show that Amazon Redshift provides the best price performance out of the box, even for a comparatively small 3 TB dataset. Amazon Redshift delivers up to 3 times better price performance than other cloud data warehouses. This means that you can benefit from Amazon Redshift’s leading price performance from the start without manual tuning. For details, see the AWS Big Data Blog post “Get up to 3x better price performance with Amazon Redshift than with other cloud data warehouses.”
Amazon Redshift uses a variety of innovations to achieve up to 10 times better performance than traditional databases for data warehousing and analytics workloads, including efficient read-optimized columnar compressed data storage with massively parallel processing (MPP) compute clusters that scale linearly to hundreds of nodes. Instead of storing data as a series of rows, Amazon Redshift organizes the data by column. When loading data into an empty table, Amazon Redshift automatically samples your data and selects the most appropriate compression scheme.
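The intuition behind columnar storage and automatic compression can be sketched in a few lines of Python. This is a simplified illustration only, using run-length encoding on a toy column; Redshift’s actual encodings (such as AZ64, LZO, and ZSTD) are more sophisticated, and the column data here is made up:

```python
# Simplified illustration of why column-oriented layout compresses well.
# Not Redshift's actual encoding logic.

def run_length_encode(values):
    """Collapse consecutive repeated values into (value, count) pairs."""
    encoded = []
    for v in values:
        if encoded and encoded[-1][0] == v:
            encoded[-1] = (v, encoded[-1][1] + 1)
        else:
            encoded.append((v, 1))
    return encoded

# In row storage, a "country" field is interleaved with other fields.
# Stored as its own column, long runs of identical values appear and
# compress to a handful of pairs.
country_column = ["US"] * 1000 + ["DE"] * 500 + ["JP"] * 250

encoded = run_length_encode(country_column)
print(len(country_column))  # 1750 raw values
print(encoded)              # [('US', 1000), ('DE', 500), ('JP', 250)]
```

The same idea underlies why sampling the data first (as Redshift does on load into an empty table) helps pick an encoding: a column’s value distribution determines which scheme shrinks it most.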
Redshift Spectrum lets you run queries against exabytes of data in Amazon S3. There is no loading or ETL required. Even if you don’t store any of your data in Amazon Redshift, you can still use Redshift Spectrum to query datasets as large as an exabyte in Amazon S3. Materialized views provide significantly faster query performance for repeated and predictable analytical workloads such as dashboards, queries from business intelligence (BI) tools, and ELT (Extract, Load, Transform) data processing. Using materialized views, you can store the precomputed results of queries and efficiently maintain them by incrementally processing the latest changes made to the source tables. Subsequent queries referencing the materialized views use the precomputed results to run much faster, and automatic refresh and query rewrite capabilities simplify and automate the use of materialized views.
The compute and storage capacity of on-premises data warehouses are limited by the constraints of the on-premises hardware. Amazon Redshift gives you the ability to scale compute and storage independently as needed to meet changing workloads. With Redshift Managed Storage (RMS), you now have the ability to scale your storage to petabytes using Amazon S3 storage.
Automatic Table Optimization (ATO) is a self-tuning capability that helps you achieve the performance benefits of optimal sort and distribution keys without manual effort. ATO observes how queries interact with tables and uses machine learning (ML) to select the best sort and distribution keys to optimize performance for the cluster’s workload. ATO optimizations have been shown to increase cluster performance by 24% and 34% on the 3 TB and 30 TB TPC-DS benchmarks, respectively, versus a cluster without ATO. Additional features such as Automatic Vacuum Delete, Automatic Table Sort, and Automatic Analyze eliminate the need for manual maintenance and tuning of Redshift clusters to get the best performance for new clusters and production workloads.
Workload management (WLM) allows you to route queries to a set of defined queues to manage the concurrency and resource utilization of the cluster. Amazon Redshift offers both automatic and manual WLM configurations. With manual WLM, you’re responsible for defining the amount of memory allocated to each queue and the maximum number of queries that can run concurrently in each queue, with each query receiving a fraction of the queue’s memory. Manual WLM configurations don’t adapt to changes in your workload and require intimate knowledge of your queries’ resource utilization to get right. Amazon Redshift Auto WLM doesn’t require you to define memory utilization or concurrency for queues; instead, it adjusts concurrency dynamically to optimize for throughput. Optionally, you can define queue priorities to give queries preferential resource allocation based on your business priorities. Auto WLM also provides powerful tools to manage your workload. Query priorities let you define which workloads get preferential treatment in Amazon Redshift, including more resources during busy times for consistent query performance, and query monitoring rules offer ways to manage unexpected situations, such as detecting and stopping runaway or expensive queries before they consume excessive system resources. Key areas of Auto WLM with adaptive concurrency performance improvements include proper allocation of memory, elimination of static partitioning of memory between queues, and improved throughput.
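The priority-based dispatch idea behind query priorities can be illustrated with a toy scheduler in Python. This is a conceptual sketch only, not Redshift’s Auto WLM implementation; the priority names mirror the levels Redshift exposes (highest through lowest), and the query labels are made up:

```python
import heapq

# Toy sketch of priority-based workload scheduling (not Redshift's
# actual Auto WLM): higher-priority queries are dispatched first
# when execution slots free up.

PRIORITY = {"highest": 0, "high": 1, "normal": 2, "low": 3, "lowest": 4}

class Scheduler:
    def __init__(self):
        self._heap = []
        self._counter = 0  # preserves FIFO order within a priority level

    def submit(self, query, priority="normal"):
        heapq.heappush(self._heap, (PRIORITY[priority], self._counter, query))
        self._counter += 1

    def next_query(self):
        """Pop the highest-priority (then oldest) waiting query."""
        return heapq.heappop(self._heap)[2]

s = Scheduler()
s.submit("nightly ETL", priority="low")
s.submit("executive dashboard", priority="highest")
s.submit("ad-hoc report")  # defaults to normal

print(s.next_query())  # executive dashboard
print(s.next_query())  # ad-hoc report
print(s.next_query())  # nightly ETL
```

Submission order doesn’t decide execution order here; priority does, with arrival time breaking ties. That is the essence of giving business-critical workloads preferential treatment during busy periods.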
Amazon Redshift Advisor develops customized recommendations to increase performance and optimize costs by analyzing your workload and usage metrics for your cluster. Sign in to the Amazon Redshift console to view Advisor recommendations. For more information, see Working with recommendations from Amazon Redshift Advisor.
Q: How do I get started with Amazon Redshift?
With just a few clicks in the AWS Management Console, you can start querying data. You can take advantage of preloaded sample datasets, including the TPC-H and TPC-DS benchmark datasets, along with sample queries to kick-start analytics immediately. You can create databases, schemas, and tables, and load data from Amazon S3 or Amazon Redshift data shares, or restore from an existing Amazon Redshift provisioned cluster snapshot. You can also directly query data in open formats, such as Parquet or ORC, in your Amazon S3 data lake, or query data in operational databases such as Amazon Aurora and Amazon RDS for PostgreSQL and MySQL.
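As a concrete example of loading from Amazon S3, the COPY command is the usual route. The sketch below assembles a COPY statement in Python; the table name, bucket path, and IAM role ARN are placeholders to substitute with your own, and the resulting SQL would be run through any Redshift client (Query Editor v2, a JDBC/ODBC tool, and so on):

```python
def build_copy_statement(table, s3_path, iam_role, fmt="PARQUET"):
    """Assemble a Redshift COPY statement for loading from Amazon S3.

    All identifiers below are placeholders for illustration; substitute
    your own table, bucket, and IAM role.
    """
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"FORMAT AS {fmt};"
    )

sql = build_copy_statement(
    "sales",
    "s3://example-bucket/sales/",                     # placeholder bucket
    "arn:aws:iam::123456789012:role/MyRedshiftRole",  # placeholder role
)
print(sql)
```

The IAM role grants the cluster read access to the bucket, which is why the statement carries the role ARN rather than any credentials.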
To get started with Amazon Redshift Serverless, choose “Try Amazon Redshift Serverless” and start querying data. Amazon Redshift Serverless automatically scales to meet any increase in workloads.
Q: What is Advanced Query Accelerator (AQUA) for Amazon Redshift?
Advanced Query Accelerator (AQUA) is a new distributed and hardware-accelerated cache that enables Amazon Redshift to run up to 10 times faster than other enterprise cloud data warehouses by automatically boosting certain types of queries. AQUA is available with the RA3.16xlarge, RA3.4xlarge, or RA3.xlplus nodes at no additional charge and with no code changes.
Q: How do I enable/disable AQUA for my Redshift data warehouse?
For Redshift clusters running on RA3 nodes, you can enable or disable AQUA at the cluster level using the Redshift console, AWS Command Line Interface (CLI), or API. For Redshift clusters running on DC, DS, or older-generation nodes, you must first upgrade to RA3 nodes before you can enable AQUA.
Q: What types of queries are accelerated by AQUA?
AQUA accelerates analytics queries by running data-intensive tasks such as scans, filtering, and aggregation closer to the storage layer. You’ll see the most noticeable performance improvement on queries that require large scans, especially those with LIKE and SIMILAR TO predicates. Over time, the types of queries accelerated by AQUA will increase.
Q: How do I know which queries on my Redshift cluster are accelerated by AQUA?
You can query the system tables to see the queries accelerated by AQUA.
Q: What is Amazon Redshift managed storage?
Amazon Redshift managed storage is available with serverless and RA3 node types and lets you scale and pay for compute and storage independently so you can size your cluster based only on your compute needs. It automatically uses high-performance SSD-based local storage as tier-1 cache and takes advantage of optimizations such as data block temperature, data block age, and workload patterns to deliver high performance while scaling storage automatically to Amazon S3 when needed without requiring any action.
Q: How do I use Amazon Redshift’s managed storage?
If you are already using Amazon Redshift Dense Storage or Dense Compute nodes, you can use Elastic Resize to upgrade your existing clusters to the new RA3 compute instances. Amazon Redshift Serverless and clusters using RA3 instances automatically use Redshift-managed storage to store data. No action other than using Amazon Redshift Serverless or RA3 instances is required to use this capability.
Q: What is Amazon Redshift Spectrum?
Amazon Redshift Spectrum is a feature of Amazon Redshift that lets you run queries against your data lake in Amazon S3, with no data loading or ETL required. When you issue an SQL query, it goes to the Amazon Redshift endpoint, which generates and optimizes a query plan. Amazon Redshift determines what data is local and what is in Amazon S3, generates a plan to minimize the amount of S3 data that needs to be read, and requests Amazon Redshift Spectrum workers out of a shared resource pool to read and process data from S3.
Q: When should I consider using RA3 instances?
Consider choosing RA3 node types in these cases:
- You need the flexibility to scale and pay for compute separate from storage.
- You query a fraction of your total data.
- Your data volume is growing rapidly or is expected to grow rapidly.
- You want the flexibility to size the cluster based only on your performance needs.
As the scale of data continues to grow, reaching petabytes, the amount of data you ingest into your Amazon Redshift data warehouse is also growing. You may be looking for ways to cost-effectively analyze all your data.
With new Amazon Redshift RA3 instances with managed storage, you can choose the number of nodes based on your performance requirements, and pay only for the managed storage that you use. This gives you the flexibility to size your RA3 cluster based on the amount of data you process daily without increasing your storage costs. Built on the AWS Nitro System, RA3 instances with managed storage use high performance SSDs for your hot data and Amazon S3 for your cold data, providing ease of use, cost-effective storage, and fast query performance.
Q: When would I use Amazon Redshift vs. Amazon RDS?
Both Amazon Redshift and Amazon Relational Database Service (Amazon RDS) let you run traditional relational databases in the cloud while offloading database administration. Customers use Amazon RDS databases primarily for online transaction processing (OLTP) workloads, while Amazon Redshift is used primarily for reporting and analytics. OLTP workloads, which require quick queries for specific information and support for transactions such as inserts, updates, and deletes, are best handled by Amazon RDS. Amazon Redshift harnesses the scale and resources of multiple nodes and uses a variety of optimizations to provide order-of-magnitude improvements over traditional databases for analytics and reporting workloads against very large datasets. Amazon Redshift provides an excellent scale-out option as your data and query complexity grow, and it helps prevent your reporting and analytics processing from interfering with the performance of your OLTP workload. With the Federated Query feature, you can also easily query data in your Amazon RDS or Aurora databases directly from Amazon Redshift.
Q: When would I use Amazon Redshift or Redshift Spectrum vs. Amazon EMR?
You should use Amazon EMR if you use custom code to process and analyze extremely large datasets with big data processing frameworks such as Apache Spark, Hadoop, Presto, or HBase. Amazon EMR gives you full control over the configuration of your clusters and the software you install on them.
Data warehouses like Amazon Redshift are designed for a different type of analytics altogether. Data warehouses are designed to pull together data from lots of different sources, like inventory, financial, and retail sales systems. In order to ensure that reporting is consistently accurate across the entire company, data warehouses store data in a highly structured fashion. This structure builds data consistency rules directly into the tables of the database. Amazon Redshift is the best service to use when you need to perform complex queries on massive collections of structured and semi-structured data and get fast performance.
While the Redshift Spectrum feature is great for running queries against data in Amazon Redshift and S3, it really isn’t a fit for the types of use cases that enterprises typically ask from processing frameworks like Amazon EMR. Amazon EMR goes far beyond just running SQL queries. Amazon EMR is a managed service that lets you process and analyze extremely large data sets using the latest versions of popular big data processing frameworks, such as Spark, Hadoop, and Presto, on fully customizable clusters. With Amazon EMR, you can run a wide variety of scale-out data processing tasks for applications such as machine learning, graph analytics, data transformation, streaming data, and virtually anything you can code.
You can use Redshift Spectrum together with Amazon EMR. Redshift Spectrum can use the same Apache Hive Metastore as Amazon EMR to locate data and table definitions. If you’re using Amazon EMR and already have a Hive Metastore, you just have to configure your Amazon Redshift cluster to use it, and you can start querying that data right away. So if you’re already using Amazon EMR to process a large data store, you can use Redshift Spectrum to query that data at the same time without interfering with your Amazon EMR jobs.
Query services, data warehouses, and complex data processing frameworks all have their place, and they are used for different things. You just need to choose the right tool for the job.
Q: When should I use Amazon Athena vs. Amazon Redshift Spectrum?
Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is easy to use. Simply point to your data in S3, define the schema, and start querying using standard SQL.
Redshift Spectrum is a feature of Amazon Redshift. If you need to analyze frequently accessed data with the highest performance and strict service-level agreement (SLA) requirements, you should use Amazon Redshift. You can use Redshift Spectrum to extend your Amazon Redshift queries out to less frequently accessed data in your Amazon S3 data lake. This gives you the freedom to store your data where you want, in the format you want, and have it available for processing when you need it.
Q: Why should I use Amazon Redshift instead of running my own MPP data warehouse cluster on Amazon EC2?
- Setup: With Amazon Redshift, you simply create a data warehouse cluster, define your schema, and begin loading and querying your data. You don’t have to manage provisioning, configuration or patching.
- Data Durability: Amazon Redshift replicates your data within your data warehouse cluster and continuously backs up your data to Amazon S3, which is designed for eleven nines of durability. Amazon Redshift mirrors each drive's data to other nodes within your cluster. If a drive fails, your queries continue with a slight latency increase while Redshift rebuilds the drive from replicas. In case of node failures, Amazon Redshift automatically provisions new nodes and begins restoring data from other drives within the cluster or from Amazon S3. It prioritizes restoring your most frequently queried data, so your most frequently run queries become performant again quickly.
- Scaling: You can add or remove nodes from your Amazon Redshift data warehouse cluster with a single API call or via a few clicks in the AWS Management Console as your capacity and performance needs change. You can also schedule your scaling and resize operations by using the scheduler capability in Amazon Redshift.
- Automatic Updates and Patching: Amazon Redshift automatically applies upgrades and patches your data warehouse so you can focus on your application and not on its administration.
- Exabyte Scale Query Capability: Amazon Redshift Spectrum enables you to run queries against exabytes of data in Amazon S3. There is no loading or ETL required. Even if you don’t store any of your data in Amazon Redshift, you can still use Redshift Spectrum to query datasets as large as an exabyte in Amazon S3.
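The restore prioritization described under Data Durability, fetching the most frequently queried data first, can be sketched as a toy model. This is an illustration of the ordering idea only, not Redshift’s internals; the block names and query counts are made up:

```python
# Toy model (not Redshift internals): after a node failure, restore
# the most frequently queried data blocks first, so hot queries
# regain full performance as early as possible.

def restore_order(blocks):
    """blocks: {block_id: recent_query_count}. Hottest blocks first."""
    return sorted(blocks, key=blocks.get, reverse=True)

usage = {"orders_2024": 950, "logs_2019": 3, "customers": 410}
print(restore_order(usage))  # ['orders_2024', 'customers', 'logs_2019']
```

Ordering the restore by access frequency means the cluster is useful for its common queries long before the full restore completes.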
Q: How do I create and access an Amazon Redshift data warehouse cluster?
You can easily create an Amazon Redshift data warehouse cluster by using the AWS Management Console or the Amazon Redshift APIs. You can start with a single-node, 160 GB data warehouse and scale all the way to petabytes or more with a few clicks in the AWS Management Console or a single API call.
The single node configuration, which is best suited for evaluation or development/test workloads, lets you get started with Amazon Redshift quickly and cost effectively and scale up to a multi-node configuration as your needs grow. A Redshift data warehouse cluster can contain 1–128 compute nodes, depending on the node type. For the latest generation node type, RA3, the minimum number of nodes is two. For details, see the documentation.
The multi-node configuration requires a leader node that manages client connections and receives queries, and at least two compute nodes that store data and perform queries and computations. The leader node, which is the same size as a compute node, is provisioned for you automatically, and you are not charged for it.
Simply specify your preferred Availability Zone (optional), the number of nodes, node types, a master user name and password, security groups, your preferences for backup retention, and other system settings. Once you've chosen your desired configuration, Amazon Redshift provisions the required resources and sets up your data warehouse cluster.
Once your data warehouse cluster is available, you can retrieve its endpoint and JDBC and ODBC connection string from the AWS Management Console or by using the Redshift APIs. You can then use this connection string with your favorite database tool, programming language, or Business Intelligence (BI) tool. You will need to authorize network requests to your running data warehouse cluster. For a detailed explanation, please refer to our Getting Started Guide.
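For illustration, a Redshift JDBC connection string follows the documented jdbc:redshift://endpoint:port/database format. The helper below assembles one; the endpoint shown is a placeholder (yours appears in the console), and 5439 is Redshift’s default port:

```python
def jdbc_url(endpoint, port=5439, database="dev"):
    """Build a Redshift JDBC URL from the cluster endpoint shown in
    the AWS Management Console. 5439 is Redshift's default port; the
    endpoint used below is a placeholder."""
    return f"jdbc:redshift://{endpoint}:{port}/{database}"

url = jdbc_url("examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com")
print(url)
# jdbc:redshift://examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com:5439/dev
```

The same endpoint, port, and database name are what an ODBC DSN or any SQL client would use; only the URL syntax differs between drivers.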
Q: Why should I use Amazon Redshift Spatial?
Amazon Redshift Spatial provides location-based analytics for rich insights into your data. It seamlessly integrates spatial and business data to provide analytics for decision making. Amazon Redshift launched native spatial data processing support in November 2019, with a polymorphic GEOMETRY data type and several key SQL spatial functions. We now support the GEOGRAPHY data type, and our library of SQL spatial functions has grown to 80. We support all the common spatial data types and standards, including shapefiles, GeoJSON, WKT, WKB, eWKT, and eWKB. To learn more, visit the documentation page or the Amazon Redshift spatial tutorial page.
Q: What is cold query performance enhancement, and what does Amazon Redshift do to enhance cold query performance?
Amazon Redshift can process queries up to two times faster when they need to be compiled. This improvement gives you better query performance when you create a new Redshift cluster, onboard a new workload on an existing cluster, or update an existing cluster’s software. These query performance improvements are available at no additional cost, and no action is needed to enable them on your clusters.
With cold query performance enhancement, query compilation is scaled out to a serverless compilation service beyond the compute resources of your cluster’s leader node. Amazon Redshift also supports an unlimited cache for compiled objects, increasing the cache hit rate from 99.60% to 99.95% when your mission-critical queries are submitted to Amazon Redshift.
When queries are sent to Amazon Redshift, the query execution engine compiles the query into machine code and distributes it to the cluster nodes. The compiled code runs faster because it eliminates the overhead of using an interpreter. For a new cluster with no code cache or after an existing cluster is upgraded with the latest release, code cache is flushed, and queries must undergo query compilation. As a result, the latency of a query may vary, which may not meet the requirements of some workloads. With this update, unlimited cache minimizes the need to compile code, and when compilation is needed, a scalable compilation farm compiles it in parallel to speed up your workloads. The magnitude of increased speed depends on the workload’s complexity and concurrency. To learn more about code compilation, see Query Processing in the Database developer guide.
Q: What is Amazon Redshift Serverless (preview)?
Amazon Redshift Serverless (preview) is a serverless option of Amazon Redshift that makes it easy to run and scale analytics in seconds without the need to set up and manage data warehouse infrastructure. With Redshift Serverless, any user—including data analysts, developers, business professionals, and data scientists—can get insights from data by simply loading and querying data in the data warehouse.
Q: How do I get started with Amazon Redshift Serverless (preview)?
With just a few clicks in the AWS Management Console, you can choose "configure Amazon Redshift Serverless" and begin querying data. You can take advantage of preloaded sample datasets, such as weather data, census data, and benchmark datasets, along with sample queries to kick start analytics immediately. You can create databases, schemas, tables, and load data from Amazon S3, Amazon Redshift data shares, or restore from an existing Redshift provisioned cluster snapshot. You can also directly query data in open formats (such as Parquet or ORC) in the Amazon S3 data lake, or query data in operational databases, such as Amazon Aurora and Amazon RDS PostgreSQL and MySQL.
Q: What capabilities does Amazon Redshift Serverless (preview) provide?
Amazon Redshift Serverless offers you numerous benefits, including:
- The ability to gain insights rapidly without provisioning and managing clusters.
- Intelligent and automatic scaling based on workload demands without having to over-provision resources.
- Continuous service availability for scaling and version updates.
- Fast, out-of-the-box query performance for data loaded in the data warehouse, open formats in the Amazon S3 data lake, and data in operational databases, without requiring database tuning.
- Rich SQL analytics, durability, and transactional guarantees of Amazon Redshift.
- Cost efficiency by paying only for the capacity used and reduced data warehouse complexity.
Q: What are the benefits of using Amazon Redshift Serverless (preview)?
If you don't have data warehouse management experience, you don’t have to worry about setting up, configuring, and managing clusters or tuning the warehouse. You can focus on deriving meaningful insights from your data or delivering on your core business outcomes through data. You pay only for what you use, keeping costs manageable. You continue to benefit from all of Amazon Redshift’s top-notch performance, rich SQL features, seamless integration with data lakes and operational databases, and built-in predictive analytics and data sharing capabilities. If you need fine-grained control of your data warehouse, you can provision Redshift clusters.
Q: How does Amazon Redshift Serverless (preview) work with other AWS services?
You can continue to use all the rich analytics functionality of Amazon Redshift, such as complex joins, direct queries to data in the Amazon S3 data lake and operational databases, materialized views, stored procedures, semi-structured data support, and ML, as well as high performance at scale. All the related services that Amazon Redshift integrates with (such as Amazon Kinesis, AWS Lambda, Amazon QuickSight, Amazon SageMaker, Amazon EMR, AWS Lake Formation, and AWS Glue) continue to work with Amazon Redshift Serverless.
Q: What use cases can I handle with Amazon Redshift Serverless (preview)?
You can continue to run all analytics use cases. With a simple getting started workflow, automatic scaling, and the ability to pay for use, the Amazon Redshift Serverless experience now makes it even easier and more cost-effective to run development and test environments that need to get started quickly, ad-hoc business analytics, workloads with varying and unpredictable compute needs, and intermittent or sporadic workloads.
Q: How is Amazon Athena different from Amazon Redshift Serverless?
Amazon Athena and Amazon Redshift address different needs and use cases even if both services are serverless. A data warehouse such as Amazon Redshift is the best choice if you need the best price performance for complex BI and analytics workloads that require high performance at any scale. Amazon Redshift also provides the capability to query data stored in Amazon S3 and combine with data stored in the data warehouse. By comparison, Athena is better suited for interactive analysis on any data store without worrying about ingesting and formatting data. Athena analysis is decoupled from storage, so it gives you the flexibility to use other tools and services such as Spark, Flink, and Kafka to further enrich analysis and data processing on the same data analyzed by Athena.
Q: What is Amazon Redshift data sharing?
Amazon Redshift data sharing lets you securely and easily share live data for read purposes with other Redshift clusters within and across AWS accounts, and with AWS analytics services using the data lake. With data sharing, you can instantly query live data from any Redshift cluster, as long as you have permission to access it, without the complexity and delays associated with data copies and data movement. Amazon Redshift lets you share and query live data across your organization, accounts, and even Regions.
Q: What are the use cases for data sharing?
Key use cases include:
- A central ETL cluster sharing data with many BI/analytics clusters to provide read workload isolation and optional charge-ability.
- A data provider sharing data to external consumers.
- Sharing common datasets, such as customer and product data, across different business groups and collaborating for broad analytics and data science.
- Decentralizing a data warehouse to simplify management.
- Sharing data between development, test, and production environments.
- Accessing Redshift data from other AWS analytic services.
Q: What are cross-database queries in Amazon Redshift?
With cross-database queries, you can seamlessly query and join data from any Redshift database that you have access to, regardless of which database you are connected to. This can include databases local on the cluster and also shared datasets made available from remote clusters. Cross-database queries give you flexibility to organize data as separate databases to support multi-tenant configurations.
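Cross-database queries use three-part database.schema.table notation. A small sketch of assembling such a reference follows; the database, schema, and table names are placeholders for illustration:

```python
def qualified_name(database, schema, table):
    """Three-part notation used by Redshift cross-database queries.
    All names here are placeholders."""
    return f"{database}.{schema}.{table}"

# Query a table in another database on the same cluster while
# connected to a different one:
sql = f"SELECT * FROM {qualified_name('sales_db', 'public', 'orders')} LIMIT 10;"
print(sql)
```

The connected database is implicit for two-part (schema.table) references; adding the leading database name is what lets a single query join data across databases on the cluster.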
Q: What is AWS Data Exchange for Amazon Redshift?
AWS Data Exchange for Amazon Redshift lets you find and subscribe to third-party data in AWS Data Exchange that you can query in a Redshift data warehouse in minutes. You can also easily license your data in Amazon Redshift through AWS Data Exchange. Access is automatically granted when a customer subscribes to your data and automatically revoked when their subscription ends, invoices are automatically generated, and payments are automatically collected and disbursed through AWS. This feature empowers you to quickly query, analyze, and build applications with third-party data.
Q: Who are the primary users of AWS Data Exchange?
AWS Data Exchange makes it easy for AWS customers to securely exchange and use third-party data in AWS. Data analysts, product managers, portfolio managers, data scientists, quants, clinical trial technicians, and developers in nearly every industry would like access to more data to drive analytics, train ML models, and make data-driven decisions. But there is no one place to find data from multiple providers and no consistency in how providers deliver data, leaving them to deal with a mix of shipped physical media, FTP credentials, and bespoke API calls. Conversely, many organizations would like to make their data available for research or commercial purposes, but it’s too hard and expensive to build and maintain data delivery, entitlement, and billing technology, which further depresses the supply of valuable data.
Q: What AWS Regions is AWS Data Exchange available in?
AWS Data Exchange has a single, globally available product catalog offered by providers. You can see the same catalog regardless of which Region you are using. The resources underlying the product (datasets, revisions, and assets) are regional resources that you manage programmatically or through the AWS Data Exchange console in specific AWS Regions. See the AWS Regional Availability Table for a list of AWS Regions in which AWS Data Exchange is available today.
Q: What is the difference between AWS Data Exchange and the Registry of Open Data on AWS?
There are four key differences between AWS Data Exchange and the Registry of Open Data on AWS:
- First, AWS Data Exchange supports both free and commercial data products, with any applicable commercial fees applied to your AWS invoice. The Registry of Open Data on AWS gives you access to a curated list of free and open datasets.
- Second, you must use the AWS Data Exchange API to copy data from AWS Data Exchange to your desired Amazon S3 location. The Registry of Open Data on AWS datasets are accessed via S3 APIs.
- Third, AWS Data Exchange gives data providers access to daily, weekly, and monthly reports detailing subscription activity. With the Registry of Open Data on AWS, data providers must analyze their own logs to track usage of their data.
- Finally, to become a data provider on AWS Data Exchange, qualified customers must register as a data provider on AWS Marketplace to be eligible to list both free and commercial products. However, any customer can add free data to the Registry of Open Data on AWS through GitHub and may apply to the AWS Public Dataset Program for AWS to sponsor the costs of storage and bandwidth for select open datasets.
Q: What is Amazon Redshift Query Editor V2?
Amazon Redshift Query Editor v2 is a web-based SQL client application that you can use to author and run queries on your Redshift data warehouse. You can visualize query results with charts and collaborate by sharing queries with members of your team. Query Editor v2 provides several capabilities, such as the ability to browse and explore multiple databases, external tables, views, stored procedures, and user-defined functions. It provides wizards to create schemas, tables, and user-defined functions, and you can load data into Amazon Redshift from Amazon S3 using a visual wizard. It simplifies managing and collaborating on saved queries, and you can gain faster insights by visualizing results with a single click. With the latest preview release, data analysts can share their queries and collaborate through a common interface called the Query Doc, which lets them embed SQL queries, annotations, results, and visualizations.
Q: Why should I use Query Editor V2?
If you are a data analyst, data scientist, or data engineer, you can use Query Editor v2 to browse data, create schemas and tables, load data, and author SQL queries, stored procedures, and UDFs through a web-based interface. You can also perform visual analysis of data in place without having to leave the tool, and schedule long-running queries or recurring queries for simple reporting purposes such as daily reports.
Q: What are the features included in Query Editor v2?
Query Editor v2 allows you to:
- Visually create schemas and tables, and load data from Amazon S3.
- Author queries and gain faster insights with an intuitive editor for authoring SQL queries.
- Perform analysis of results and download results in JSON/CSV formats to your desktop.
- Automatically manage different versions of queries.
- Collaborate with other users to share queries, analysis, and results.
- Run queries in the background even if the browser is closed.
Scalability and concurrency
Q: How do I scale the size and performance of my Amazon Redshift data warehouse cluster?
If you would like to increase query performance or respond to CPU, memory, or I/O overutilization, you can increase the number of nodes within your data warehouse cluster using Elastic Resize through the AWS Management Console or the ModifyCluster API. When you modify your data warehouse cluster, your requested changes will be applied immediately. Metrics for compute utilization, storage utilization, and read/write traffic to your Redshift data warehouse cluster are available free of charge through the AWS Management Console or Amazon CloudWatch APIs. You can also add user-defined metrics via Amazon CloudWatch custom metric functionality.
With the Concurrency Scaling feature, you can support virtually unlimited concurrent users and concurrent queries with consistently fast query performance. When concurrency scaling is enabled, Amazon Redshift automatically adds cluster capacity when your cluster experiences an increase in query queueing.
With Amazon Redshift Spectrum, you can run multiple Redshift clusters accessing the same data in Amazon S3. You can use different clusters for different use cases. For example, you can use one cluster for standard reporting and another for data science queries, and your marketing team can use clusters separate from your operations team's. Redshift Spectrum automatically distributes the execution of your query across several Redshift Spectrum workers from a shared resource pool to read and process data from Amazon S3, and pulls results back into your Redshift cluster for any remaining processing.
Q: Will my data warehouse cluster remain available during scaling?
It depends. When you use the Concurrency Scaling feature, the cluster is fully available for reads and writes during concurrency scaling. With Elastic Resize, the cluster is unavailable for a four-to-eight-minute window during the resize. With Redshift RA3 storage elasticity in managed storage, the cluster is fully available and data is automatically moved between managed storage and compute nodes.
Q: When should I use concurrency scaling and when should I use data sharing?
Data sharing and concurrency scaling are complementary features. With concurrency scaling, Amazon Redshift allows you to auto-scale one or more workloads in a single cluster to handle high concurrency and query spikes. Amazon Redshift elastically and automatically spins up the capacity in seconds to deal with bursts of user activity and brings it down when activity subsides. Applications continue to interact with Amazon Redshift using a single application endpoint. Data sharing allows you to scale to diverse workloads with multi-cluster, multi-account deployments. This allows for workload isolation and chargeability, cross-group collaboration in decentralized environments, and the ability to offer data as a service to internal and external stakeholders. You can enable concurrency scaling on both data sharing producer clusters and consumer clusters.
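As a rough sketch of how a data share is set up with SQL (all object names and the namespace GUID below are illustrative, not from this FAQ):

```sql
-- On the producer cluster: create a datashare and add objects to it
CREATE DATASHARE sales_share;
ALTER DATASHARE sales_share ADD SCHEMA public;
ALTER DATASHARE sales_share ADD TABLE public.sales;

-- Grant the share to a consumer cluster's namespace (example GUID)
GRANT USAGE ON DATASHARE sales_share
TO NAMESPACE '13b8833d-17c6-4f16-8fe4-1a018f5ed00d';

-- On the consumer cluster: create a database from the share and query it
CREATE DATABASE sales_db
FROM DATASHARE sales_share OF NAMESPACE '13b8833d-17c6-4f16-8fe4-1a018f5ed00d';
SELECT COUNT(*) FROM sales_db.public.sales;
```

The consumer queries live data without any copies being made, which is what enables the workload isolation described above.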
Q: How do I manage resources to ensure that my Amazon Redshift cluster can provide consistently fast performance during periods of high concurrency?
A typical data warehouse has significant variance in concurrent query usage over the course of a day. It is more cost-effective to add resources just for the period during which they are required rather than provisioning to peak demand. Amazon Redshift handles this automatically on your behalf.
Concurrency Scaling is a feature in Amazon Redshift that provides consistently fast query performance, even with thousands of concurrent queries. With this feature, Amazon Redshift automatically adds transient capacity when needed to handle heavy demand. Amazon Redshift automatically routes queries to scaling clusters, which are provisioned in seconds and begin processing queries immediately.
This feature is free for most customers. Each Amazon Redshift cluster earns up to one hour of free Concurrency Scaling credits per day. This gives you predictability in your month-to-month cost, even during periods of fluctuating analytical demand.
Q: What is Elastic Resize and how is it different from Concurrency Scaling?
Elastic Resize adds or removes nodes from a single Redshift cluster within minutes to manage its query throughput. For example, an ETL workload for certain hours in a day or month-end reporting may need additional Amazon Redshift resources to complete on time. Concurrency Scaling adds additional cluster resources to increase the overall query concurrency.
Q: Can I access the Concurrency Scaling clusters directly?
No. Concurrency Scaling is a massively scalable pool of Amazon Redshift resources and customers do not have direct access.
Data integration and loading
Q: How do I load data into my Amazon Redshift data warehouse?
You can load data into Amazon Redshift from a range of data sources including Amazon S3, Amazon RDS, Amazon DynamoDB, Amazon EMR, AWS Glue, AWS Data Pipeline, or any SSH-enabled host on Amazon EC2 or on premises. Amazon Redshift attempts to load your data in parallel into each compute node to maximize the rate at which you can ingest data into your data warehouse cluster. Clients can connect to Amazon Redshift using ODBC or JDBC and issue INSERT SQL commands to insert the data. Please note this is slower than using the S3 or DynamoDB methods, since those load data in parallel to each compute node while SQL INSERT statements load via the single leader node. For more details on loading data into Amazon Redshift, please view our Getting Started Guide.
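A minimal sketch of a parallel load from Amazon S3 using the COPY command (the bucket, table, and IAM role names are illustrative):

```sql
-- Load Parquet files from Amazon S3 in parallel across compute nodes
COPY sales
FROM 's3://my-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftRole'
FORMAT AS PARQUET;
```

Because COPY reads from S3 in parallel on every compute node, it is typically far faster than issuing row-by-row INSERT statements through the leader node.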
Q: How do I load data from my existing Amazon RDS, Amazon EMR, Amazon DynamoDB, and Amazon EC2 data sources to Amazon Redshift?
You can use our COPY command to load data in parallel directly into Amazon Redshift from Amazon EMR, Amazon DynamoDB, or any SSH-enabled host. Amazon Redshift Spectrum also lets you load data from Amazon S3 into your cluster with a simple INSERT INTO command. With this approach, you can load data in various formats, such as Parquet and ORC, into your cluster. Note that if you use it, you will accrue Redshift Spectrum charges for the data scanned from Amazon S3.
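A hedged sketch of the INSERT INTO pattern mentioned above, assuming an external schema named spectrum_schema has already been registered (table and column names are illustrative):

```sql
-- Copy rows from an external (Spectrum) table into a local table;
-- the data scanned from Amazon S3 accrues Redshift Spectrum charges
INSERT INTO local_sales
SELECT *
FROM spectrum_schema.sales
WHERE sale_date >= '2022-01-01';
```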
AWS Data Pipeline provides a high-performance, reliable, fault-tolerant solution to load data from a variety of AWS data sources, such as Amazon RDS, into Redshift. You can use AWS Data Pipeline to specify the data source and desired data transformations, and then run a prewritten import script to load your data into Amazon Redshift. In addition, AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy to prepare and load data for analytics. You can create and run an AWS Glue ETL job with a few clicks in the AWS Management Console. Further, many ETL companies have certified Amazon Redshift for use with their tools, and a number are offering free trials to help you get started loading your data. Some of these partners have also implemented deeper integration with the Redshift console for easier discovery and monitoring of data pipelines into Amazon Redshift from a large variety of third-party sources.
Q: I have a lot of data for initial loading into Amazon Redshift. Transferring via the Internet would take a long time. How do I load this data?
You can use AWS Snowball to transfer the data to Amazon S3 using portable storage devices. In addition, you can use AWS Direct Connect to establish a private network connection between your network or data center and AWS. You can choose 1 Gbps or 10 Gbps connection ports to transfer your data.
Q: How does Amazon Redshift keep my data secure?
Amazon Redshift supports industry-leading security with built-in AWS IAM integration, identity federation for single sign-on (SSO), multi-factor authentication, column-level access control, Amazon Virtual Private Cloud (Amazon VPC), and built-in AWS KMS integration to protect your data in transit and at rest. Amazon Redshift encrypts and keeps your data secure in transit and at rest using industry-standard encryption techniques. To keep data secure in transit, Amazon Redshift supports SSL-enabled connections between your client application and your Redshift data warehouse cluster. To keep your data secure at rest, Amazon Redshift encrypts each block using hardware-accelerated AES-256 as it is written to disk. This takes place at a low level in the I/O subsystem, which encrypts everything written to disk, including intermediate query results. The blocks are backed up as is, which means that backups are encrypted as well. By default, Amazon Redshift takes care of key management, but you can choose to manage your keys through AWS Key Management Service. All Amazon Redshift security features are offered at no additional cost. Redshift Spectrum supports Amazon S3’s Server-Side Encryption (SSE) using your account’s default key managed by the AWS Key Management Service (KMS).
Q: Does Redshift support granular access controls like column level security?
Yes. Granular column-level security controls ensure users see only the data they should have access to. Amazon Redshift supports column-level access control for local tables, so you can control access to individual columns of a table or view by granting and revoking column-level privileges to a user or user group. Redshift is integrated with AWS Lake Formation, ensuring that Lake Formation’s column-level access controls are also enforced for Redshift queries on data in the data lake.
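A brief sketch of column-level grants (the table, column, and user names are illustrative):

```sql
-- Allow an analyst to read only non-sensitive columns of a table
GRANT SELECT (customer_id, city, signup_date) ON customer_profile TO analyst_user;

-- Revoke access to a single column later if needed
REVOKE SELECT (city) ON customer_profile FROM analyst_user;
```

Queries by analyst_user that reference any other column of customer_profile will fail with a permission error.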
Q: Does Amazon Redshift support data masking or data tokenization?
AWS Lambda user-defined functions (UDFs) enable you to use an AWS Lambda function as a UDF in Amazon Redshift and invoke it from Redshift SQL queries. This functionality lets you write custom extensions for your SQL queries to achieve tighter integration with other services or third-party products. You can write Lambda UDFs to enable external tokenization, data masking, and identification or de-identification of data by integrating with vendors like Protegrity, and to protect or unprotect sensitive data based on a user’s permissions and groups at query time.
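A hedged sketch of registering a Lambda function as a scalar UDF for masking (the function, Lambda, role, and table names are all illustrative):

```sql
-- Register an AWS Lambda function as a scalar UDF
CREATE EXTERNAL FUNCTION mask_ssn (VARCHAR)
RETURNS VARCHAR
IMMUTABLE
LAMBDA 'my-masking-lambda'
IAM_ROLE 'arn:aws:iam::123456789012:role/MyLambdaInvokeRole';

-- Invoke it from SQL like any other function
SELECT customer_id, mask_ssn(ssn) FROM customers;
```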
Q: Does Amazon Redshift support single sign-on?
Yes. Customers who want to use their corporate identity providers such as Microsoft Azure Active Directory, Active Directory Federation Services, Okta, Ping Federate, or other SAML-compliant identity providers can configure Amazon Redshift to provide single sign-on.
Q: How does Amazon Redshift support single sign-on with Microsoft Azure Active Directory?
You can sign in to your Amazon Redshift cluster with Microsoft Azure Active Directory (AD) identities. This lets you sign in to Redshift without duplicating Azure Active Directory identities in Redshift.
Q: Does Amazon Redshift support multi-factor authentication (MFA)?
Yes. You can use multi-factor authentication (MFA) for additional security when authenticating to your Amazon Redshift cluster.
Q: Can I use Amazon Redshift in Amazon Virtual Private Cloud (Amazon VPC)?
Yes. You can use Amazon Redshift as part of your VPC configuration. With Amazon VPC, you can define a virtual network topology that closely resembles a traditional network that you might operate in your own data center. This gives you complete control over who can access your Redshift data warehouse cluster. You can use Redshift Spectrum with a Redshift cluster that is part of your Amazon VPC.
Amazon Redshift supports managed VPC endpoints (powered by AWS PrivateLink) to connect to your Redshift cluster in a VPC. With an Amazon Redshift–managed endpoint, you can privately access your Redshift data warehouse within your VPC from client applications in another VPC, in the same or another AWS account, or from applications running on premises, without using public IPs or requiring traffic to traverse the Internet.
Q: Can I access my Amazon Redshift compute nodes directly?
No. Your Amazon Redshift compute nodes are in a private network space and can only be accessed from your data warehouse cluster's leader node. This provides an additional layer of security for your data.
Q: Does Redshift support role-based access control in the database? (Pre-announcement)
Amazon Redshift will provide support for role-based access control soon.
Availability and durability
Q: What happens to my data warehouse cluster availability and data durability if a drive on one of my nodes fails?
Amazon Redshift will detect a failed drive or node and replace it automatically. On Dense Compute (DC) and Dense Storage (DS2) clusters, the data is stored on the compute nodes to ensure high data durability. When a node is replaced, the data is refreshed from the mirror copy on the other node.
RA3 clusters and Amazon Redshift Serverless are not impacted in the same way, since the data is stored in Amazon S3 and the local drive is used only as a data cache. In the event of a node replacement, the data is retrieved from Amazon S3. Amazon S3 is designed for 99.999999999% (11 nines) of data durability. In the event of a multi-node or complete cluster failure, an up-to-date copy of the data is available in S3, and the cluster can be recovered in the same AZ or another AZ without any data loss.
The data warehouse cluster will be unavailable for queries and updates until a replacement node is provisioned and added to the database. Amazon Redshift makes your replacement node available immediately and loads your most frequently accessed data from Amazon S3 on RA3 and serverless, and from the mirror on Dense Storage (DS2) and Dense Compute (DC2). Single-node DC2 and DS2 clusters do not support data replication; in the event of a drive failure, you will need to restore the cluster from a snapshot on S3. Single-node RA3.XLPLUS clusters can be recreated without any data loss using the data stored in S3, with the assistance of AWS Support. We recommend using at least two nodes for production to maximize availability.
Q: What happens to my data warehouse cluster availability and data durability in the event of individual node failure?
Amazon Redshift will automatically detect and replace a failed node in your data warehouse cluster. The data warehouse cluster will be unavailable for queries and updates until a replacement node is provisioned and added to the DB. Amazon Redshift makes your replacement node available immediately and loads your most frequently accessed data from S3 first to allow you to resume querying your data as quickly as possible. Single node clusters do not support data replication. In the event of a drive failure, you will need to restore the cluster from snapshot on S3. We recommend using at least two nodes for production.
Q: What happens to my data warehouse cluster availability and data durability if my data warehouse cluster's Availability Zone (AZ) has an outage?
If your Amazon Redshift data warehouse cluster's Availability Zone becomes unavailable, Amazon Redshift will automatically move your cluster to another AWS Availability Zone (AZ) without any data loss or application changes. To activate this, you must enable the relocation capability in your cluster configuration settings.
Q: Does Amazon Redshift support Multi-AZ Deployments?
Currently, Amazon Redshift supports only single-AZ deployments. To set up a disaster recovery (DR) configuration, you can enable cross-Region snapshot copy on your cluster. This will replicate all snapshots from your cluster to another AWS Region. In the event of a DR event, the snapshots in the replica Region can be restored to create a new cluster. Amazon Redshift also supports cross-Region data sharing, where a consumer cluster can access live data in a producer cluster in another Region. This is supported only with Amazon Redshift Serverless and RA3 clusters.
Querying and analytics
Q: Are Amazon Redshift and Redshift Spectrum compatible with my preferred business intelligence software package and ETL tools?
Yes, Amazon Redshift uses industry-standard SQL and is accessed using standard JDBC and ODBC drivers. You can download Amazon Redshift custom JDBC and ODBC drivers from the Connect Client tab of the Redshift Console. We have validated integrations with popular BI and ETL vendors, a number of which are offering free trials to help you get started loading and analyzing your data. You can also go to the AWS Marketplace to deploy and configure solutions designed to work with Amazon Redshift in minutes.
Amazon Redshift Spectrum supports all Amazon Redshift client tools. The client tools can continue to connect to the Amazon Redshift cluster endpoint using ODBC or JDBC connections. No changes are required.
You use exactly the same query syntax and have the same query capabilities to access tables in Redshift Spectrum as you have for tables in the local storage of your Redshift cluster. External tables are referenced using the schema name defined in the CREATE EXTERNAL SCHEMA command where they were registered.
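The registration step referred to above can be sketched as follows, assuming metadata is kept in the AWS Glue Data Catalog (the schema, database, and role names are illustrative):

```sql
-- Register an external schema backed by the AWS Glue Data Catalog
CREATE EXTERNAL SCHEMA spectrum_schema
FROM DATA CATALOG
DATABASE 'my_glue_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- External tables are then queried with the same SQL syntax as local tables
SELECT COUNT(*) FROM spectrum_schema.sales;
```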
Q: What data formats and compression formats does Amazon Redshift Spectrum support?
Amazon Redshift Spectrum currently supports many open source data formats, including Avro, CSV, Grok, Amazon Ion, JSON, ORC, Parquet, RCFile, RegexSerDe, Sequence, Text, and TSV.
Amazon Redshift Spectrum currently supports Gzip and Snappy compression.
Q: What happens if a table in my local storage has the same name as an external table?
Just like with local tables, you can use the schema name to pick exactly which one you mean by using schema_name.table_name in your query.
Q: I use a Hive Metastore to store metadata about my S3 data lake. Can I use Redshift Spectrum?
Yes. The CREATE EXTERNAL SCHEMA command supports Hive Metastores. We do not currently support DDL against the Hive Metastore.
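A hedged sketch of pointing an external schema at a Hive Metastore (the host address, port, database, and role are illustrative):

```sql
-- Register an external schema backed by an existing Hive Metastore
CREATE EXTERNAL SCHEMA hive_schema
FROM HIVE METASTORE
DATABASE 'hive_db'
URI '172.10.10.10' PORT 9083
IAM_ROLE 'arn:aws:iam::123456789012:role/MySpectrumRole';
```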
Q: How do I get a list of all external database tables created in my cluster?
You can query the system table SVV_EXTERNAL_TABLES to get that information.
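For example:

```sql
-- List all external tables registered in the cluster
SELECT schemaname, tablename, location
FROM svv_external_tables
ORDER BY schemaname, tablename;
```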
Q: Does Redshift support the ability to use Machine Learning with SQL?
Yes, the Amazon Redshift ML feature makes it easy for SQL users to create, train, and deploy machine learning (ML) models using familiar SQL commands. Amazon Redshift ML allows you to leverage your data in Amazon Redshift with Amazon SageMaker, a fully managed ML service. Amazon Redshift supports both unsupervised learning (K-Means) and supervised learning (Autopilot, XGBoost, MLP algorithms).
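A hedged sketch of the CREATE MODEL workflow (the table, columns, S3 bucket, and role names are illustrative):

```sql
-- Train and deploy a churn model with familiar SQL; training runs in
-- Amazon SageMaker behind the scenes
CREATE MODEL customer_churn_model
FROM (SELECT age, tenure_months, monthly_spend, churned
      FROM customer_activity)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/MyRedshiftMLRole'
SETTINGS (S3_BUCKET 'my-redshift-ml-bucket');

-- Once trained, the model is invoked as an ordinary SQL function
SELECT customer_id, predict_churn(age, tenure_months, monthly_spend)
FROM customer_activity;
```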
Q: Does Amazon Redshift provide an API to query data?
Amazon Redshift provides a Data API that you can use to access data from Amazon Redshift in all types of traditional, cloud-native, containerized, serverless, and event-driven applications. The Data API simplifies access to Amazon Redshift because you don’t need to configure drivers and manage database connections. Instead, you can run SQL commands on an Amazon Redshift cluster by simply calling a secured API endpoint provided by the Data API. The Data API takes care of managing database connections and buffering data. The Data API is asynchronous, so you can retrieve your results later. Your query results are stored for 24 hours.
Q: What types of credentials can I use with Amazon Redshift Data API?
The Data API supports both IAM credentials and credentials stored in AWS Secrets Manager. The Data API federates AWS Identity and Access Management (IAM) credentials, so you can use identity providers like Okta or Azure Active Directory, or use database credentials stored in Secrets Manager, without passing database credentials in API calls.
Q: Can I use Amazon Redshift Data API from AWS CLI?
Yes, you can use the Data API from the AWS CLI using the aws redshift-data commands.
Q: Is the Redshift Data API integrated with other AWS services?
You can use the Data API from other AWS services such as AWS Lambda, AWS Cloud9, AWS AppSync, and Amazon EventBridge.
Q: Do I have to pay separately for using the Amazon Redshift Data API?
No, there is no separate charge for using the Data API.
Backup and restore
Q: How does Amazon Redshift back up my data? How do I restore my cluster from a backup?
Amazon Redshift RA3 clusters and Amazon Redshift Serverless use Redshift Managed Storage, which always has the latest copy of the data available. DS2 and DC2 clusters mirror the data on the cluster to ensure the latest copy is available in the event of a failure. Backups are automatically created on all Redshift cluster types and retained for 24 hours, and on serverless recovery points are provided for the past 24 hours.
You can also create your own backups that can be retained indefinitely. These backups can be created at any time, and Amazon Redshift automated backups or Amazon Redshift Serverless recovery points can be converted into a user backup for longer retention.
Amazon Redshift can also asynchronously replicate your snapshots or recovery points to Amazon S3 in another Region for disaster recovery.
On a DS2 or DC2 cluster, free backup storage is limited to the total size of storage on the nodes in the data warehouse cluster and only applies to active data warehouse clusters.
For example, if you have total data warehouse storage of 8 TB, we will provide at most 8 TB of backup storage at no additional charge. If you would like to extend your backup retention period beyond one day, you can do so using the AWS Management Console or the Amazon Redshift APIs. For more information on automated snapshots, please refer to the Amazon Redshift Management Guide.
Amazon Redshift only backs up data that has changed, so most snapshots use only a small amount of your free backup storage. When you need to restore a backup, you have access to all the automated backups within your backup retention window. Once you choose a backup from which to restore, we will provision a new data warehouse cluster and restore your data to it.
Q: How do I manage the retention of my automated backups and snapshots?
You can use the AWS Management Console or the ModifyCluster API to manage the period of time your automated backups are retained by modifying the RetentionPeriod parameter. If you wish to turn off automated backups altogether, you can set the retention period to 0 (not recommended).
Q: What happens to my backups if I delete my data warehouse cluster?
When you delete a data warehouse cluster you have the ability to specify whether a final snapshot is created upon deletion. This enables a restore of the deleted data warehouse cluster at a later date. All previously created manual snapshots of your data warehouse cluster will be retained and billed at standard Amazon S3 rates, unless you choose to delete them.
Monitoring and maintenance
Q: How do I monitor the performance of my Amazon Redshift data warehouse cluster?
Metrics for compute utilization, storage utilization, and read/write traffic to your Amazon Redshift data warehouse cluster are available free of charge via the AWS Management Console or Amazon CloudWatch APIs. You can also add additional, user-defined metrics via Amazon CloudWatch’s custom metric functionality. The AWS Management Console provides a monitoring dashboard that helps you monitor the health and performance of all your clusters. Amazon Redshift also provides information on query and cluster performance via the AWS Management Console. This information enables you to see which users and queries are consuming the most system resources to diagnose performance issues by viewing query plans and execution statistics. In addition, you can see the resource utilization on each of your compute nodes to ensure that you have data and queries that are well-balanced across all nodes.
Q: What is a maintenance window? Will my data warehouse cluster be available during software maintenance?
Amazon Redshift periodically performs maintenance to apply fixes, enhancements and new features to your cluster. You can change the scheduled maintenance windows by modifying the cluster, either programmatically or by using the Redshift Console. During these maintenance windows, your Amazon Redshift cluster is not available for normal operations. For more information about maintenance windows and schedules by region, see Maintenance Windows in the Amazon Redshift Management Guide.