Amazon Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud. With a few clicks in the AWS Management Console, customers can launch a Redshift cluster, starting with a few hundred gigabytes and scaling to a petabyte or more, for under $1,000 per terabyte per year.
Traditional data warehouses require significant time and resource to administer, especially for large datasets. In addition, the financial cost associated with building, maintaining, and growing self-managed, on-premise data warehouses is very high. Amazon Redshift not only significantly lowers the cost of a data warehouse, but also makes it easy to analyze large amounts of data very quickly.
Amazon Redshift gives you fast querying capabilities over structured data using familiar SQL-based clients and business intelligence (BI) tools using standard ODBC and JDBC connections. Queries are distributed and parallelized across multiple physical resources. You can easily scale an Amazon Redshift data warehouse up or down with a few clicks in the AWS Management Console or with a single API call. Amazon Redshift automatically patches and backs up your data warehouse, storing the backups for a user-defined retention period. Amazon Redshift uses replication and continuous backups to enhance availability and improve data durability and can automatically recover from component and node failures. In addition, Amazon Redshift supports Amazon Virtual Private Cloud (Amazon VPC), SSL, and AES-256 encryption to protect your data in transit and at rest.
As with all Amazon Web Services, there are no up-front investments required, and you pay only for the resources you use. Amazon Redshift lets you pay as you go.
Amazon Redshift manages the work needed to set up, operate, and scale a data warehouse, from provisioning the infrastructure capacity to automating ongoing administrative tasks such as backups and patching. Amazon Redshift automatically monitors your nodes and drives to help you recover from failures.
Amazon Redshift uses a variety of innovations to achieve up to ten times higher performance than traditional databases for data warehousing and analytics workloads:
You can sign up and get started within minutes from the Amazon Redshift detail page or via the AWS Management Console. If you don’t already have an AWS account, you’ll be prompted to create one. Once you’ve signed up, please view our Getting Started Guide.
You can easily create an Amazon Redshift data warehouse cluster by using the AWS Management Console or the Amazon Redshift APIs. You can provision Amazon Redshift as a single 2TB data warehouse or as a cluster of many 2TB or 16TB nodes.
The single node configuration enables you to get started with Amazon Redshift quickly and cost-effectively and scale up to a multi-node configuration as your needs grow. The multi-node configuration requires a leader node that manages client connections and receives queries, and two compute nodes that store data and perform queries and computations. The leader node is provisioned for you automatically and you are not charged for it.
Simply specify your preferred Availability Zone (optional), the number of nodes, node types, a master name and password, security groups, your preferences for backup retention, and other system settings. Once you’ve chosen your desired configuration, Amazon Redshift will provision the required resources and set up your data warehouse cluster.
A leader node receives queries from client applications, parses the queries and develops execution plans, which are an ordered set of steps to process these queries. The leader node then coordinates the parallel execution of these plans with the compute nodes, aggregates the intermediate results from these nodes and finally returns the results back to the client applications.
Compute nodes execute the steps specified in the execution plans and transmit data among themselves to serve these queries. The intermediate results are sent back to the leader node for aggregation before being sent back to the client applications.
Amazon Redshift has two compute node sizes. The first, the dw.hs1.xlarge (XL), can store up to 2TB of compressed user data per node. This type can be used in a single node configuration or as part of a multi-node data warehouse. The second, the dw.hs1.8xlarge (8XL), can store up to 16TB of compressed user data per node. 8XL nodes can only be used as part of a multi-node Amazon Redshift data warehouse cluster.
Amazon Redshift’s MPP architecture means you can increase your performance by increasing the number of nodes in your data warehouse cluster. The optimal amount of data per compute node depends on your application characteristics and your query performance needs.
By default, you can have a maximum of 16 XL nodes per data warehouse cluster or 16 8XL nodes per data warehouse cluster. You can request a higher limit via this Node Limit Increase form.
The total number of nodes across all your Amazon Redshift data warehouse clusters limits the number of data warehouse clusters you can run. By default, you are allowed to have up to 16 nodes across all your data warehouse clusters. You can request a higher limit via this node limit increase form
Once your data warehouse cluster is available, you can retrieve its endpoint and JDBC and ODBC connection string from the AWS Management Console or by using the Redshift APIs. You can then use this connection string with your favorite database tool, programming language, or Business Intelligence (BI) tool. You will need to authorize network requests to your running data warehouse cluster. For a detailed explanation please refer to our Getting Started Guide.
Both Amazon Redshift and Amazon RDS enable you to run traditional relational databases such as MySQL, Oracle and SQL Server in the cloud while offloading database administration. Customers use Amazon RDS databases both for online-transaction processing (OLTP) and for reporting and analysis. Amazon Redshift harnesses the scale and resources of multiple nodes and uses a variety of optimizations to provide order of magnitude improvements over traditional databases for analytic and reporting workloads against very large data sets. Amazon Redshift provides an excellent scale-out option as your data and query complexity grows or if you want to prevent your reporting and analytic processing from interfering with the performance of your OLTP workload.
Amazon Redshift is ideal for large volumes of structured data that you want to persist and query using standard SQL and your existing BI tools. Amazon EMR is ideal for processing and transforming data to bring in to Amazon Redshift and is also a much better option for data sets that are relatively transitory, not stored for long-term use.
Amazon Redshift automatically handles many of the time-consuming tasks associated with managing your own data warehouse including:
You pay only for what you use, and there are no minimum or setup fees. You are billed based on:
For Amazon Redshift pricing information, please visit the Amazon Redshift pricing page.
Billing commences for a data warehouse cluster as soon as the data warehouse cluster is available. Billing continues until the data warehouse cluster terminates, which would occur upon deletion or in the event of instance failure.
Node usage hours are billed for each hour your data warehouse cluster is running in an available state. If you no longer wish to be charged for your data warehouse cluster, you must terminate it to avoid being billed for additional node hours. Partial node hours consumed are billed as full hours.
You can load data into Amazon Redshift from a range of data sources including Amazon S3, Amazon DynamoDB, and AWS Data Pipeline. Amazon Redshift attempts to load your data in parallel into each compute node to maximize the rate at which you can ingest data into your data warehouse cluster. For more details on loading data into Amazon Redshift please view our Getting Started Guide.
Yes, clients can connect to Amazon Redshift using ODBC or JDBC and issue ‘insert’ SQL commands to insert the data. Please note this is slower than using S3 or DynamoDB since those methods load data in parallel to each compute node while SQL insert statements load via the single leader node.
AWS Data Pipeline provides a high performance, reliable, fault tolerant solution to load data from a variety of AWS data sources. You can use AWS Data Pipeline to specify the data source, desired data transformations, and then execute a pre-written import script to load your data into Amazon Redshift.
You can use AWS Import/Export to transfer the data to Amazon S3 using portable storage devices. In addition, you can use AWS Direct Connect to establish a private network connection between your network or datacenter and AWS. You can choose 1Gbit/sec or 10Gbit/sec connection ports to transfer your data.
Amazon Redshift encrypts and keeps your data secure in transit and at rest using industry-standard encryption techniques. To keep data secure in transit, Amazon Redshift supports SSL-enabled connections between your client application and your Redshift data warehouse cluster. To keep your data secure at rest, Amazon Redshift encrypts each block using hardware-accelerated AES-256 as it is written to disk. This takes place at a low level in the I/O subsystem, which encrypts everything written to disk, including intermediate query results. The blocks are backed up as is, which means that backups are encrypted as well.
Yes, you can use Amazon Redshift as part of your VPC configuration. With Amazon VPC, you can define a virtual network topology that closely resembles a traditional network that you might operate in your own datacenter. This gives you complete control over who can access your Amazon Redshift data warehouse cluster.
No. Your Amazon Redshift compute nodes are in a private network space and can only be accessed from your data warehouse cluster’s leader node. This provides an additional layer of security for your data.
Your Amazon Redshift data warehouse cluster will remain available in the event of a drive failure however you may see a slight decline in performance for certain queries. In the event of a drive failure, Amazon Redshift will transparently use a replica of the data on that drive which is stored on other drives within that node. In addition, Amazon Redshift will attempt to move your data to a healthy drive or will replace your node if it is unable to do so.
Amazon Redshift will automatically detect and replace a failed node in your data warehouse cluster. The data warehouse cluster will be unavailable for queries and updates until a replacement node is provisioned and added to the DB, typically only a few minutes. Amazon Redshift makes your replacement node available immediately and streams your most frequently accessed data from S3 first to allow you to resume querying your data as quickly as possible.
If your Amazon Redshift data warehouse cluster’s Availability Zone becomes unavailable, you will not be able to use your cluster until power and network access to the AZ are restored. Your data warehouse cluster’s data is preserved so you can start using your Amazon Redshift data warehouse as soon as the AZ becomes available again. In addition, you can also choose to restore any existing snapshots to a new AZ in the same Region. Amazon Redshift will restore your most frequently accessed data first so you can resume queries as quickly as possible.
Currently, Amazon Redshift only supports Single-AZ deployments. You can run data warehouse clusters in multiple AZ’s by loading data into two Amazon Redshift data warehouse clusters in separate AZs from the same set of Amazon S3 input files. In addition, you can also restore a data warehouse cluster to a different AZ from your data warehouse cluster snapshots.
Amazon Redshift replicates all your data within your data warehouse cluster when it is loaded and also continuously backs up your data to S3. Amazon Redshift always attempts to maintain at least three copies of your data (the original and replica on the compute nodes and a backup in Amazon S3).
By default, Amazon Redshift retains backups for 1 day by default. You can configure this to be as long as 35 days.
You have access to all the automated backups within your backup retention window and restore to them. Once you choose a backup from which to restore, we will provision a new data warehouse cluster and restore your data to it.
By default, Amazon Redshift enables automated backups of your data warehouse cluster with a 1-day retention period. Free backup storage is limited to the total size of storage on the nodes in the data warehouse cluster and only applies to active data warehouse clusters. For example, if you have total data warehouse storage of 8TB, we will provide at most 8TB of backup storage at no additional charge. If you would like to extend your backup retention period beyond one day, you can do so using the AWS Management Console or the Amazon Redshift APIS. For more information on automated snapshots, please refer to the Amazon Redshift Developer Guide. Amazon Redshift only backs up data that has changed so most snapshots only use up a small amount of your free backup storage.
You can use the AWS Management Console or ModifyCluster API to manage the period of time your automated backups are retained by modifying the RetentionPeriod parameter. If you desire to turn off automated backups altogether, you can do so by setting the retention period to 0 (not recommended).
When you delete a data warehouse cluster, you have the ability to specify whether a final snapshot is created upon deletion, which enables a restore of the deleted data warehouse cluster at a later date. All previously created manual snapshots of your data warehouse cluster will be retained and billed at standard Amazon S3 rates, unless you choose to delete them.
If you would like to increase query performance or respond to CPU, memory or I/O over-utilization, you can increase the number of nodes within your data warehouse cluster via the AWS Management Console or the ModifyCluster API. When you modify your data warehouse cluster, your requested changes will be applied during your specified maintenance window. You can also use the "apply-immediately" flag to apply your scaling requests immediately. Bear in mind that any other pending system changes will be applied as well. Metrics for compute utilization, memory utilization, storage utilization, and read/write traffic to your Amazon Redshift data warehouse cluster are available free of charge via the AWS Management Console or Amazon CloudWatch APIs. You can also add additional, user-defined metrics via Amazon Cloudwatch’s custom metric functionality.
The existing data warehouse cluster remains available for read operations while a new data warehouse cluster gets created during scaling operations. When the new data warehouse cluster is ready, your existing data warehouse cluster will be temporarily unavailable while the canonical name record of the existing data warehouse cluster is flipped to point to new data warehouse cluster. This period of unavailability typically lasts only a few minutes, and will occur during the maintenance window for your data warehouse cluster, unless you specify that the modification should be applied immediately. Amazon Redshift moves data in parallel from the compute nodes in your existing data warehouse cluster to the compute nodes in your new cluster. This enables your operation to complete as quickly as possible.
Amazon Redshift uses industry-standard SQL and is accessed using standard JDBC and ODBC drivers. We have validated integrations with popular BI and ETL vendors. You can also go to AWS Marketplace to deploy solutions designed to work with Amazon Redshift. Here, instead of spending time integrating a variety of technologies to build a complete BI stack, you can find solutions from third party ETL and BI vendors that can be automatically deployed and configured in minutes.
Metrics for compute utilization, memory utilization, storage utilization, and read/write traffic to your Amazon Redshift data warehouse cluster are available free of charge via the AWS Management Console or Amazon CloudWatch APIs. You can also add additional, user-defined metrics via Amazon Cloudwatch’s custom metric functionality. In addition to CloudWatch metrics, Amazon Redshift also provides information on query and cluster performance via the AWS Management Console. This information enables you to see which users and queries are consuming the most system resources and diagnose performance issues. In addition, you can see the resource utilization on each of your compute nodes to ensure that you have data and queries that are well balanced across all nodes.
You can think of the Amazon Redshift maintenance window as an opportunity to control when data warehouse cluster modifications (such as scaling data warehouse cluster by adding more nodes) and software patching occur, in the event either are requested or required. If a "maintenance" event is scheduled for a given week, it will be initiated and completed at some point during the thirty-minute maintenance window you identify.
Required patching is automatically scheduled only for patches that are security and durability related. Such patching occurs infrequently (typically once every few months). If you do not specify a preferred weekly maintenance window when creating your data warehouse cluster, a default value will be assigned. If you wish to modify when maintenance is performed on your behalf, you can do so by modifying your data warehouse cluster in the AWS Management Console or by using the ModifyCluster API. Each of your data warehouse clusters can have different preferred maintenance windows.