Q: What is data warehousing?
Analytics is ubiquitous. We all use reports and dashboards to manage our work, report our progress to stakeholders, and perform ad-hoc analytics to support decision making. Under the hood, these reports, dashboards, and BI tools are powered by data warehouses, which store data efficiently to minimize I/O and deliver query results at blazing speeds to hundreds or thousands of concurrent users. Unlike transactional databases, data warehouses use specialized architectures and storage for fast query and data load performance. Data warehouses also need to be highly scalable so that you can keep adding data sources to enrich analytics and insights. Lastly, data warehouses should integrate seamlessly with third-party business intelligence tools and SQL clients, and support standard SQL so that customers can use skills they already have.
Q: Why should I run data warehousing on AWS?
Amazon Redshift, our data warehousing solution, is fast, easy to use, and fully managed. It automates infrastructure provisioning and administrative tasks such as backups, replication, and patching. It integrates seamlessly with third-party BI and ETL tools, so you can get to your first report in just a few minutes. There is also no limit to the amount of data you can load and analyze. As your data grows, you don’t have to worry about expensive system upgrades or slow performance. Amazon Redshift is fast at any scale because it uses columnar storage and several optimization techniques. Amazon Redshift is also cost-effective: you pay only for what you use. The bottom line is that you can have an unlimited number of users running unlimited analytics on all your data for just $1,000 per terabyte per year.
Q: What is Amazon Redshift?
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse that makes it simple and cost-effective to analyze all your data using your existing business intelligence tools. Start small for $0.25 per hour with no commitments and scale to petabytes for $1,000 per terabyte per year, less than a tenth of the cost of traditional solutions. Customers typically see 3x compression, reducing their costs to about $333 per uncompressed terabyte per year.
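The compression arithmetic above can be checked with a short sketch. The prices and the 3x ratio are the figures quoted in this answer; the function and constant names are illustrative only.

```python
# Figures quoted in the answer above (USD per compressed terabyte per year).
PRICE_PER_COMPRESSED_TB_YEAR = 1000.0
TYPICAL_COMPRESSION_RATIO = 3.0  # "customers typically see 3x compression"


def cost_per_uncompressed_tb(price_per_compressed_tb: float,
                             compression_ratio: float) -> float:
    """Effective yearly cost of one terabyte of data before compression."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    return price_per_compressed_tb / compression_ratio


effective = cost_per_uncompressed_tb(PRICE_PER_COMPRESSED_TB_YEAR,
                                     TYPICAL_COMPRESSION_RATIO)
print(f"${effective:.0f} per uncompressed TB per year")
```

Your own ratio depends on your data, so substitute the compression you actually observe.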
Q: How does the performance of Amazon Redshift compare to most traditional databases for data warehousing and analytics?
Amazon Redshift uses a variety of innovations to achieve up to ten times higher performance than traditional databases for data warehousing and analytics workloads:
- Massively parallel: Amazon Redshift delivers fast query performance on datasets ranging in size from gigabytes to exabytes. Redshift uses columnar storage, data compression, and zone maps to reduce the amount of I/O needed to perform queries. It uses a massively parallel processing (MPP) data warehouse architecture to parallelize and distribute SQL operations across all available resources. The underlying hardware is designed for high-performance data processing, using locally attached storage to maximize throughput between the CPUs and drives, and a high-bandwidth mesh network to maximize throughput between nodes.
- Machine learning: Amazon Redshift uses machine learning to deliver high throughput, irrespective of your workloads or concurrent usage. Redshift uses sophisticated algorithms to predict the run time of each incoming query and assigns the query to the optimal queue for the fastest processing. For example, queries with high concurrency requirements, such as those behind dashboards and reports, are routed to an express queue for immediate processing. As concurrency increases further, Amazon Redshift predicts when queuing may begin and automatically deploys transient resources with the Concurrency Scaling feature to ensure consistently fast performance, irrespective of variability in demand on the cluster.
- Result caching: Amazon Redshift uses result caching to deliver sub-second response times for repeat queries. Dashboard, visualization, and business intelligence tools that execute repeat queries experience a significant performance boost. When a query executes, Redshift searches the cache to see if there is a cached result from a prior run. If a cached result is found and the data has not changed, the cached result is returned immediately instead of re-running the query.
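The result-caching behavior described in the last bullet can be illustrated with a toy model. This is a minimal sketch of the general idea (reuse a result only if the query repeats and the data has not changed), not Redshift's actual implementation. The `ResultCache` class, the single global `data_version` counter, and the fake backend are all assumptions made for illustration.

```python
from typing import Any, Callable, Dict, Tuple


class ResultCache:
    """Toy result cache: reuse a prior result only if the data is unchanged."""

    def __init__(self, run_query: Callable[[str], Any]) -> None:
        self._run_query = run_query
        # Maps query text -> (data version when cached, cached result).
        self._cache: Dict[str, Tuple[int, Any]] = {}
        self.data_version = 0  # Hypothetical stand-in for "has the data changed?"

    def mark_data_changed(self) -> None:
        """Call after any load or update; older cached results become stale."""
        self.data_version += 1

    def execute(self, sql: str) -> Tuple[Any, bool]:
        """Return (result, cache_hit)."""
        key = sql.strip()
        cached = self._cache.get(key)
        if cached is not None and cached[0] == self.data_version:
            return cached[1], True  # Same query, same data: skip re-running.
        result = self._run_query(key)
        self._cache[key] = (self.data_version, result)
        return result, False


calls = []


def fake_backend(sql: str) -> list:
    """Stand-in for real query execution; records each actual run."""
    calls.append(sql)
    return [("rows for", sql)]


cache = ResultCache(fake_backend)
_, hit1 = cache.execute("SELECT count(*) FROM sales")
_, hit2 = cache.execute("SELECT count(*) FROM sales")  # Repeat query.
cache.mark_data_changed()                              # Simulate a data load.
_, hit3 = cache.execute("SELECT count(*) FROM sales")
print(hit1, hit2, hit3)
```

Only the second call is served from the cache: the first has nothing cached yet, and the third follows a data change, so the query runs again.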
Q: How do I access my running data warehouse cluster?
Once your data warehouse cluster is available, you can retrieve its endpoint and its JDBC and ODBC connection strings from the AWS Management Console or by using the Redshift APIs. You can then use these connection strings with your favorite database tool, programming language, or Business Intelligence (BI) tool. You will also need to authorize network requests to your running data warehouse cluster. For a detailed explanation, please refer to our Getting Started Guide.
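As a rough illustration of what such a connection string looks like, the sketch below assembles a JDBC URL from an endpoint. It assumes the `jdbc:redshift://host:port/database` form used by the Amazon Redshift JDBC driver; the host name, database name, and helper function are hypothetical, and the string shown in your console is the authoritative one.

```python
def build_jdbc_url(host: str, port: int, database: str) -> str:
    """Assemble a JDBC URL, assuming the jdbc:redshift://host:port/db form."""
    if not host or not database:
        raise ValueError("host and database are required")
    if not 0 < port < 65536:
        raise ValueError("port must be between 1 and 65535")
    return f"jdbc:redshift://{host}:{port}/{database}"


# Hypothetical endpoint values; copy the real ones from the console.
url = build_jdbc_url(
    host="examplecluster.abc123xyz789.us-west-2.redshift.amazonaws.com",
    port=5439,  # Commonly the default Redshift port; verify for your cluster.
    database="dev",
)
print(url)
```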
Q: Is Amazon Redshift compatible with my preferred business intelligence software package and ETL tools?
Amazon Redshift uses industry-standard SQL and is accessed using standard JDBC and ODBC drivers. You can download Amazon Redshift custom JDBC and ODBC drivers from the Connect Client tab of our Console. We have validated integrations with popular BI and ETL vendors, a number of which are offering free trials to help you get started loading and analyzing your data. You can also go to the AWS Marketplace to deploy and configure solutions designed to work with Amazon Redshift in minutes.
Q: How do I get started with Amazon Redshift?
You can try Amazon Redshift for free. If you’ve never created an Amazon Redshift cluster, you’re eligible for a two-month free trial of our DC1.Large node. You get 750 hours per month for free, enough hours to continuously run one DC1.Large node with 160 GB of compressed SSD storage. You can also build clusters with multiple nodes to test larger data sets, which will consume your free hours more quickly. Once your two-month free trial expires or your usage exceeds 750 hours per month, you can shut down your cluster to avoid any charges, or keep it running at our standard On-Demand rate.
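To see how quickly a multi-node cluster uses up the free hours, here is a small sketch. It assumes each node accrues hours independently, so a cluster of N nodes consumes N node-hours per hour of wall-clock time; the function names are illustrative only.

```python
FREE_HOURS_PER_MONTH = 750  # Free-trial allowance quoted above.


def node_hours(nodes: int, hours_running: float) -> float:
    """Node-hours consumed, assuming every node is billed for each hour."""
    return nodes * hours_running


def hours_until_free_tier_exhausted(nodes: int) -> float:
    """Wall-clock hours a cluster of this size can run within the allowance."""
    if nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return FREE_HOURS_PER_MONTH / nodes


# One node running around the clock in a 31-day month.
one_node_full_month = node_hours(1, 24 * 31)
print(one_node_full_month, one_node_full_month <= FREE_HOURS_PER_MONTH)

# A four-node test cluster runs out of free hours much sooner.
print(hours_until_free_tier_exhausted(4))
```

Under this assumption, a single node fits within the allowance even in a 31-day month (744 hours), while a four-node cluster exhausts it after 187.5 hours, a little under eight days.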