AWS Big Data Blog
Monitor and control the storage space of a schema with quotas with Amazon Redshift
September 2023: This post was reviewed for accuracy.
Yelp connects people with great local businesses. Since its launch in 2004, Yelp has grown from offering services for just one city—its headquarters home of San Francisco—to a multinational presence spanning major metros across more than 30 countries. The company’s performance-based advertising and transactional business model led to revenues of more than $500 million during 2015, a 46% increase over the prior year. Yelp has evolved into a mobile-centric company, with more than 70% of searches and more than 58% of content originating from mobile devices.
Yelp uses Amazon Redshift to analyze mobile app usage data and ad data on customer cohorts, auctions, and ad metrics. Yelp has benefited immediately from the new Amazon Redshift schema storage quota feature.
“Amazon Redshift is a managed data warehouse service that allows Yelp to focus on data analytics without spending time on database administration,” says Steven Moy, Lead Engineer for Yelp’s Metrics Platform. The Metrics Platform provides long-term persistent data storage and SQL-on-anything query capabilities for Yelp’s Engineering teams. “A key strategy for our data warehouse users to iterate quickly is to have a writable schema called ‘tmp’ for users to prototype various table schema. However, we occasionally faced challenges when there was not enough free space during a query execution, degrading the entire data warehouse query operation. With the new schema quota feature, we can provision a storage quota ceiling on the ‘tmp’ schema to safeguard runaway storage issues. We look forward to all the autonomous features coming from Amazon Redshift.”
Many organizations are moving toward self-service analytics, where different personas create their own insights from the growing volume, variety, and velocity of data to keep up with the acceleration of business. This data democratization creates the need to enforce data governance, control cost, and prevent data mismanagement. Controlling the storage quota of different personas is a significant challenge for data governance and data storage operation. This post shows you how to set up Amazon Redshift storage quotas for different personas.
Prerequisites
Before starting this walkthrough, you must have the following:
- An Amazon Redshift cluster. The US East (N. Virginia) Region is preferred because you need to load data from Amazon Simple Storage Service (Amazon S3) in us-east-1.
- A database user with superuser permission.
Setting up and testing the schema quota
To set up the environment and implement the use case, complete the following steps:
- Connect to your Amazon Redshift cluster using your preferred SQL client as a superuser or a user with CREATE SCHEMA privileges.
- Create the user sales with the following code:
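The original code sample is not reproduced here; a minimal sketch of the statement (the password is a placeholder for this walkthrough) might look like the following:

```sql
-- Placeholder password for this walkthrough; replace with your own
CREATE USER sales PASSWORD 'Sales1234!';
```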
The user in the preceding code is just for the purposes of this walkthrough. You should use identity federation in AWS Identity and Access Management (IAM) with a corporate identity rather than creating a user manually. For more information, see Authentication.
- Set up a schema quota with the CREATE SCHEMA command. See the following code:
The QUOTA parameter defines the maximum amount of disk space that the specified schema can use.
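The general syntax follows this pattern:

```sql
CREATE SCHEMA schema_name [ AUTHORIZATION username ]
  QUOTA { quota [ MB | GB | TB ] | UNLIMITED };
```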
- Create the schema sales_schema with a quota of 2 GB for the user sales. See the following code:
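A statement along these lines creates the schema with its quota:

```sql
CREATE SCHEMA sales_schema AUTHORIZATION sales QUOTA 2 GB;
```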
- Impersonate the user sales with the following code:
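For example:

```sql
SET SESSION AUTHORIZATION 'sales';
```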
- Create the tables region and lineitem with the following code:
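The exact DDL was not preserved here; a sketch based on the standard TPC-H table definitions would be:

```sql
CREATE TABLE sales_schema.region (
  r_regionkey int4,
  r_name      char(25),
  r_comment   varchar(152)
);

CREATE TABLE sales_schema.lineitem (
  l_orderkey      int8,
  l_partkey       int8,
  l_suppkey       int4,
  l_linenumber    int4,
  l_quantity      numeric(12,2),
  l_extendedprice numeric(12,2),
  l_discount      numeric(12,2),
  l_tax           numeric(12,2),
  l_returnflag    char(1),
  l_linestatus    char(1),
  l_shipdate      date,
  l_commitdate    date,
  l_receiptdate   date,
  l_shipinstruct  char(25),
  l_shipmode      char(10),
  l_comment       varchar(44)
);
```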
Single statement transaction
To test the single statement transaction use case, complete the following steps:
- Load data into the table region with the following code:
<Your-IAM-Role> is the ARN of your IAM role with the necessary permissions associated with your Amazon Redshift cluster.
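The COPY statement would resemble the following; the S3 path is a placeholder for the location of the TPC-H region data in us-east-1:

```sql
-- <Your-S3-Path> and <Your-IAM-Role> are placeholders
COPY sales_schema.region
FROM 's3://<Your-S3-Path>/region/'
IAM_ROLE '<Your-IAM-Role>'
REGION 'us-east-1'
DELIMITER '|';
```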
The COPY command should complete with five rows loaded. See the following code:
- Check the quota and disk usage for the schema from the system view svv_schema_quota_state. See the following code:
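A query along these lines shows the quota state:

```sql
SELECT TRIM(schema_name) AS schema_name, quota, disk_usage, disk_usage_pct
FROM svv_schema_quota_state
WHERE schema_name = 'sales_schema';
```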
The statement should show a 2,048 MB quota with 30 MB disk usage on your four-node DC2.large cluster. Disk usage varies with different instance types and number of nodes. See the following code:
- Load data into the table lineitem with the following code:
You should get an error that the transaction is aborted due to exceeding the disk space quota. See the following code:
Amazon Redshift checks each transaction for quota violations before committing the transaction. The size limit can exceed the quota temporarily within a transaction before it’s committed because the quota violation check occurs at the end of a transaction. That’s why you initially see a message that the load to table lineitem completed successfully. When a transaction exceeds the quota, Amazon Redshift aborts the transaction, prohibits subsequent ingestions, and reverts all changes until you free up disk space.
The quota violation occurrence is stored in the system table stl_schema_quota_violations.
- Run the following code to see the violation:
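A query along these lines returns the violation events for the schema:

```sql
SELECT userid, TRIM(schema_name) AS schema_name,
       quota, disk_usage, disk_usage_pct, timestamp
FROM stl_schema_quota_violations
WHERE schema_name = 'sales_schema'
ORDER BY timestamp DESC;
```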
It should show the quota, disk usage, disk usage percentage out of the configured quota, and the timestamp of when the violation happened. See the following code:
- Change the schema quota to 4 GB using the ALTER SCHEMA command. See the following code:
You have to be a database superuser or a user with CREATE SCHEMA permissions to set a schema quota during schema creation. However, only a superuser can change a schema quota. You use RESET SESSION AUTHORIZATION to reset the user impersonation back to the original superuser.
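For example:

```sql
-- Undo the impersonation so the superuser can change the quota
RESET SESSION AUTHORIZATION;
ALTER SCHEMA sales_schema QUOTA 4 GB;
-- Continue the walkthrough as the sales user
SET SESSION AUTHORIZATION 'sales';
```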
- Check the quota and disk usage for the schema with the following code:
The statement should return a 4,096 MB quota with 30 MB disk usage on your four-node DC2.large cluster. The failed COPY statement earlier automatically reclaims disk space when the transaction is aborted for a pristine table. See the following code:
- Rerun the COPY command to load data into the table lineitem with the following code:
It should return the following message:
- Check the quota and disk usage for the schema with the following code:
The statement should return a 4,096 MB quota and 2,798 MB disk usage on a four-node DC2.large cluster. Disk usage varies with different instance types and number of nodes. See the following code:
- Rerun the COPY command to load a large amount of data into the table lineitem to exceed the quota. See the following code:
It should return an error that the transaction is aborted due to exceeding the disk space quota. See the following code:
- Execute the following code to perform subsequent small COPY/INSERT/UPDATE statements after the quota violation:
All statements should return an error that the transaction is aborted due to exceeding the disk space quota. See the following code:
For non-pristine tables, although the offending transaction that exceeded the quota was aborted (for example, the COPY to lineitem), the transaction still consumes space and the schema is blocked until disk space is reclaimed. That is why the subsequent COPY/INSERT/UPDATE statements to the region table are aborted as well.
- Double-check the quota violation from system table with the following code:
It should return disk usage of 5,486 MB with a four-node DC2.large cluster. See the following code:
Single statement shrink-only transaction
A shrink-only transaction is a transaction composed solely of statements that have the potential to directly or indirectly reduce the amount of disk space used, such as DELETE, TRUNCATE, DROP TABLE, and VACUUM.
To test the single statement shrink-only transaction use case, complete the following steps:
- Delete all data from the lineitem table with the following code:
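A sketch of the delete and the follow-up count check:

```sql
DELETE FROM sales_schema.lineitem;
SELECT COUNT(*) FROM sales_schema.lineitem;
```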
It should return a record count of zero:
Although the schema quota was exceeded, the DELETE transaction is committed because shrink-only transactions are an exception to the quota violation check.
The TRUNCATE command is much more efficient than DELETE for deleting all data and doesn’t require a subsequent VACUUM and ANALYZE. The preceding code is for illustration purposes only.
- Rerun the COPY command to load data into the table lineitem with the following code:
It should return the following error message. See the following code:
Because DELETE doesn’t automatically free up disk space, the preceding COPY command exceeds the quota and the transaction is aborted.
- Double-check the quota violation from the system table with the following code:
The latest row should return a disk usage of 8,150 MB with a four-node DC2.large cluster. This shows that disk usage increases with the aborted transaction. See the following code:
- Run VACUUM to free up disk space:
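For example:

```sql
VACUUM sales_schema.lineitem;
```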
The schema quota is exceeded, but because Amazon Redshift disregards the quota violation and commits transactions in certain cases such as VACUUM, this completes without error.
Amazon Redshift now automatically runs the VACUUM DELETE operation in the background to reclaim disk space.
- Rerun the COPY command to load data into the table lineitem with the following code:
It should return the following message:
Because VACUUM reclaimed the disk space, the COPY command ran successfully.
CREATE TABLE AS (CTAS), ALTER TABLE APPEND, and transactions on a temporary schema
To test the CTAS statement, the ALTER TABLE APPEND statement, and transactions on a temporary schema, complete the following steps:
- Execute the CTAS statement to create another table using a query. See the following code:
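A sketch of such a CTAS statement (the table name lineitem_ctas is illustrative):

```sql
-- lineitem_ctas is a hypothetical name for this sketch
CREATE TABLE sales_schema.lineitem_ctas AS
SELECT * FROM sales_schema.lineitem;
```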
It should return that the transaction is aborted due to exceeding the disk space quota. CTAS is subject to similar quota checks in the same schema. See the following code:
- Create a temp table and populate data with the following code:
It should return 59,986,052 rows. The schema quota doesn’t consider temporary tables created as part of a temporary namespace or schema.
- Create the same table lineitem in the public schema and load data into it using the following code:
It should return 59,986,052 rows.
Ingestion into different schemas has no effect and the transaction is committed.
- Append data into the table lineitem in sales_schema from another table using the ALTER TABLE APPEND statement. See the following code:
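For example:

```sql
ALTER TABLE sales_schema.lineitem APPEND FROM public.lineitem;
```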
It should return an error that the transaction is aborted due to exceeding the disk space quota. The quota violation check applies to the full schema.
- Append data from the sales_schema lineitem table to another table in the public schema with the following code:
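For example:

```sql
ALTER TABLE public.lineitem APPEND FROM sales_schema.lineitem;
SELECT COUNT(*) FROM sales_schema.lineitem;
```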
It should return a record count of zero from the SELECT COUNT statement, and the disk usage drops. The ALTER TABLE APPEND statement empties the source table immediately and reclaims disk space when the transaction is complete. See the following code:
Concurrent transactions
To test the concurrent transactions use case, complete the following steps:
- Increase the quota to 6 GB using the following code:
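For example:

```sql
ALTER SCHEMA sales_schema QUOTA 6 GB;
```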
You need two separate SQL client sessions connected to the Amazon Redshift cluster; run the code in them concurrently for the following test.
- On the first session, enter the following code:
- While the first session is still running, on the second session, enter the following code:
The statement from the first session should return a warning message that the schema is close to its quota, but the statement still completes. See the following code:
The statement from the second session should return an error that the transaction is aborted due to exceeding the disk space quota. Because the first transaction from the first session was committed first, the disk space usage increased to close to the quota. The second transaction checks the quota violation before committing the transaction and thus aborts the transaction.
Reset quota below existing disk space usage
To test the use case of resetting the quota below existing disk space usage, complete the following step:
- Run the following code to reset the quota to 2 GB:
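For example:

```sql
ALTER SCHEMA sales_schema QUOTA 2 GB;
```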
You can set the quota below the existing disk space usage; however, all future ingestion transactions are aborted until you free disk space. See the following code:
Multi-statement transactions
To test the multi-statement transaction use case, complete the following steps:
- Enter the following code to test a multi-statement transaction:
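A sketch of such a transaction (the inserted values are illustrative):

```sql
BEGIN;
-- Illustrative values; any ingestion into the over-quota schema triggers the check
INSERT INTO sales_schema.region VALUES (100, 'test region', 'quota test row');
UPDATE sales_schema.region SET r_name = 'test region 2' WHERE r_regionkey = 100;
END;
```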
All INSERT/UPDATE/COPY transactions should be aborted because the quota is exceeded. See the following code:
- Enter the following code to test a multi-statement transaction with DROP TABLE in the same transaction:
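A sketch of such a transaction (the table name lineitem_copy is illustrative):

```sql
BEGIN;
-- lineitem_copy is a hypothetical name; the table is dropped before commit
CREATE TABLE sales_schema.lineitem_copy AS SELECT * FROM sales_schema.lineitem;
DROP TABLE sales_schema.lineitem_copy;
END;
```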
The transaction should be committed successfully, because the table that the COPY/INSERT/DELETE statements target is dropped within the same transaction. See the following code:
- Run shrink-only statements in a multi-statement transaction with the following code:
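A sketch of such a transaction (the predicates are illustrative):

```sql
BEGIN;
-- Both statements can only reduce disk usage, so the quota check is skipped
DELETE FROM sales_schema.region WHERE r_regionkey = 100;
DELETE FROM sales_schema.lineitem WHERE l_shipdate < '1992-02-01';
END;
```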
The transaction should be committed successfully; a transaction containing only shrink-only statements succeeds. See the following code:
Best practices
The following recommended practices can help you optimize your workload when using storage quotas:
- The schema feature works best when users are aligned with schemas – Use schemas to logically segregate users. For example, use a MAIN schema for the base and aggregated tables owned by data engineers, and an ANALYST schema, controlled with a storage quota, owned by data analysts who perform their own aggregation by querying the base data from the MAIN schema into the ANALYST schema. The data analysts only have read permissions on the MAIN schema but read and write permissions on the ANALYST schema.
- Revoke usage on the public schema – All users have CREATE and USAGE privileges on the PUBLIC schema of a database by default. Because the storage quota isn’t applicable to the PUBLIC schema by default, all users should be disallowed from creating objects in the PUBLIC schema of a database. They should instead use their own aligned schemas, as per the previous best practice. Use the REVOKE command to remove that privilege.
- Use physical tables instead of temporary tables – Because temporary tables aren’t subject to the storage quota, use physical tables instead, but remember to drop them at the end of the transaction. An alternative is to create a separate schema, with a storage quota, to cater to temporary physical tables; instead of using temporary tables, create physical tables in the dedicated schema. This eases administration, and you can automate cleanup of tables in this dedicated schema on a regular basis.
- Use data shrinking operations – Use data shrinking operations like DELETE, VACUUM DELETE, DROP, and TRUNCATE to reclaim disk space, especially if usage is nearing the quota.
- Use workload management (WLM) with query priority to control resources for different users – Data ingestion and aggregation performed by different users might be resource-intensive and impact the throughput of other mission-critical loads or queries running on the cluster. Amazon Redshift Advisor automatically analyzes the current WLM usage and can make recommendations to get more throughput from your cluster. Periodically reviewing the suggestions from Advisor helps you get the best performance.
Monitoring
You should periodically monitor for any storage quota violations so that you can take corrective action and provide business continuity. The system table stl_schema_quota_violations is a good resource to get quota violation events. To query the violation information, enter the following code:
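A query along these lines returns the violation history across all schemas:

```sql
SELECT userid, TRIM(schema_name) AS schema_name,
       quota, disk_usage, disk_usage_pct, timestamp
FROM stl_schema_quota_violations
ORDER BY timestamp DESC;
```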
You can automate the execution of the preceding query and send notifications to users if any violations occur using the following solutions:
- AWS Lambda – The Amazon Redshift WLM query monitoring rule (QMR) action notification utility is a good example of this solution. This utility queries the stl_wlm_rule_action system table and publishes the record to Amazon Simple Notification Service (Amazon SNS). You can modify the Lambda function to query stl_schema_quota_violations instead of the stl_wlm_rule_action system table to query quota violation events.
- Amazon CloudWatch metrics for Amazon Redshift – Use the following Amazon CloudWatch metrics to get useful insight about the schema quota:
- NumExceededSchemaQuotas – (Overall) The number of schemas with exceeded quotas.
- SchemaQuota – (Per schema) The configured quota for a schema.
- StorageUsed – (Per schema) The disk or storage space used by a schema.
- PercentageQuotaUsed – (Per schema) The percentage of disk or storage space used relative to the configured schema quota.
Cleaning up
When you’re done with testing, you can remove the objects and users with the following code:
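A sketch of the cleanup statements:

```sql
-- Make sure you are no longer impersonating the sales user
RESET SESSION AUTHORIZATION;
DROP TABLE IF EXISTS public.lineitem;
DROP SCHEMA sales_schema CASCADE;  -- also drops the tables in the schema
DROP USER sales;
```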
Summary
Amazon Redshift supports stringent compliance and security requirements with no extra cost, which makes it ideal for highly regulated industries. Data storage operation with quotas is part of an important data governance framework that provides compliance and operational efficiency. A simple QUOTA parameter in a CREATE/ALTER SCHEMA statement enables disk usage quotas on an Amazon Redshift schema. You can assign Amazon Redshift database users who represent personas in the organization to different schemas with quotas to control the amount of disk space they can consume. This is crucial for data governance and cost control, and avoids operational issues like disk full errors, especially if users keep creating objects and ingesting data, which might halt other mission-critical analytic operations.
This post showed you how easy it is to control the storage quota for Amazon Redshift using schemas. You went through the steps to create and alter the quota on a schema and authorize specific users to the schema. You also explored different use cases of ingesting data into the schema and monitoring the different outcomes and disk usage using the system view. You can adopt this feature to support your business needs.
About the Authors
BP Yau is a Data Warehouse Specialist Solutions Architect at AWS. His role is to help customers architect big data solutions to process data at scale. Before AWS, he helped Amazon.com Supply Chain Optimization Technologies migrate the Oracle Data Warehouse to Amazon Redshift and built the next generation big data analytics platform using AWS technologies.
Ahmad Barghout is a Software Development Engineer on the Amazon Redshift storage team. He has worked on cluster resize optimizations and currently works on ease-of-use features such as schema quota. Before that, he was a full stack developer at Cisco. He holds a master’s degree from The University of Ottawa.
Himanshu Raja is a Principal Product Manager for Amazon Redshift. Himanshu loves solving hard problems with data and cherishes moments when data goes against intuition. In his spare time, Himanshu enjoys cooking Indian food and watching action movies.
Audit History
Last reviewed in September 2023 by Benita Owoghiri, Acceleration Lab SA