AWS Storage Blog

Getting visibility into storage usage in multi-tenant Amazon S3 buckets

SaaS providers with multi-tenant environments use cloud solutions to dynamically scale their workloads as customer demand increases. As their cloud footprint grows, having visibility into each end-customer’s storage consumption becomes important so they can distribute resources accordingly. An organization can use per-customer (tenant) storage usage data to adjust its pricing model or better plan its budget. In addition, a specific tenant may be charged differently according to the volume of data it stores.

Organizations with multi-tenant architectures often look to AWS for ways to help them identify and assign costs to their individual customers. There are Software-as-a-Service (SaaS) providers that use a multi-tenant model, storing end-customer data in unique prefixes within the same Amazon Simple Storage Service (Amazon S3) bucket. This architecture enables them to easily scale and makes it simple to use other AWS services to perform different data transformations on a single multi-tenant bucket rather than a unique bucket for each end customer.

In this post, we cover how to estimate end-customer usage of Amazon S3 storage in a multi-tenant environment based on the bucket prefix separation. We use Amazon S3 Inventory to gather object metadata, AWS Lambda and Amazon EventBridge to schedule the daily data aggregation process, and Amazon Athena to query the data. With this solution, you can get visibility into your end-customer’s cloud storage consumption and use this data for business model adjustment, forecasting, and more accurate tenant charging.

AWS Partner spotlight: Kaltura

This solution was developed in partnership with AWS Partner Kaltura. Kaltura had the following to say about how estimating Amazon S3 storage costs helped them:

“At Kaltura, we store tens of PBs of data, and our main Amazon S3 cost is the storage itself. Getting accurate visibility into the storage consumption of our customers allows us to have productive discussions on how we are managing our cloud storage costs. With this solution, developed in partnership with AWS, we can more accurately forecast future spend and also better understand our customers’ needs, thereby allowing us to improve our customers’ experience.”

– Omer Kolodny, Data and Foundation Group Leader

Walkthrough

AWS provides a wide range of tools to understand and analyze the cost of various services. For example, Amazon S3 offers cost allocation tags that you can assign to S3 buckets. You can use these tags to organize your resources in your cost allocation report, which enables you to assign costs to different departments or teams. However, these tags apply at the bucket level; you can also tag the objects under a prefix, but object tags cannot be set as cost allocation tags.

To accurately identify and allocate storage costs in a multi-tenant bucket, we complete the following steps:

  1. Analyze the bucket’s multi-tenant structure and construct the Amazon S3 Inventory report.
  2. Calculate the S3 Inventory report cost per tenant.
  3. Create the Amazon Athena S3 output folder, and prepare database objects to store tenant data.
  4. Create an AWS Lambda function to run daily after the Amazon S3 Inventory report is delivered, and schedule it to run every day by using Amazon EventBridge.
  5. Prepare several ready-to-use views in Athena.

Prerequisites

Intermediate knowledge of SQL and Python is required.

1. Analyzing bucket multi-tenant structure and constructing the Amazon S3 Inventory report

To demonstrate the solution, we created a synthetic environment. In this sample bucket, each prefix represents a different end-customer. Note in the following figure (Figure 1) that each prefix holds a different number of objects, and each object has a different size.

Single bucket with multiple tenants (prefixes). The prefix name has the pattern "customer_id=CUSTOMER_IDENTIFIER"

Figure 1: Single bucket with multiple tenants (prefixes)

In Figure 2, you can see some of the contents of one of the prefixes in the aforementioned bucket.

Content of the prefix “customer_id=1/”. Different files with different sizes

Figure 2: Content of the prefix “customer_id=1/”

Our first goal is to get a list of all the objects in the bucket. Although we can do this using the ListObjects S3 API, we can do this more efficiently through Amazon S3 Inventory. This Amazon S3 feature generates a list of all the objects and their corresponding metadata in an S3 bucket. You can schedule Amazon S3 Inventory reports to run daily or weekly. To enable Amazon S3 Inventory, go to your S3 bucket and select Management. Then, choose Create Inventory Configuration under Inventory Configuration. You can create the report with the standard settings for the purposes of this post, as seen in Figure 3.

Amazon S3 Inventory configuration

Figure 3: Amazon S3 Inventory configuration
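
If you prefer to enable the inventory programmatically rather than through the console, the following is a minimal boto3 sketch. The bucket names, configuration ID, and destination prefix are placeholders for your own values, and we include all object versions so that the report contains the version columns used in the table definition later in this post.

import boto3

s3 = boto3.client("s3")

# Placeholder names: replace with your multi-tenant bucket and inventory destination bucket
s3.put_bucket_inventory_configuration(
    Bucket="my-multi-tenant-bucket",
    Id="daily-inventory",
    InventoryConfiguration={
        "Id": "daily-inventory",
        "IsEnabled": True,
        # Include all versions so the report has VersionId, IsLatest, and IsDeleteMarker columns
        "IncludedObjectVersions": "All",
        "Destination": {
            "S3BucketDestination": {
                "Bucket": "arn:aws:s3:::INVENTORY_BUCKET",
                "Format": "CSV",
                "Prefix": "inventory",
            }
        },
        "Schedule": {"Frequency": "Daily"},
        "OptionalFields": ["Size", "LastModifiedDate", "StorageClass"],
    },
)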

To get accurate visibility into costs, we must query our inventory report each day to identify any changes in the storage of our prefixes, which hold our end-customer data. Using our Amazon S3 Inventory report, we extract the relevant information and store it in the external table to query.

Note that Amazon S3 Inventory reports are a snapshot of the objects in an S3 bucket at the moment the report is created. This means that some objects may be uploaded and others deleted between the Amazon S3 Inventory report’s creation and the following Amazon S3 Inventory report’s creation.

To work with Athena tables, we must define an Athena database, which is a logical grouping for tables. You can find more information about creating the database in the documentation.

CREATE DATABASE s3_pricing

We must define the table to hold the inventory data. More information about querying inventory data can be found in this documentation.

CREATE EXTERNAL TABLE s3_inventory_report(
   bucket string,
   key_name string,
   VersionId string,
   IsLatest boolean,
   IsDeleteMarker boolean,
   Size bigint,
   LastModifiedDate string,
   StorageClass string
  )
  PARTITIONED BY (
        dt string
  )
  ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
  STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
  LOCATION 's3://INVENTORY_BUCKET/hive/';

You should replace s3://INVENTORY_BUCKET/hive/ with the correct inventory location (the hive folder of your inventory report destination).

2. S3 Inventory report cost per tenant

Our inventory report itself is the first candidate for cost contribution calculation. For this post, we use the Northern Virginia AWS Region for pricing. According to the official Amazon S3 pricing page, the Northern Virginia Region price for the inventory report is $0.0025 per million objects listed (as of the publishing of this post). The following SQL query shows the cost of producing the inventory report for each customer (tenant):

select cast(count(1)/1000000.0*0.0025 as DECIMAL(20, 10)) price,
       customer_id
from
(
    select dt,
           regexp_extract(
               regexp_extract(url_decode(key_name), 'customer_id=[\d]{0,4}'),
               '[\d]{0,4}$'
           ) customer_id
    from s3_inventory_report
)
group by customer_id

Figure 4 shows an example output, with price and customer_id as columns:

Query result for inventory cost per tenant

Figure 4: Inventory cost example per tenant

3. Create Amazon Athena S3 output folder, and prepare database objects to store tenant data

Using the following command, we create another table to hold aggregation data for our storage costs.

CREATE EXTERNAL TABLE s3_cost_per_tenant
(
    customer_id    string,
    storage_class  string,
    size_gb        bigint
)
    PARTITIONED BY (dt string)
    STORED AS PARQUET
    LOCATION 's3://my-bucket/s3_cost_per_tenant/';

You should replace s3://my-bucket/s3_cost_per_tenant/ with the correct location for your environment. After creating the table, run MSCK REPAIR TABLE; otherwise, queries against the table return an empty result set. You can read about this command in the documentation.
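
If you prefer to load the partitions programmatically, here is a minimal sketch that runs MSCK REPAIR TABLE through the Athena API. The database name matches the one created earlier, and the query result location is a placeholder; the same helper can also load the partitions of s3_inventory_report.

import time

import boto3

athena = boto3.client("athena")


def repair_table(table_name, output_location="s3://my-athena-results/"):
    """Load newly delivered partitions for an Athena table (placeholder result location)."""
    query_id = athena.start_query_execution(
        QueryString=f"MSCK REPAIR TABLE {table_name}",
        QueryExecutionContext={"Database": "s3_pricing"},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]

    # Poll until the repair finishes
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return state
        time.sleep(1)


repair_table("s3_cost_per_tenant")
repair_table("s3_inventory_report")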

Our example only analyzes the storage costs for objects in our unversioned S3 bucket. However, Amazon S3 Inventory also lets you perform the analysis for versioned buckets, where you can analyze both current and noncurrent versions of S3 objects.

Running the following basic query against the inventory table returns the results shown in Figure 5: unaggregated data with some extra, unnecessary information.

select bucket,
 url_decode(key_name) object_name,
 VersionId,
 IsLatest,
 IsDeleteMarker,
 Size,
 from_iso8601_timestamp(lastmodifieddate) lastmodifieddate,
       StorageClass
from s3_inventory_report
where isLatest = true

Basic query from inventory table

Figure 5: Basic query from inventory table

As shown in Figure 6, we can use the following query to extract only the information relevant to our storage cost analysis, such as the storage class and the data stored in GB, and drop the extraneous fields.

select regexp_extract(
       regexp_extract(url_decode(key_name), 'customer_id=[\d]{0,4}'),
                '[\d]{0,4}$'
                ) customer_id,
       storageclass,
       sum(size)/1024/1024/1024 size_gb,
       dt
from s3_inventory_report
where dt='2022-12-06-01-00'
group by 1,2,4

More accurate data from the inventory table with "customer_id", "storage" and "storage_size" columns

Figure 6: More accurate data from inventory table

4. Create an AWS Lambda function to run daily and schedule it to run every day by using Amazon EventBridge

Now the picture is clearer: we can see the size of the objects per customer on a specific day. The idea is to use an Amazon EventBridge rule, scheduled to run every day, to invoke an AWS Lambda function that updates the aggregation table with the previous day’s data. The default Lambda timeout is three seconds. We suggest increasing this value to 30 seconds, as shown in Figure 7.

Lambda Timeout settings

Figure 7: Lambda Timeout settings
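
You can also change the timeout programmatically; here is a minimal sketch, where the function name is a placeholder for your own:

import boto3

lambda_client = boto3.client("lambda")

# Raise the timeout of the aggregation function from the 3-second default to 30 seconds
lambda_client.update_function_configuration(
    FunctionName="s3-cost-per-tenant-aggregation",  # placeholder name
    Timeout=30,
)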

You can find a basic example of invoking an Amazon Athena query from Python, including the AWS Identity and Access Management (IAM) permissions needed, on AWS re:Post.
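
The exact permissions depend on your environment, but as a rough sketch, the function’s execution role needs Athena query permissions, AWS Glue Data Catalog access for the MSCK REPAIR TABLE step, and S3 access to the inventory, aggregation, and query result buckets. All names in the following sketch are placeholders:

import json

import boto3

iam = boto3.client("iam")

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["athena:StartQueryExecution", "athena:GetQueryExecution"],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": [
                "glue:GetDatabase",
                "glue:GetTable",
                "glue:GetPartitions",
                "glue:BatchCreatePartition",
            ],
            "Resource": "*",
        },
        {
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket", "s3:GetBucketLocation"],
            "Resource": [
                "arn:aws:s3:::INVENTORY_BUCKET",
                "arn:aws:s3:::INVENTORY_BUCKET/*",
                "arn:aws:s3:::my-bucket",
                "arn:aws:s3:::my-bucket/*",
                "arn:aws:s3:::s3-blog-athena-result",
                "arn:aws:s3:::s3-blog-athena-result/*",
            ],
        },
    ],
}

iam.put_role_policy(
    RoleName="s3-cost-lambda-role",  # placeholder execution role
    PolicyName="s3-cost-athena-access",
    PolicyDocument=json.dumps(policy),
)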

Here is the AWS Lambda function that updates the aggregation table with the previous day’s data:

from datetime import datetime
import time

import boto3

ATHENA_CLIENT = boto3.client("athena")


def create_queries():
    insert_query = (
        "INSERT INTO s3_cost_per_tenant "
        "SELECT customer_id, storageclass, size_gb, dt "
        "FROM ("
        "  SELECT REGEXP_EXTRACT("
        "    REGEXP_EXTRACT("
        "      url_decode(key_name),"
        "      'customer_id=[\\d]{0,4}'"
        "    ),"
        "    '[\\d]{0,4}$'"
        "  ) customer_id, "
        "  CASE "
        "    WHEN storageclass='INTELLIGENT_TIERING' AND size/1024 < 128 "
        "    THEN 'STANDARD' "
        "    ELSE storageclass "
        "  END "
        "  storageclass, "
        "  SUM(size)/1024/1024/1024 size_gb, "
        "  dt "
        "  FROM s3_inventory_report "
        f"  WHERE dt = '{datetime.now().strftime('%Y-%m-%d-01-00')}' "
        "   AND size <> 0 "
        "  GROUP BY 1, 2, 4"
        ") storage_breakdown"
    )
    return [
        "MSCK REPAIR TABLE s3_inventory_report",
        insert_query,
    ]


def execute_query(query):
    response = ATHENA_CLIENT.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": "s3_pricing"},
        ResultConfiguration={"OutputLocation": "s3://s3-blog-athena-result/"},
    )
    return response["QueryExecutionId"]


def wait_for_query_to_complete(query_execution_id):
    while True:
        response_query_execution = ATHENA_CLIENT.get_query_execution(
            QueryExecutionId=query_execution_id
        )
        state = response_query_execution["QueryExecution"]["Status"]["State"]

        if state in ("FAILED", "CANCELLED"):
            print(response_query_execution)
            raise Exception(f"Query did not succeed; state = {state}")
        elif state == "SUCCEEDED":
            print("Query succeeded")
            return

        time.sleep(1)


def lambda_handler(event, context):
    for query in create_queries():
        query_execution_id = execute_query(query)
        wait_for_query_to_complete(query_execution_id)
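
To schedule the function, create an Amazon EventBridge rule that runs every day after the inventory report is delivered and grant it permission to invoke the function. Here is a minimal sketch; the rule name, schedule time, and Lambda function ARN are placeholders, and you should choose a time that falls after your inventory delivery:

import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Placeholder ARN of the aggregation Lambda function
function_arn = "arn:aws:lambda:us-east-1:111122223333:function:s3-cost-per-tenant-aggregation"

# Run daily at 06:00 UTC (adjust to a time after your inventory report delivery)
rule_arn = events.put_rule(
    Name="daily-s3-cost-aggregation",
    ScheduleExpression="cron(0 6 * * ? *)",
    State="ENABLED",
)["RuleArn"]

# Allow EventBridge to invoke the function
lambda_client.add_permission(
    FunctionName=function_arn,
    StatementId="AllowEventBridgeInvoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)

# Point the rule at the Lambda function
events.put_targets(
    Rule="daily-s3-cost-aggregation",
    Targets=[{"Id": "s3-cost-aggregation", "Arn": function_arn}],
)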

In developing the query that appears in the preceding function, we were careful to account for the following Amazon S3 Intelligent-Tiering storage class behaviors:

  1. The S3 Intelligent-Tiering storage class only auto-tiers objects larger than 128 KB. Objects smaller than 128 KB are charged at the S3 Standard storage class rate.
  2. S3 objects in the S3 Intelligent-Tiering storage class can move between access tiers based on customer access patterns. It’s possible that an object was in one tier for half of the month and moved to another tier in the second half. To estimate costs more accurately, we must account for this behavior. We store the “storage_class” data per day, which we use to calculate the exact portion of the month that the object was stored in a specific tier.

And here is what we didn’t consider:

  1. For the purpose of the demo, we only focused on the default S3 Intelligent-Tiering access tiers. If you use the optional asynchronous archive access tiers (Archive Access and Deep Archive Access), then you must add those tiers to your logic.

5. Prepare several ready-to-use views in Athena

We can work directly with the s3_cost_per_tenant table. However, we also decided to create several views to provide different insights. We create a view to calculate the cost per customer for the current month, as shown in the following statement:

create or replace view cost_per_month_view as
select customer_id,
       ROUND
       (
           (case 
           when (size_gb/1024) < 50 then size_gb*0.023
           when (size_gb/1024) >= 50 and  (size_gb/1024) <= 450 then 50*1024*0.023+(size_gb/1024-50)*1024*0.022 
           when (size_gb/1024) > 450    then 50*1024*0.023+400*1024*0.022+  (size_gb/1024-450)*1024*0.021 
           end)*total_rec/day(date_trunc('month', current_date) + interval '1' month - interval '1' day),
       2) cost 
from
(
select customer_id,
       sum(size_gb) size_gb,
       count(storage_class) total_rec
from s3_cost_per_tenant 
where month(date_parse(dt,'%Y-%m-%d-%h-00'))=month(current_date)
and storage_class = 'STANDARD'
group by customer_id
)
union
select customer_id,
       ROUND(size_gb*0.0125*total_rec/day(date_trunc('month', current_date) + interval '1' month - interval '1' day),2) cost
from
(
select customer_id,
       sum(size_gb) size_gb,
       count(storage_class) total_rec
from s3_cost_per_tenant 
where month(date_parse(dt,'%Y-%m-%d-%h-00'))=month(current_date)
and storage_class = 'STANDARD_IA'
group by customer_id
)
union
select customer_id,
       ROUND(size_gb*0.01*total_rec/day(date_trunc('month', current_date) + interval '1' month - interval '1' day),2) cost
from
(
select customer_id,
       sum(size_gb) size_gb,
       count(storage_class) total_rec
from s3_cost_per_tenant 
where month(date_parse(dt,'%Y-%m-%d-%h-00'))=month(current_date)
and storage_class = 'ONEZONE_IA'
group by customer_id
)
union
select customer_id,
       ROUND(size_gb*0.0036*total_rec/day(date_trunc('month', current_date) + interval '1' month - interval '1' day),2) cost
from
(
select customer_id,
       sum(size_gb) size_gb,
       count(storage_class) total_rec
from s3_cost_per_tenant 
where month(date_parse(dt,'%Y-%m-%d-%h-00'))=month(current_date)
and storage_class = 'INTELLIGENT_TIERING'
group by customer_id
)

As we mentioned previously, it is important to know how many days during the month objects were in each S3 Intelligent-Tiering access tier. As in the preceding query, we use the official price list for the Northern Virginia Region to calculate the price of each storage class. It is also possible to use the AWS Price List API to get the current cost of each service in each AWS Region. Using this API is beyond the scope of this post, but you could write a Lambda function to extract this data, populate custom tables in Athena, and use the most up-to-date prices in the calculations.
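
As a rough illustration of what such a Lambda function could do, the following sketch queries the AWS Price List API for S3 storage products in the Northern Virginia Region. The Price List endpoint is served from us-east-1, and the filter fields shown are assumptions you should verify against your own price list output:

import json

import boto3

# The Price List API is available in selected Regions, such as us-east-1
pricing = boto3.client("pricing", region_name="us-east-1")

paginator = pricing.get_paginator("get_products")
pages = paginator.paginate(
    ServiceCode="AmazonS3",
    Filters=[
        {"Type": "TERM_MATCH", "Field": "location", "Value": "US East (N. Virginia)"},
        {"Type": "TERM_MATCH", "Field": "productFamily", "Value": "Storage"},
    ],
)

for page in pages:
    for price_item in page["PriceList"]:
        product = json.loads(price_item)
        attributes = product["product"]["attributes"]
        # Each product describes a storage class; the tiered USD prices are nested
        # under product["terms"]["OnDemand"]
        print(attributes.get("volumeType"), attributes.get("storageClass"))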

Note that, except for the S3 Standard storage class, the per-GB storage price is the same regardless of the amount of data stored. For S3 Standard, the more data stored, the lower the price to store additional data. This is the reason for the CASE WHEN statement in our preceding Athena query.
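
To make the tiered S3 Standard logic easier to follow, here is the same CASE WHEN expression written as a small Python function, using the same prices and breakpoints as the preceding view:

def standard_monthly_cost(size_gb: float) -> float:
    """Tiered S3 Standard storage cost, mirroring the CASE WHEN in the view."""
    size_tb = size_gb / 1024
    if size_tb < 50:
        return size_gb * 0.023
    if size_tb <= 450:
        return 50 * 1024 * 0.023 + (size_tb - 50) * 1024 * 0.022
    return 50 * 1024 * 0.023 + 400 * 1024 * 0.022 + (size_tb - 450) * 1024 * 0.021


# For example, a tenant storing 100 TB in S3 Standard for a full month
print(round(standard_monthly_cost(100 * 1024), 2))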

We calculate the number of days that objects “spent” in a specific tier by counting the daily records per customer and storage class (the total_rec column). Since our original data is aggregated per day, this column holds exactly how many days in the month the objects were in a specific tier.

For example, say we know that for customer_id = 1, 10 GB were in the Frequent Access tier for 20 days. The price for 1 GB is $0.023 per month in the Northern Virginia Region. For the last 10 days of the month, the objects were in the Infrequent Access tier. We must calculate the portion of the month that corresponds to those 20 days only: 10 * 0.023 * 20 / (total days in the month). To obtain the number of days in a given month, we can use the following statement:

date_trunc('month', current_date) + interval '1' month - interval '1' day
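
Expressed in Python, the worked example above looks like the following minimal sketch, which uses the calendar module to obtain the number of days in the current month:

import calendar
from datetime import date

# 10 GB in the Frequent Access tier for 20 of the month's days, at $0.023 per GB-month
size_gb = 10
days_in_tier = 20
days_in_month = calendar.monthrange(date.today().year, date.today().month)[1]

frequent_access_cost = size_gb * 0.023 * days_in_tier / days_in_month
print(round(frequent_access_cost, 4))  # roughly $0.15 for a 30- or 31-day month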

Sample result breaking down cost per end-customer

Figure 8: Sample result breaking down cost per end-customer

The view we just created, as seen in Figure 8, gives us the total storage cost per customer. However, if we want to see the distribution of tiers by customer, we must modify the view statement slightly to also group the data by storage class.

create or replace view cost_per_month_and_cust_view as
select customer_id,
       ROUND
       (
           (case 
           when (size_gb/1024) < 50 then size_gb*0.023
           when (size_gb/1024) >= 50 and  (size_gb/1024) <= 450 then 50*1024*0.023+(size_gb/1024-50)*1024*0.022 
           when (size_gb/1024) > 450    then 50*1024*0.023+400*1024*0.022+  (size_gb/1024-450)*1024*0.021 
           end)*total_rec/day(date_trunc('month', current_date) + interval '1' month - interval '1' day),
       2) cost,
       storage_class
from
(
select customer_id,
       sum(size_gb) size_gb,
       count(storage_class) total_rec,
       storage_class
from s3_cost_per_tenant 
where month(date_parse(dt,'%Y-%m-%d-%h-00'))=month(current_date)
and storage_class = 'STANDARD'
group by customer_id,storage_class
)
union
select customer_id,
       ROUND(size_gb*0.0125*total_rec/day(date_trunc('month', current_date) + interval '1' month - interval '1' day),2) cost,
       storage_class
from
(
select customer_id,
       sum(size_gb) size_gb,
       count(storage_class) total_rec,
       storage_class
from s3_cost_per_tenant 
where month(date_parse(dt,'%Y-%m-%d-%h-00'))=month(current_date)
and storage_class = 'STANDARD_IA'
group by customer_id,storage_class
)
union
select customer_id,
       ROUND(size_gb*0.01*total_rec/day(date_trunc('month', current_date) + interval '1' month - interval '1' day),2) cost,
       storage_class
from
(
select customer_id,
       sum(size_gb) size_gb,
       count(storage_class) total_rec,
       storage_class
from s3_cost_per_tenant 
where month(date_parse(dt,'%Y-%m-%d-%h-00'))=month(current_date)
and storage_class = 'ONEZONE_IA'
group by customer_id,storage_class
)
union
select customer_id,
       ROUND(size_gb*0.0036*total_rec/day(date_trunc('month', current_date) + interval '1' month - interval '1' day),2) cost,
       storage_class
from
(
select customer_id,
       sum(size_gb) size_gb,
       count(storage_class) total_rec,
       storage_class
from s3_cost_per_tenant 
where month(date_parse(dt,'%Y-%m-%d-%h-00'))=month(current_date)
and storage_class = 'INTELLIGENT_TIERING'
group by customer_id,storage_class
)

In this view, we added storage_class to the GROUP BY expression, as shown in Figure 9.

Storage class cost per end-customer

Figure 9: Storage class cost per end-customer

Additional cost consideration for object metadata

When objects are stored in the S3 Glacier Flexible Retrieval or S3 Glacier Deep Archive storage classes, Amazon S3 adds metadata to each object. Part of this metadata is used to support the ListObjects S3 API.

The S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes add 40 KB of metadata to each object: 8 KB, which is charged at the S3 Standard price, and 32 KB, which is charged at the price of the respective Glacier storage class.

Your organization must decide whether to include the metadata cost. If you do, then you must modify the views we created in the preceding section. For example, to account for S3 Glacier Deep Archive storage in cost_per_month_and_cust_view, add the following section to the union:

select customer_id,
       ROUND( ( (size_gb+32/1024/1024)*0.00099+8/1024/1024*0.023 )  *total_rec/day(date_trunc('month', current_date) + interval '1' month - interval '1' day),2) cost,
       storage_class
from
(
select customer_id,
       sum(size_gb) size_gb,
       count(storage_class) total_rec,
       storage_class
from s3_cost_per_tenant 
where month(date_parse(dt,'%Y-%m-%d-%h-00'))=month(current_date)
and storage_class = 'DEEP_ARCHIVE'
group by customer_id,storage_class
)

The following expression accounts for the 32 KB of metadata charged at the S3 Glacier Deep Archive storage class price and the 8 KB of metadata charged at the S3 Standard price.

(size_gb+32/1024/1024)*0.00099+8/1024/1024*0.023

More information can be found in this documentation, specifically in the “Cost considerations” section.

Cleaning up

You must delete the inventory configuration. When you delete the configuration, note the destination S3 bucket used to store the inventory files: you must also delete the Amazon S3 Inventory reports stored there during this exercise, as shown in Figure 10.

Disable inventory

Figure 10: Disable inventory

Additionally, you must delete the AWS Lambda function and the Amazon EventBridge rule. If new IAM roles were created as part of the implementation process, then they must also be removed. Finally, you must delete the Amazon Athena database as described in the Athena documentation.
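
If you scripted the setup, you can script the teardown as well. The following is a minimal sketch of the cleanup calls; all names are placeholders matching the earlier examples, and the inventory files and Athena query results still need to be deleted from their S3 buckets:

import boto3

s3 = boto3.client("s3")
events = boto3.client("events")
lambda_client = boto3.client("lambda")
athena = boto3.client("athena")

# Remove the inventory configuration from the multi-tenant bucket (placeholder names)
s3.delete_bucket_inventory_configuration(Bucket="my-multi-tenant-bucket", Id="daily-inventory")

# Remove the EventBridge schedule and the Lambda function
events.remove_targets(Rule="daily-s3-cost-aggregation", Ids=["s3-cost-aggregation"])
events.delete_rule(Name="daily-s3-cost-aggregation")
lambda_client.delete_function(FunctionName="s3-cost-per-tenant-aggregation")

# Drop the Athena database and its tables (result location is a placeholder)
athena.start_query_execution(
    QueryString="DROP DATABASE s3_pricing CASCADE",
    ResultConfiguration={"OutputLocation": "s3://s3-blog-athena-result/"},
)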

Conclusion

In this post, we showed how you can get visibility into how much cloud storage each tenant in a multi-tenant environment is consuming using Amazon S3 Inventory, Amazon Athena, AWS Lambda, and Amazon EventBridge. These insights can help SaaS providers like Kaltura validate that their pricing plans match their business model, perform more granular analyses, and improve forecasting. Additionally, these tools enable you to automate regular updates on your storage spend.

Thank you for reading this post. If you have any comments or questions, don’t hesitate to leave them in the comments section.

Michael Shapira

Michael Shapira is a Senior Solution Architect covering general topics in AWS and part of the AWS Machine Learning community. He has 16 years’ experience in software development. He finds it fascinating to work with cloud technologies and help others on their cloud journey. Outside of work, Michael is passionate about touring and spending time at the sea.

Omer Kolodny

Omer Kolodny is a Senior Data Engineer at Kaltura. He has six years of experience in the data field, including building data infrastructure from the ground up, and extensive experience implementing data pipelines, data modeling, and analytics. He holds a BSc in Industrial Engineering and Management from Tel Aviv University.