Amazon DynamoDB on Production: FinBox’s Compilation of Lessons Learned in a Year
Guest post by Shourya Pratap Singh, Product Manager and Backend Engineer, FinBox
FinBox is a comprehensive digital lending platform with a focus on underwriting using alternative data. For one of FinBox’s products, DeviceConnect, we provide customers a credit score based on enriched mobile device data. At the time of writing this article, we were scoring close to a million customers per month and ingesting close to 80 GB of new data every day. DeviceConnect makes heavy use of Amazon DynamoDB. Here are the lessons we learned after using DynamoDB in the product for the last year.
Lesson 1: Be sure whether you need Amazon DynamoDB.
DynamoDB is pretty easy to get started with. But before you even start, make sure you have an answer to the question: “Why DynamoDB, and is it a good choice for me?” In the case of DeviceConnect, with fixed access patterns and polymorphic data, we needed a highly available NoSQL store to quickly store and retrieve enriched device data. 90% of our tech stack was already serverless (AWS Lambda), and DynamoDB integrates pretty well with it. Also, being a small team, we didn’t want to invest much in operations, so DynamoDB, being a managed service, became our choice.
Lesson 2: Know your access patterns before designing the schema.
This holds true for most NoSQL databases: you need to know your access patterns before designing your schema.
Since DynamoDB charges based on throughput (capacity or request units consumed), efficient reading becomes pretty important. By knowing the access patterns in advance, you can design schemas with appropriate partition and sort keys so that every query scans the least possible amount of data (and incurs lower bills).
For example, adding filter expressions to a query doesn’t reduce the cost. This is because the filter is applied only after all items matching the partition key and sort key range conditions have been read. Patterns like the Hierarchical Sort Key (Composite Sort Key) can be used to address this problem. As illustrated in the two diagrams below, using a composite sort key reduces the rows fetched during a query operation, shown with the rectangle.
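As a small sketch of the idea (the table layout, attribute names, and key format here are hypothetical, not FinBox’s actual schema): instead of filtering on separate location attributes after the read, the hierarchy levels can be concatenated into one composite sort key so the key condition itself narrows what gets read.

```python
# Hypothetical example of a composite (hierarchical) sort key such
# as "IN#KA#BANGALORE": a Query key condition on a prefix narrows
# the read, instead of a post-read FilterExpression.

def composite_sort_key(*parts):
    """Join hierarchy levels into a single sort key string."""
    return "#".join(parts)

def query_by_prefix(items, pk, sk_prefix):
    """Simulate a Query with begins_with(sk, prefix) on local data.
    Against DynamoDB this would be a KeyConditionExpression, e.g.
    Key('pk').eq(pk) & Key('sk').begins_with(sk_prefix)."""
    return [i for i in items if i["pk"] == pk and i["sk"].startswith(sk_prefix)]

items = [
    {"pk": "USER#1", "sk": composite_sort_key("IN", "KA", "BANGALORE")},
    {"pk": "USER#1", "sk": composite_sort_key("IN", "MH", "MUMBAI")},
    {"pk": "USER#1", "sk": composite_sort_key("US", "CA", "SF")},
]

# Only the matching hierarchy prefix is read, not the whole partition.
print(len(query_by_prefix(items, "USER#1", "IN#")))     # 2
print(len(query_by_prefix(items, "USER#1", "IN#KA#")))  # 1
```

The deeper the prefix in the key condition, the fewer items are read and billed.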
In this amazing talk at AWS re:Invent 2018, Rick Houlihan showed that knowing the access patterns beforehand, a single DynamoDB Table can handle access patterns of a multi-table relational database. The AWS Developer Guide also mentions that most well-designed applications require only one table. Some of the techniques that help us design such a table are:
· Index Overloading: Here, a single attribute pk is used to store the primary keys of all the different relational tables being modeled.
· Sparse Index: This is used when we need to access only a small subsection of a table. An example would be an alternate sort key (LSI) that is present only in a few rows.
· Adjacency List: This concept is derived from graph theory and can be used to model many-to-many relationships. Here, each relationship is represented by an item, with the partition key denoting the source node and the sort key the target node. This is shown in the diagram below, where an invoice contains multiple bills and one bill can be part of multiple invoices as well.
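A minimal sketch of the adjacency-list pattern for the invoice/bill case (the item shapes and key formats are hypothetical): each invoice-bill relationship is one item, and the inverse lookup, which in DynamoDB is typically served by a GSI keyed on the table’s sort key, is simulated here in plain Python.

```python
# Hypothetical single-table adjacency list: each invoice-bill
# relationship is one item; pk = source node, sk = target node.
edges = [
    {"pk": "INVOICE#1", "sk": "BILL#10", "amount": 100},
    {"pk": "INVOICE#1", "sk": "BILL#11", "amount": 50},
    {"pk": "INVOICE#2", "sk": "BILL#10", "amount": 100},  # same bill, second invoice
]

def bills_for_invoice(invoice_pk):
    """Query by partition key: all bills belonging to an invoice."""
    return [e["sk"] for e in edges if e["pk"] == invoice_pk]

def invoices_for_bill(bill_sk):
    """Inverse direction; in DynamoDB this is typically served by a
    GSI whose partition key is the table's sort key."""
    return [e["pk"] for e in edges if e["sk"] == bill_sk]

print(bills_for_invoice("INVOICE#1"))  # ['BILL#10', 'BILL#11']
print(invoices_for_bill("BILL#10"))    # ['INVOICE#1', 'INVOICE#2']
```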
The problem with designing a single DynamoDB table is that only the people who designed it can understand the data just by looking at it. Such a design requires proper documentation explaining the choices made for each access pattern; without it, onboarding new employees can be difficult.
Lesson 3: LSI can only be created during table creation, and GSI creation later on large tables can be expensive.
Local Secondary Indexes (LSIs) can only be defined at the time of table creation, making it pretty difficult to implement new access patterns later (also, adding an LSI restricts each item collection to 10 GB per partition key value).
Global Secondary Indexes (GSIs) have the flexibility of being added at a later time as well. But the lesson we learned is that creating a GSI on a huge table can be super expensive, so it’s better to think through access patterns beforehand.
In one of our tables with over a billion records, adding a GSI cost us close to $1,000 in a single day! This happened because of the replication needed to build the GSI. Had this amount been spread gradually over time, it wouldn’t have hurt as much as a single day’s bill. Creating a GSI later can also take a long time; in our case, it took close to 24 hours.
To reduce costs, it is always advisable to project only the required attributes when defining secondary indexes. That way, only the projected attributes are replicated instead of all of them.
To create GSIs right at the start, you need to know your access patterns in advance, so spend a good amount of time discussing and designing the schema before you begin.
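A sketch of what a cost-conscious GSI definition might look like (index name, attributes, and capacity numbers are all hypothetical), in the shape accepted by CreateTable/UpdateTable; note the INCLUDE projection from the previous paragraph and the write capacity matched to the base table, which anticipates Lesson 4 below.

```python
# Hypothetical GSI definition. An INCLUDE projection replicates
# only the listed non-key attributes instead of the whole item,
# and write capacity is set no lower than the base table's so the
# GSI cannot throttle base-table writes.
BASE_TABLE_WRITE_CAPACITY = 100  # assumed base table WCU

gsi_spec = {
    "IndexName": "status-index",            # hypothetical name
    "KeySchema": [
        {"AttributeName": "status", "KeyType": "HASH"},
        {"AttributeName": "created_at", "KeyType": "RANGE"},
    ],
    "Projection": {
        "ProjectionType": "INCLUDE",        # not ALL: replicate less data
        "NonKeyAttributes": ["customer_id"],
    },
    "ProvisionedThroughput": {
        "ReadCapacityUnits": 50,
        "WriteCapacityUnits": BASE_TABLE_WRITE_CAPACITY,
    },
}

# Sanity check: GSI writes keep pace with the base table.
assert gsi_spec["ProvisionedThroughput"]["WriteCapacityUnits"] >= BASE_TABLE_WRITE_CAPACITY
```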
Lesson 4: Provide sufficient capacity for GSI.
One of the best advantages of a GSI is that you specify separate capacity for it, unlike an LSI, whose capacity is shared with the main table. But if a GSI is given too little capacity, it can throttle your main table’s write requests!
Whenever writes are made to the main table, they are also propagated to the GSI. If the main table has a high write rate and the GSI cannot keep up (due to low provisioned capacity), writes to the main table will be throttled until the GSI catches up. The AWS Developer Guide also mentions that to avoid potential throttling, the provisioned write capacity for a GSI should be equal to or greater than the write capacity of the base table.
Lesson 5: Beware of hot partitions!
DynamoDB hashes the partition key and maps it into a keyspace in which different ranges point to different partitions. If your access patterns involve querying the same partition key again and again (a “hot” key), you may end up with hot partitions, which causes read/write requests to that partition to be throttled. DynamoDB mitigates this problem to some extent with burst capacity and instant adaptive capacity.
With burst capacity, up to five minutes of unused read/write capacity is retained and made available for later spikes. As stated in the developer guide, DynamoDB can also consume burst capacity for background maintenance and other tasks without prior notice, so you should not rely on it entirely. Provisioned capacity is initially distributed equally across partitions, and with instant adaptive capacity, unused capacity from other partitions is instantly made available to a hot partition when required.
In late 2019, AWS also announced a feature that automatically isolates frequently accessed items based on access patterns, though this doesn’t work for tables with DynamoDB Streams enabled. Even with instant adaptive capacity, a single partition can never exceed 3,000 read capacity units and 1,000 write capacity units.
To avoid hot partitions, select your partition key so that requests are distributed uniformly across partitions and a “hot” key never arises, but that’s not always possible with every access pattern. At FinBox, we had a few “bad” users who at times sent huge amounts of data so quickly that other read/write requests ended up throttled. To tackle this, we followed what Segment did, blocking or limiting requests from certain users, as described in their blog post “The million-dollar engineering problem.” It used to be a long process of requesting the hot partition keys from AWS and identifying bad users from logs, but Amazon has since released CloudWatch Contributor Insights for DynamoDB, which lets us find those keys/items ourselves.
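Another common mitigation worth knowing (this is a general technique from the AWS best-practices guide, not what FinBox used; the key format and shard count are illustrative) is write sharding: appending a bounded random suffix to a hot partition key so writes land on several partitions, at the cost of fan-out reads.

```python
import random

SHARDS = 10  # number of suffixes to spread one logical key across

def sharded_key(base_key):
    """Append a random suffix so writes for one logical key are
    distributed over several physical partitions."""
    return f"{base_key}#{random.randrange(SHARDS)}"

def all_shard_keys(base_key):
    """Reads must fan out over every suffix and merge the results."""
    return [f"{base_key}#{n}" for n in range(SHARDS)]

print(sharded_key("USER#42"))        # e.g. 'USER#42#7'
print(all_shard_keys("USER#42")[:2]) # ['USER#42#0', 'USER#42#1']
```

The trade-off is explicit: writes scale out, but every read of the logical key becomes SHARDS queries.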
Lesson 6: Use backoffs or jitters while writing or reading data
There are many reasons read/write requests can get throttled, including under-provisioned capacity and hot partitions. A general rule of thumb for handling throttles is to retry with exponential backoff and jitter when writing or reading data. Most AWS client libraries already have options for retries with exponential backoff.
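A minimal sketch of exponential backoff with “full jitter” (the scheme described on the AWS Architecture Blog); in practice the built-in retry configuration of the AWS SDKs usually covers this, so treat this as an illustration of the mechanism rather than production code.

```python
import random
import time

def backoff_with_full_jitter(attempt, base=0.05, cap=5.0):
    """Sleep time for a 0-based retry attempt: a random duration
    between 0 and min(cap, base * 2**attempt)."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def with_retries(operation, max_attempts=5):
    """Run an operation, sleeping with jittered backoff between
    failures; re-raise after the final attempt."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_with_full_jitter(attempt))
```

The jitter matters: without it, throttled clients all retry at the same instants and re-create the spike they are backing off from.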
Lesson 7: Prefer provisioned capacity over on-demand
Every DynamoDB table must be in one of two capacity modes: On-Demand or Provisioned. In On-Demand mode, you are charged per request unit, while in Provisioned mode you set up capacity limits for reads and writes (optionally with autoscaling settings) and are charged for the provisioned or scaled capacity units.
While On-Demand sounds good, it can get pretty costly. Recently, we changed some code that caused a particular table to be accessed more often, and we switched that table from Provisioned to On-Demand to observe the capacity it consumed; it cost us almost 2x! Note also that switching between On-Demand and Provisioned capacity modes is allowed only once per day. As a general rule of thumb, avoid On-Demand, since most use cases have a predictable load.
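To see why steady load favors Provisioned mode, here is a rough back-of-the-envelope comparison. The per-unit prices are illustrative (approximately the us-east-1 list prices around the time of writing; always check the current pricing page), and the workload is an assumed steady 100 writes/sec of 1 KB items.

```python
# Illustrative prices, not current quotes:
ON_DEMAND_PER_MILLION_WRITES = 1.25  # USD per million write request units
PROVISIONED_WCU_HOUR = 0.00065       # USD per WCU-hour

# Steady 100 writes/sec of 1 KB items over a 30-day month.
writes_per_month = 100 * 3600 * 24 * 30               # 259.2 million writes
on_demand_cost = writes_per_month / 1e6 * ON_DEMAND_PER_MILLION_WRITES
provisioned_cost = 100 * PROVISIONED_WCU_HOUR * 24 * 30  # 100 WCUs all month

print(round(on_demand_cost, 2))    # 324.0
print(round(provisioned_cost, 2))  # 46.8
# For fully utilized, steady traffic, On-Demand is several times
# more expensive; its value is for spiky or unpredictable load.
```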
Lesson 8: Autoscaling has its limits
While scaling up can happen any number of times in provisioned capacity mode, scaling down is capped at 27 times a day: 4 decreases in the first hour, and 1 decrease in each of the subsequent 1-hour windows. Hence, if one “bad” user triggers a scale-up, it takes a long time for capacity to come back down, and you pay more on your AWS bill while provisioned capacity stays high. Identifying “bad” users through proper monitoring was pretty much required in FinBox’s case to handle this situation.

An additional cost-saving strategy is scheduled autoscaling, where different policies apply at different times based on usage, for example, less capacity at night and more during the day for applications expected to be used mostly in the daytime.
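Scheduled scaling for DynamoDB goes through Application Auto Scaling. As a sketch, here is what the parameters for a pair of scheduled actions might look like (table name, capacities, and cron expressions are all hypothetical), in the shape accepted by `put_scheduled_action`:

```python
# Hypothetical Application Auto Scaling scheduled actions: more
# write capacity during the day, less at night.
scale_up_daytime = {
    "ServiceNamespace": "dynamodb",
    "ScheduledActionName": "device-data-scale-up",   # hypothetical
    "ResourceId": "table/device_data",               # hypothetical table
    "ScalableDimension": "dynamodb:table:WriteCapacityUnits",
    "Schedule": "cron(0 8 * * ? *)",                 # daily, 08:00 UTC
    "ScalableTargetAction": {"MinCapacity": 200, "MaxCapacity": 1000},
}

scale_down_night = {
    **scale_up_daytime,
    "ScheduledActionName": "device-data-scale-down",
    "Schedule": "cron(0 22 * * ? *)",                # daily, 22:00 UTC
    "ScalableTargetAction": {"MinCapacity": 20, "MaxCapacity": 100},
}

# Each dict would then be passed to
# boto3.client("application-autoscaling").put_scheduled_action(**params)
```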
Lesson 9: Be super clear with unit calculations and prefer eventually consistent reads.
It is super important to understand how many units each read or write request will consume, as this directly influences the bill. Strongly consistent reads consume twice the units of eventually consistent reads, so prefer eventually consistent reads unless strong consistency is specifically required. Querying, in general, is more efficient than scanning: a scan goes through the entire table (or the projected attributes, in the case of secondary indexes), so you end up paying more. A well-designed schema has partition and sort keys chosen so that you never have to scan and can always query, saving you costs. It is also worth noting that while you can request only a subset of attributes when querying, this has no impact on the item size calculation and hence on the cost.
Unit calculation can get tricky across the DynamoDB APIs. For example, BatchGetItem rounds each item up to the nearest 4 KB boundary, while Query rounds the total size of all returned items to the nearest 4 KB boundary. Say two items of 1.5 KB and 6.5 KB are fetched: BatchGetItem counts 12 KB (4 KB + 8 KB), while Query counts 8 KB. Similarly, when updating even a subset of attributes with UpdateItem, the write unit calculation is based on the larger of the complete item sizes before and after the update, irrespective of which attributes were updated.
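The rounding difference above can be sketched in a few lines (using the 4 KB read boundary and the 0.5 RCU per 4 KB rate for eventually consistent reads):

```python
import math

def batch_get_kb(item_sizes_kb):
    """BatchGetItem: each item rounds up to a 4 KB boundary."""
    return sum(math.ceil(s / 4) * 4 for s in item_sizes_kb)

def query_kb(item_sizes_kb):
    """Query: the summed item size rounds up to a 4 KB boundary."""
    return math.ceil(sum(item_sizes_kb) / 4) * 4

sizes = [1.5, 6.5]
print(batch_get_kb(sizes))  # 12  (4 KB + 8 KB)
print(query_kb(sizes))      # 8

# Eventually consistent reads cost 0.5 RCU per 4 KB:
print(batch_get_kb(sizes) / 4 * 0.5)  # 1.5 RCUs
print(query_kb(sizes) / 4 * 0.5)      # 1.0 RCU
```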
Every year, new features and improvements are released for DynamoDB, so stay up to date with the current limits and optimizations by referring to the AWS Developer Guide and the DynamoDB documentation, and use that knowledge wisely to save costs over time.
Lesson 10: Use streams and relational databases for ad-hoc queries and analytics.
Sometimes it’s necessary to run ad-hoc or analytics queries over the data. In DynamoDB this can turn pretty painful, because the keys were selected for the usual access patterns, so we often end up scanning the table for such queries and running up high bills.
At FinBox, our data science team continuously works on newer credit models, and the business development team also needs to look at analytics. To fulfill both needs, we use DynamoDB Streams to stream data, as soon as it arrives, into a relational store (Amazon RDS or Amazon Redshift) and use that as the source for such queries. This gives us far more flexibility in the queries we can run, with no capacity consumed on the main table.
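A Lambda consuming the stream receives items in DynamoDB’s typed JSON. Here is a sketch of flattening a stream record’s NewImage into a plain row before inserting it into the relational store (the attribute names are hypothetical, and boto3 also ships a `TypeDeserializer` that does this conversion for you):

```python
def from_dynamodb_json(attr):
    """Convert one DynamoDB-typed attribute value ({'S': ...},
    {'N': ...}, etc.) into a plain Python value. Handles the
    common scalar and collection type tags."""
    (tag, value), = attr.items()
    if tag == "S":
        return value
    if tag == "N":
        return float(value) if "." in value else int(value)
    if tag == "BOOL":
        return value
    if tag == "NULL":
        return None
    if tag == "L":
        return [from_dynamodb_json(v) for v in value]
    if tag == "M":
        return {k: from_dynamodb_json(v) for k, v in value.items()}
    raise ValueError(f"unhandled type tag: {tag}")

# Shape of a NewImage as delivered inside a stream event record
# (field names are hypothetical):
new_image = {
    "device_id": {"S": "abc-123"},
    "score": {"N": "712"},
    "features": {"M": {"apps": {"N": "58"}}},
}
row = {k: from_dynamodb_json(v) for k, v in new_image.items()}
print(row)  # {'device_id': 'abc-123', 'score': 712, 'features': {'apps': 58}}
```

The flattened row can then be bound directly into an INSERT against RDS or staged for a Redshift COPY.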
Lesson 11: Storage can be expensive, so use TTL on items whenever possible.
Other than throughput costs, people often ignore the storage costs involved with DynamoDB. We at FinBox learned this the hard way! As you can see in the graph, storage costs combined with the usual throughput costs can add up to quite an amount. A good strategy to address this is to use TTL (Time To Live) attributes. Automatic deletion of data on TTL expiry incurs no charges, unlike the DeleteItem operation, which is charged based on the size of the deleted item. For the same reason, if you ever have to delete all items from a DynamoDB table, delete the table itself instead of deleting items individually.
At FinBox, since we stream data to relational databases, we still have the data in relational storage even after items expire via TTL. This helps us serve requests for older data later (with higher latency) if clients ask for it. For a few of our tables, we also use streams to capture the deletes caused by TTL expiry and archive those items to S3 via a Lambda invocation.
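Using TTL amounts to a one-time table setting plus an epoch-seconds attribute on each item. A sketch (attribute and table names are hypothetical):

```python
import time

TTL_ATTRIBUTE = "expires_at"  # hypothetical attribute name

def ttl_epoch(days_from_now, now=None):
    """Epoch seconds after which DynamoDB may delete the item."""
    now = time.time() if now is None else now
    return int(now + days_from_now * 86400)

# Items are written with the TTL attribute alongside their data:
item = {"device_id": "abc-123", TTL_ATTRIBUTE: ttl_epoch(90)}

# Enabling TTL on the table itself is a one-time call, e.g. with
# boto3:
# client.update_time_to_live(
#     TableName="device_data",
#     TimeToLiveSpecification={"Enabled": True,
#                              "AttributeName": TTL_ATTRIBUTE},
# )
print(item[TTL_ATTRIBUTE] > time.time())  # True: expiry is in the future
```

Note that TTL deletion is lazy; expired items can linger for a while and should be filtered out of reads if exactness matters.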
Backup and Restore
Since we are discussing storage, let’s talk about backup and restore as well. A restore in DynamoDB only works by creating a new table; you cannot restore by overwriting the table the backup was made from. Compare this with RDS, where backup snapshots are free up to the size of your instance (and cost $0.095/GB/month after the instance is terminated): in the ap-south-1 region, DynamoDB charges $0.114/GB/month for on-demand backups, $0.228/GB/month for continuous backups, and $0.171/GB for restores.
If you are using DynamoDB Streams and streaming the data to RDS, you also have the option of skipping backups for DynamoDB (since they are costly), setting them up for RDS instead, and using those for disaster recovery. S3 can also be a good choice for backups.
Storing Large Items
Also worth pointing out: DynamoDB has a limit of 400 KB per item, which can be insufficient in some cases. Patterns for storing large items involve compressing the data or storing it in S3 and keeping only a pointer to the S3 object in DynamoDB.
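A sketch of the compression pattern using the standard library (the payload shape is invented for illustration); for items that still exceed the limit after compression, the usual fallback is storing the payload in S3 and keeping only the object key in DynamoDB.

```python
import gzip
import json

def compress_attribute(value):
    """Serialize and gzip a large attribute; the result would be
    stored as a DynamoDB Binary (B) attribute."""
    return gzip.compress(json.dumps(value).encode("utf-8"))

def decompress_attribute(blob):
    """Reverse of compress_attribute."""
    return json.loads(gzip.decompress(blob).decode("utf-8"))

# Repetitive data (like event logs) compresses very well:
payload = {"events": [{"t": i, "type": "app_open"} for i in range(1000)]}
blob = compress_attribute(payload)

print(len(json.dumps(payload)))  # raw JSON size in bytes
print(len(blob))                 # much smaller compressed size
assert decompress_attribute(blob) == payload
```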
Lesson 12: Save costs on local and dev environments
While developing features that use DynamoDB, developers will want to test their code. To reduce this cost to some extent, DynamoDB Local or mock platforms like LocalStack can be used. Also, unless you are load testing, always provision lower capacity for DynamoDB tables used solely in development environments.
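As a configuration sketch (it needs a running DynamoDB Local instance, and the table name and dummy credentials are hypothetical), pointing boto3 at the local endpoint is the only change application code needs:

```python
# Configuration sketch: point boto3 at DynamoDB Local instead of
# AWS. DynamoDB Local listens on port 8000 by default; it can be
# started via the downloadable jar or
#   docker run -p 8000:8000 amazon/dynamodb-local
import boto3

dynamodb = boto3.resource(
    "dynamodb",
    endpoint_url="http://localhost:8000",  # local endpoint, no AWS charges
    region_name="ap-south-1",              # any region string works locally
    aws_access_key_id="local",             # dummy credentials for local use
    aws_secret_access_key="local",
)
table = dynamodb.Table("device_data")      # hypothetical table name
```

Keeping the endpoint URL in configuration means the same code runs unchanged against real DynamoDB in production.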
Lesson 13: Monitor things well
Some useful metrics to monitor are:
· Difference between provisioned and consumed throughput: this can help you identify over-provisioning
· Throttled reads/writes: to identify under-provisioning or hot partitions
· System errors: these are errors thrown by AWS
· CloudWatch contributor insights: to identify hot partitions and specific keys for which throttling occurred
Lesson 14: Buy reserved capacity whenever possible
Last but not least, for predictable capacity consumption (read/write capacity units) over the year, you can also buy reserved capacity. Reserved capacity is a billing feature provided by AWS: by paying an up-front fee, you lock in significant savings (50-70% in cost) in exchange for a 1-year or 3-year contract.
To summarize the lessons: think through your access patterns and design your DynamoDB tables appropriately, choosing the right partition keys, sort keys, and secondary indexes at the start. Stay up to date with the latest documentation and monitor things well. Be super clear with the unit calculations, and go for reserved provisioned capacity over On-Demand to save costs.