Amazon DevCon – Margo Seltzer

Margo Seltzer is CTO of SleepyCat

Sleepycat is a successful model of how to take OSS and make it available for both commercial and non-commercial use.

Survey, who uses BDB, who knows about it, who has used it (hands all around).

First half of talk is intro to BDB, is new goodies and product direction. Second is info based on feedback she’s received before the talk.

Introduction

Keyt tenets are enterprise data management, fast in-process model, bytes to petabytes, 8+ years in production, code 12 or 13 years old. Highly concurrent, transactions, highly available. All linked in to application, available in your favorite language (C, C++, Java, Perl. Python, PHP, Ruby, Tcl, Eiffel, etc). Ported to over 50 OS’s.

1/2 MB process footprint, all administration is programatic. Open source, dual license, complete source, download from sleepycat.com . Distribute proprietary app, get license to do that. That’s their main revenue model. Support and service also available.

Not relational, not OO. No SQL, no JDBC, ODBC. No client server. Keys and data, stored very efficiently. New XML product, some level of schema. No translation layer needed. Store objects, rows, whatever.

Key features: Indexed or sequential retrieval. Cached data for high perf, locking for concurrent access, ACID transactions, system and application failure recovery, HA product for single master, multi-slave replication.

DS – Core product, like BDB 1.85, fast indexed data lookup, no concurrency control.

CDS – Concurrent data store. API locking, moderate levels of concurrency.

TDS – Transactional, ACID.

HA – High availability.

Native C API, C++ and Java (using JNI) above it. Schema layer above C++ for XML. Lookup via XPath and XQuery.

Also did Java Edition as native Java version, sans JNI. Core code of DS is C, concepts don’t map 100% to Java. Ported functionality, but designed to work with JVM. Log-structured storage. Common Java collection API across C and Java products.

One good thing about working on this is that she can actually explain to people what they do, e.g. that they are behind Amazon’s personalized pages.

How are others using it? Backing store for internet services. LDAP, Sun ONE, Publish/subscribe store, Google accounts. Telecommunicatons, Cisco cable provisioning, DHCP state management, resident on Nokia cell phones. EMC, Panasas, HP, Veritas for storage metadata.

On to the technical details.

Vocabulary: environment (bunch of related DBs and the BDB infrastructure). Share common locking infrastructure and logging. Unitof recoverability.

No partitioning yet, always one file per database,must partition key space to do this. DB is collection of key-data pairs. Think of a DB as a relation. Open DB, get back a DB handle.

Key-data pairs are pairs of byte strings. Pointer and length. No interpretation done. Secondary indexes are done by user-supplied functions.

Storage models: BTree, Hash (no sorting, good for stuff with no order or locality), RecNo (just a logical record number), Queue, fixed size records, record-level locking.

Cursors, position in the DB. Next, Prev, next non-duplicate, next duplicate only. Safe across calls to DB, position maintained. Generally short-lived.

Transactions and recoverability. Applications care about atomicity, concurrency, recoverability. Apps locking multiple objects can deadlock, must handle at application level.

Durability of ACID can be configured separately, 3 levels of isolation. Performance vs. isolation tradeoffs. Can sometimes afford to lose updates if using HA, then rely on replication to sort it all out. Network faster than disk, so this works well. Unrepeatable reads good for rapid scans across data, e.g. compute an average. Each level specified at transaction level.

Q: Without key range level locking, how do you get isolation level 3.

A: With page locking.

Theme through product: application specifies policy (reminiscent of X11). Logging, blocking, page sites, dupe handling, cache sizes, transactions. Default parameters work well if you don’t want to fiddle with all of this.

Administrative programs, all functionality also exists as library calls. db_archive, db_checkpoint, db_dump, db_verify, db_salvage, etc.

Q: secondary index creation and callbacks, how does it work.

A: DB sees key and opaque block of data. App knows how it is delimited and structured, so DB calls app and asks where the secondary index is. Apparently done at every access to the data.

More Q+A about very detailed aspects of locks while iterating through large key sets.

Q: Cache management, limiting max size of cache?

A: Use BDB in front of something like Oracle, how do you choose which entry to eject. Suggest building a secondary index that’s age or timestamp, essentially this becomes an LRU queue. Maybe evict in background. Basically roll your own algorithm.

Q: Do you use Lehman-Yao lock coupling?
A: This anticipates the next page, which could be wrong. But it really complicates splits when this happens. Design tradeoff.

Product Roadmap

End of 2004 really busy, 3 releases in 4 weeks. BDB 4.4 (Nov 2004), Paxos compliance in replication election, 64-bit support for RPCGEN, bug fixes, perf. Perf and features are driven by customers. DB/XML 1.2, (Dec 2004). Added XQuery support to existing XPath. Also added PHP API, multi-granularity storage. 10x speedup on large XML documents (def of large based on how much of doc gets modified) , ability to stream documents from a URI. Java 1.7 added dynamic cache tuning, memory management, bettter performance.

Q: limits on size of key-data.

A: Need to be able to fit 2 keys in memory, that’s it. Big key defined as taking up more than 1/4 of a page, due to the way that the B-Tree pushes the key off to a page of its own. Expectation is that reasonable keys will fit 8,16, or even 32 to a page.

Q: What about, say, a 10 MB object and 64KB pages, any guarantee of contiguous storage?

A: No, will be stored as linked list. New stuff done only on customer demand, and a customer could do it on their own. Market-based economy to make clear what features are important.

Q: Compat between DB 1.85 and DB 4.0?

A: Dump 1.85 to file, import into 4.0. Any of the 2.0 and above formats have upgrade utilities.

Q: Memory management in Java, how much due to garbage collector?

A: None, we put our objects on diet and shrunk down the size of what is stored in RAM for a key-data pair.

Q: Is JE a JNI-powered wrapper?

A: No, it is Java, there’s a separate JNI-based API on our original C API

Q: How much perf impact seen based on choice of garbage collector?

A: There’s no global answer, it is totally dependent on workload.

Coming in 2005

BDB 4.4, more replication. Bulk log transmission, P2P log transmission, between replicas, to unburden the master. Process registration, with automatic transaction cleanup after process exit. Online B-tree compaction, B-tree bulk load.Hot backup utility, Windows installer, C# and .Net support. Better getting started guides across all products.

(10 minutes left in talk; 27 on battery.This is going to be interesting).

JE 2.0, adding JTA, JCA, JMX support. Degree 2 isolation, sequence numbers, performance enhancements, better low-memory stability.

Let’s talk about replication. 0 down time is the goal, multiple sites have copies of the data. Write to master, propagate to replicas. App chooses communication infrastructure (send and receive functions), degree of synchronization (on master, on some replicas, on all replicas), transactional guarantees. Master election, replicas hold election, master rejoins group after rebooting. If replica fails, rejoins group after rebooting.

Last few minutes lost to dead battery; interesting information about how to optimize.

AWS News Blog

Amazon DevCon – Margo Seltzer

Resources

Follow