| 1 | +**LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.** |
| 2 | + |
| 3 | +Authors: Sanjay Ghemawat (sanjay@google.com) and Jeff Dean (jeff@google.com) |
| 4 | + |
| 5 | +# Features |
| 6 | + * Keys and values are arbitrary byte arrays. |
| 7 | + * Data is stored sorted by key. |
| 8 | + * Callers can provide a custom comparison function to override the sort order. |
| 9 | + * The basic operations are `Put(key,value)`, `Get(key)`, `Delete(key)`. |
| 10 | + * Multiple changes can be made in one atomic batch. |
| 11 | + * Users can create a transient snapshot to get a consistent view of data. |
| 12 | + * Forward and backward iteration is supported over the data. |
| 13 | + * Data is automatically compressed using the [Snappy compression library](http://code.google.com/p/snappy). |
| 14 | + * External activity (file system operations etc.) is relayed through a virtual interface so users can customize the operating system interactions. |
| 15 | + * [Detailed documentation](http://htmlpreview.github.io/?https://github.com/google/leveldb/blob/master/doc/index.html) about how to use the library is included with the source code. |
| 16 | + |
| 17 | + |
| 18 | +# Limitations |
| 19 | + * This is not a SQL database. It does not have a relational data model, it does not support SQL queries, and it has no support for indexes. |
| 20 | + * Only a single process (possibly multi-threaded) can access a particular database at a time. |
| 21 | + * There is no client-server support built into the library. An application that needs such support will have to wrap its own server around the library. |
| 22 | + |
| 23 | +# Performance |
| 24 | + |
| 25 | +Here is a performance report (with explanations) from the run of the |
| 26 | +included db_bench program. The results are somewhat noisy, but should |
| 27 | +be enough to get a ballpark performance estimate. |
| 28 | + |
| 29 | +## Setup |
| 30 | + |
| 31 | +We use a database with a million entries. Each entry has a 16-byte |
| 32 | +key and a 100-byte value. Values used by the benchmark compress to |
| 33 | +about half their original size. |
| 34 | + |
| 35 | + LevelDB: version 1.1 |
| 36 | + Date: Sun May 1 12:11:26 2011 |
| 37 | + CPU: 4 x Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz |
| 38 | + CPUCache: 4096 KB |
| 39 | + Keys: 16 bytes each |
| 40 | + Values: 100 bytes each (50 bytes after compression) |
| 41 | + Entries: 1000000 |
| 42 | + Raw Size: 110.6 MB (estimated) |
| 43 | + File Size: 62.9 MB (estimated) |
| 44 | + |
| 45 | +## Write performance |
| 46 | + |
| 47 | +The "fill" benchmarks create a brand new database, in either |
| 48 | +sequential or random order. The "fillsync" benchmark flushes data |
| 49 | +from the operating system to the disk after every operation; the other |
| 50 | +write operations leave the data sitting in the operating system buffer |
| 51 | +cache for a while. The "overwrite" benchmark does random writes that |
| 52 | +update existing keys in the database. |
| 53 | + |
| 54 | + fillseq : 1.765 micros/op; 62.7 MB/s |
| 55 | + fillsync : 268.409 micros/op; 0.4 MB/s (10000 ops) |
| 56 | + fillrandom : 2.460 micros/op; 45.0 MB/s |
| 57 | + overwrite : 2.380 micros/op; 46.5 MB/s |
| 58 | + |
| 59 | +Each "op" above corresponds to a write of a single key/value pair. |
| 60 | +That is, the random write benchmark runs at approximately 400,000 writes per second. |
| 61 | + |
| 62 | +Each "fillsync" operation costs much less (about 0.3 milliseconds) |
| 63 | +than a disk seek (typically 10 milliseconds). We suspect that this is |
| 64 | +because the hard disk itself is buffering the update in its memory and |
| 65 | +responding before the data has been written to the platter. This may |
| 66 | +or may not be safe, depending on whether the hard disk has enough |
| 67 | +power to save its memory in the event of a power failure. |
| 68 | + |
| 69 | +## Read performance |
| 70 | + |
| 71 | +We list the performance of reading sequentially in both the forward |
| 72 | +and reverse direction, and also the performance of a random lookup. |
| 73 | +Note that the database created by the benchmark is quite small. |
| 74 | +Therefore the report characterizes the performance of leveldb when the |
| 75 | +working set fits in memory. The cost of reading a piece of data that |
| 76 | +is not present in the operating system buffer cache will be dominated |
| 77 | +by the one or two disk seeks needed to fetch the data from disk. |
| 78 | +Write performance will be mostly unaffected by whether or not the |
| 79 | +working set fits in memory. |
| 80 | + |
| 81 | + readrandom : 16.677 micros/op; (approximately 60,000 reads per second) |
| 82 | + readseq : 0.476 micros/op; 232.3 MB/s |
| 83 | + readreverse : 0.724 micros/op; 152.9 MB/s |
| 84 | + |
| 85 | +LevelDB compacts its underlying storage data in the background to |
| 86 | +improve read performance. The results listed above were done |
| 87 | +immediately after a lot of random writes. The results after |
| 88 | +compactions (which are usually triggered automatically) are better. |
| 89 | + |
| 90 | + readrandom : 11.602 micros/op; (approximately 85,000 reads per second) |
| 91 | + readseq : 0.423 micros/op; 261.8 MB/s |
| 92 | + readreverse : 0.663 micros/op; 166.9 MB/s |
| 93 | + |
| 94 | +Some of the high cost of reads comes from repeated decompression of blocks |
| 95 | +read from disk. If we supply enough cache to LevelDB so that it can hold the |
| 96 | +uncompressed blocks in memory, the read performance improves again: |
| 97 | + |
| 98 | + readrandom : 9.775 micros/op; (approximately 100,000 reads per second before compaction) |
| 99 | + readrandom : 5.215 micros/op; (approximately 190,000 reads per second after compaction) |
| 100 | + |
| 101 | +## Repository contents |
| 102 | + |
| 103 | +See doc/index.html for more explanation. See doc/impl.html for a brief overview of the implementation. |
| 104 | + |
| 105 | +The public interface is in include/*.h. Callers should not include or |
| 106 | +rely on the details of any other header files in this package. Those |
| 107 | +internal APIs may be changed without warning. |
| 108 | + |
| 109 | +Guide to header files: |
| 110 | + |
| 111 | +* **include/db.h**: Main interface to the DB. Start here. |
| 112 | + |
| 113 | +* **include/options.h**: Control over the behavior of an entire database, |
| 114 | +and also control over the behavior of individual reads and writes. |
| 115 | + |
| 116 | +* **include/comparator.h**: Abstraction for a user-specified comparison function. |
| 117 | +If you want just bytewise comparison of keys, you can use the default |
| 118 | +comparator, but clients can write their own comparator implementations if they |
| 119 | +want custom ordering (e.g. to handle different character encodings). |
| 120 | + |
| 121 | +* **include/iterator.h**: Interface for iterating over data. You can get |
| 122 | +an iterator from a DB object. |
| 123 | + |
| 124 | +* **include/write_batch.h**: Interface for atomically applying multiple |
| 125 | +updates to a database. |
| 126 | + |
| 127 | +* **include/slice.h**: A simple module for maintaining a pointer and a |
| 128 | +length into some other byte array. |
| 129 | + |
| 130 | +* **include/status.h**: Status is returned from many of the public interfaces |
| 131 | +and is used to report success and various kinds of errors. |
| 132 | + |
| 133 | +* **include/env.h**: |
| 134 | +Abstraction of the OS environment. A POSIX implementation of this interface is |
| 135 | +in util/env_posix.cc. |
| 136 | + |
| 137 | +* **include/table.h, include/table_builder.h**: Lower-level modules that most |
| 138 | +clients probably won't use directly. |
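
The throughput figures quoted in the Performance section appear to follow directly from the per-operation latencies. The sketch below shows that arithmetic. The helper names are hypothetical and are not part of LevelDB or db_bench. It also assumes that one MB means 2^20 bytes and that each operation moves 116 bytes (a 16-byte key plus a 100-byte value), which is inferred from the Setup numbers rather than stated in the report.

```cpp
#include <cassert>

// Hypothetical helpers for illustration only; not part of LevelDB or db_bench.

// Converts a per-operation latency (microseconds) to operations per second.
double OpsPerSecond(double micros_per_op) {
  assert(micros_per_op > 0.0);
  return 1e6 / micros_per_op;
}

// Converts a per-operation latency to throughput in MB/s.
// Assumption: 1 MB = 2^20 bytes, and bytes_per_op is the uncompressed
// key size plus value size (16 + 100 = 116 in the report's setup).
double MegabytesPerSecond(double micros_per_op, double bytes_per_op) {
  assert(micros_per_op > 0.0);
  return OpsPerSecond(micros_per_op) * bytes_per_op / (1024.0 * 1024.0);
}
```

Under these assumptions, `OpsPerSecond(2.460)` comes out at roughly 406,500, consistent with the "approximately 400,000 writes per second" quoted for fillrandom, and `MegabytesPerSecond(1.765, 116.0)` comes out close to the 62.7 MB/s reported for fillseq.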