1edf1bca6SAaron Miller# Couchstore Database File Format
2edf1bca6SAaron Miller
3edf1bca6SAaron Miller### Compiled by Aaron Miller and Jens Alfke, with help from Damien Katz
4edf1bca6SAaron Miller
5edf1bca6SAaron MillerThis is the current database file format used by Couchbase Server 2.0.
6edf1bca6SAaron MillerIt's similar, but not identical, to the format used by [Apache
7edf1bca6SAaron MillerCouchDB][COUCHDB] 1.2. It is implemented in a Couchbase 2.0 fork of
8edf1bca6SAaron MillerCouchDB (in Erlang), and also by [Couchstore][COUCHSTORE] in C.
9edf1bca6SAaron Miller
10edf1bca6SAaron Miller## How A File Is Written
11edf1bca6SAaron Miller
12edf1bca6SAaron MillerThe most important thing to understand about a CouchDB file is that it
13edf1bca6SAaron Milleris **append-only**.  The only way the file is modified is to append new
14edf1bca6SAaron Millerdata at the end; bytes written are never overwritten.  As a consequence,
15edf1bca6SAaron Millerthe critical file "header" data actually lives at the tail end of the
16edf1bca6SAaron Millerfile, since it has to be re-appended every time the file is changed.
17edf1bca6SAaron Miller
18edf1bca6SAaron MillerThis has several advantages:
19edf1bca6SAaron Miller
20edf1bca6SAaron Miller* The data format is extremely robust, since the file is never in an
21edf1bca6SAaron Miller  inconsistent state. Even if the process crashes or the kernel panics
22edf1bca6SAaron Miller  partway through writing an update, the software can recover simply by
23edf1bca6SAaron Miller  scanning back from the end of the file to the last valid header.
24edf1bca6SAaron Miller* Writers don't disturb readers at all, allowing concurrent access
25edf1bca6SAaron Miller  without read locks.
26edf1bca6SAaron Miller* By default, earlier versions of a record remain in the file, making it
27edf1bca6SAaron Miller  easy to implement a version history and multi-version concurrency
28edf1bca6SAaron Miller  control (as exposed in the CouchDB API.)
29edf1bca6SAaron Miller
30edf1bca6SAaron MillerThe disadvantage, of course, is that the file grows without bound as
31edf1bca6SAaron Millerit's modified. To work around this, the file is periodically compacted
32edf1bca6SAaron Millerby writing the live data to a new file, then replacing the old file with
33edf1bca6SAaron Millerthe new one.  This can be done in the background without locking out
34edf1bca6SAaron Millereither readers or writers.
35edf1bca6SAaron Miller
36edf1bca6SAaron Miller## Numbers And Alignment
37edf1bca6SAaron Miller
38edf1bca6SAaron Miller * Numbers are in big-endian byte order (most significant byte first).
39edf1bca6SAaron Miller * All values are tightly packed; there is no padding or multi-byte
40edf1bca6SAaron Miller   alignment.
41edf1bca6SAaron Miller * Some values occupy partial bytes and have lengths that are not a
42edf1bca6SAaron Miller   multiple of 8 bits.  These are stored in most-significant to
43edf1bca6SAaron Miller   least-significant bit order. So for example, if a 1-bit field is
44edf1bca6SAaron Miller   followed by a 47-bit field, the first field occupies the MSB of the
45edf1bca6SAaron Miller   first byte, and the second occupies the rest of the first byte and
46edf1bca6SAaron Miller   the entirety of the next five bytes.
47edf1bca6SAaron Miller
48edf1bca6SAaron Miller## File Blocks
49edf1bca6SAaron Miller
50edf1bca6SAaron MillerFor purposes of locating the current file header (q.v.), the file is
51edf1bca6SAaron Millerorganized into 4096-byte blocks. The first byte of every block is
52edf1bca6SAaron Millerreserved and identifies whether it's a data block (zero) or a header
53edf1bca6SAaron Millerblock (nonzero). Therefore only 4095 of the bytes are available for
54edf1bca6SAaron Millerstoring data.
55edf1bca6SAaron Miller
56edf1bca6SAaron MillerAbove the block level, these prefix bytes are invisible and simply
57edf1bca6SAaron Millerskipped. So a data value that spans a block boundary will be written out
58edf1bca6SAaron Millerwith a zero byte inserted at the boundary, and this byte will be removed
59edf1bca6SAaron Millerwhen reading the value.
60edf1bca6SAaron Miller
61edf1bca6SAaron Miller## Data Chunks
62edf1bca6SAaron Miller
63edf1bca6SAaron MillerThe data in the file (above the block level) is grouped into
64edf1bca6SAaron Millervariable-length **chunks**.  All chunks are prefixed with their length
65edf1bca6SAaron Milleras a 32-bit big endian integer, followed by the [CRC32][CRC32] checksum
66edf1bca6SAaron Millerof the data. The CRC32 is *not* included in the length.
67edf1bca6SAaron Miller
68edf1bca6SAaron Millerlength   | content
69edf1bca6SAaron Miller---------|--------
70edf1bca6SAaron Miller 32 bits | body length
71edf1bca6SAaron Miller 32 bits | CRC32 checksum of body
72edf1bca6SAaron Miller
73edf1bca6SAaron Millerfollowed by the body data
74edf1bca6SAaron Miller
75edf1bca6SAaron Miller## File Header
76edf1bca6SAaron Miller
77edf1bca6SAaron MillerA file header always appears on a 4096-byte block boundary, and the
78edf1bca6SAaron Millerfirst byte of the block is nonzero to signal that it contains a header.
79edf1bca6SAaron MillerA file will contain many headers as it grows, since a new updated one is
80edf1bca6SAaron Millerappended whenever data is saved; the current header is the _last_ one in
81edf1bca6SAaron Millerthe file. So the algorithm to find the header when opening the file is
82edf1bca6SAaron Millerto seek to the last block boundary, read one byte, and keep skipping
83edf1bca6SAaron Millerback 4096 bytes until the byte read is nonzero.
84edf1bca6SAaron Miller
85edf1bca6SAaron MillerThe file header is prefixed with a 32 bit length and a checksum,
86edf1bca6SAaron Millersimilarly to other data chunks, but the length field **does** include
87edf1bca6SAaron Millerthe length of the hash.
88edf1bca6SAaron Miller
89edf1bca6SAaron MillerValues in the body of a file header:
90edf1bca6SAaron Miller
91edf1bca6SAaron Millerlength  | content
92edf1bca6SAaron Miller--------|--------
93edf1bca6SAaron Miller8 bits  | File format version (Currently 10)
94edf1bca6SAaron Miller48 bits | Sequence number of next update.
95edf1bca6SAaron Miller48 bits | Purge counter.
96edf1bca6SAaron Miller48 bits | Purged documents pointer. (unused)
97edf1bca6SAaron Miller16 bits | Size of by-sequence B-tree root
98edf1bca6SAaron Miller16 bits | Size of by-ID B-tree root
99edf1bca6SAaron Miller16 bits | Size of local documents B-tree root
100edf1bca6SAaron Miller
101edf1bca6SAaron Miller * The B-tree roots, in the order of the sizes, are B-tree node pointers as
102edf1bca6SAaron Miller   described in the "Node Pointers" section.
103edf1bca6SAaron Miller
104edf1bca6SAaron Miller## B-Tree Format
105edf1bca6SAaron Miller
106edf1bca6SAaron MillerThe B-trees used in CouchDB files are a bit different than in a typical
107edf1bca6SAaron Millerimplementation, because the file is append-only.  A tree node is never
108edf1bca6SAaron Millerupdated in place; instead, a new copy of the updated node is written at
109edf1bca6SAaron Millerthe end of the file.  Of course this means that the node's parent also
110edf1bca6SAaron Millerhas to be updated to point at the updated node, so the effect is that
111edf1bca6SAaron Millerevery modification has to rewrite all the nodes from the affected leaf
112edf1bca6SAaron Millerup to the root. In practice this isn't too expensive, especially if
113edf1bca6SAaron Millermultiple writes are batched together into one update.
114edf1bca6SAaron Miller
115edf1bca6SAaron MillerCouchDB's B-trees also have to forgo the sibling-node chaining that's
116edf1bca6SAaron Millertypically used to speed up sequential access. The reason is that
117edf1bca6SAaron Millerupdating a node would invalidate the pointers to it in its siblings,
118edf1bca6SAaron Millerforcing those nodes to be updated as well, resulting in a huge cascade
119edf1bca6SAaron Millerof updates. Instead, the iteration algorithm remembers the path back to
120edf1bca6SAaron Millerthe root and periodically backtracks to find the next sibling node.
121edf1bca6SAaron Miller
122edf1bca6SAaron Miller### Nodes On Disk
123edf1bca6SAaron Miller
124edf1bca6SAaron MillerAll B-tree nodes are compressed using the [Snappy][SNAPPY] algorithm.
125edf1bca6SAaron MillerThe descriptions following all refer to the uncompressed form.
126edf1bca6SAaron Miller
127edf1bca6SAaron Miller * First byte -- 1 if a leaf (key/value) node, 0 if an interior
128edf1bca6SAaron Miller   (key/pointer) node
129edf1bca6SAaron Miller * A list of key-value or key-pointer pairs, each of which is:
130edf1bca6SAaron Miller	* 12 bits -- Key size
131edf1bca6SAaron Miller	* 28 bits -- Value size
132edf1bca6SAaron Miller	* Followed by the Key data,
133edf1bca6SAaron Miller	* then the Value or Pointer data
134edf1bca6SAaron Miller
135edf1bca6SAaron MillerIn interior nodes the Value parts of these pairs are pointers to another
136edf1bca6SAaron MillerB-tree node, where keys less than or equal to that pair's Key will be.
137edf1bca6SAaron Miller
138edf1bca6SAaron MillerIn leaf nodes the values are interpreted differently by each index; see
139edf1bca6SAaron Millerthe Indexes section below.
140edf1bca6SAaron Miller
141edf1bca6SAaron Miller### Node Pointers
142edf1bca6SAaron Miller
143edf1bca6SAaron Miller * 48 bits -- Node position in file
144edf1bca6SAaron Miller * 48 bits -- Sub tree size, or disk size of B-tree data below this node. For
145edf1bca6SAaron Miller   pointers to leaf nodes this is the size on disk of the pointed to node,
146edf1bca6SAaron Miller   otherwise it is the sum of the size of the pointed to node and all
147edf1bca6SAaron Miller   the values of this field on the pointers in that node.
148edf1bca6SAaron Miller * 16 bits -- Reduce value size (**NOT** present in the root pointers
149edf1bca6SAaron Miller   embedded in the file header, as it can be inferred from the lengths
150edf1bca6SAaron Miller   in the header)
151edf1bca6SAaron Miller * Reduce value -- Stores statistics about the subtree beneath this
152edf1bca6SAaron Miller   node. The exact data format is different in each type of B-tree, and
153edf1bca6SAaron Miller   is detailed below.
154edf1bca6SAaron Miller
155edf1bca6SAaron Miller## Indexes
156edf1bca6SAaron Miller
157edf1bca6SAaron MillerA CouchDB file contains three B-trees:
158edf1bca6SAaron Miller
159edf1bca6SAaron Miller * The **by-ID** index, which maps document IDs to values.
160edf1bca6SAaron Miller * The **by-sequence** index, which maps sequence numbers (database
161edf1bca6SAaron Miller   change numbers, which increase monotonically with every change) to
162edf1bca6SAaron Miller   document IDs and values.
163edf1bca6SAaron Miller * The **local-documents** index, which is conceptually the same as the
164edf1bca6SAaron Miller   by-ID index except that the documents in it do not appear in the
165edf1bca6SAaron Miller   by-sequence index, and by CouchDB convention the document IDs all
166edf1bca6SAaron Miller   begin with the ASCII sequence `_local/`.
167edf1bca6SAaron Miller
168edf1bca6SAaron Miller### The By-ID Index
169edf1bca6SAaron Miller
170edf1bca6SAaron MillerThis B-tree is an index of documents by their IDs.  The keys are simply
171edf1bca6SAaron Millerthe document IDs, ordered lexicographically by raw bytes (as by the
172edf1bca6SAaron Miller`memcmp` function.)
173edf1bca6SAaron Miller
174edf1bca6SAaron MillerThe values are:
175edf1bca6SAaron Miller
176edf1bca6SAaron Miller
177edf1bca6SAaron Millerlength  | content
178edf1bca6SAaron Miller--------|--------
179edf1bca6SAaron Miller48 bits | The sequence number this document was last modified at. If another revision of this document is later saved, this sequence number should be *removed* from the by-sequence index.
180edf1bca6SAaron Miller32 bits | Size of the document data
181edf1bca6SAaron Miller1 bit   | Deleted flag. If this is set this document should be ignored in indexing.
182edf1bca6SAaron Miller47 bits | Position of the document content on disk
183edf1bca6SAaron Miller1 bit   | 1 if the value is compressed using the Snappy algorithm
184edf1bca6SAaron Miller7 bits  | Content type code (q.v.)
185edf1bca6SAaron Miller48 bits | Document revision number (rev\_seq)
186edf1bca6SAaron Miller
187edf1bca6SAaron MillerFollowed by
188edf1bca6SAaron Miller
189edf1bca6SAaron Miller * Revision Metadata (q.v.)
190edf1bca6SAaron Miller
191edf1bca6SAaron MillerCouchstore does not react to the content type code, document revision
192edf1bca6SAaron Millernumber, or revision metadata.  These values are used by CouchDB and
193edf1bca6SAaron MillerCouchbase Server.
194edf1bca6SAaron Miller
195edf1bca6SAaron MillerCouchstore can optionally handle the deleted flag when asked to (users
196edf1bca6SAaron Millercan read the sequence index without deleted records, or run a "purge"
197edf1bca6SAaron Millercompaction that drops deleted items.)
198edf1bca6SAaron Miller
199edf1bca6SAaron MillerThe reduce value in interior nodes of this B-tree is:
200edf1bca6SAaron Miller
201edf1bca6SAaron Millerlength  | content
202edf1bca6SAaron Miller--------|--------
203edf1bca6SAaron Miller40 bits | Number of documents that are *not* flagged as deleted.
204edf1bca6SAaron Miller40 bits | Number of documents that *are* flagged as deleted
205edf1bca6SAaron Miller48 bits | Total disk size of document content
206edf1bca6SAaron Miller
207edf1bca6SAaron Miller#### Content Type Code
208edf1bca6SAaron Miller
209edf1bca6SAaron MillerCouchDB uses this to identify the type of the document data.
210edf1bca6SAaron Miller
211edf1bca6SAaron Miller *  0 -- JSON data
212edf1bca6SAaron Miller *  1 -- BLOB data (Attempted to parse as JSON but failed)
213edf1bca6SAaron Miller *  2 -- BLOB data (Parsed as syntactically valid JSON, but illegally
214edf1bca6SAaron Miller	contains reserved keys) There are no longer reserved keys in
215edf1bca6SAaron Miller	Couchbase, so this content-type code should never be seen. It should
216edf1bca6SAaron Miller	be okay to repurpose it at some point
217edf1bca6SAaron Miller
218edf1bca6SAaron Miller *  3 -- BLOB data (No attempt to parse as JSON)
219edf1bca6SAaron Miller
220edf1bca6SAaron Miller#### Revision Metadata
221edf1bca6SAaron Miller
222edf1bca6SAaron MillerThis is identifying information about a document revision that should be
223edf1bca6SAaron Millersmall enough to be kept in the indexes. This value is not interpreted by
224edf1bca6SAaron MillerCouchstore.
225edf1bca6SAaron Miller
226edf1bca6SAaron MillerIt's currently used by Couchbase as
227edf1bca6SAaron Miller
228edf1bca6SAaron Millerlength   | content
229edf1bca6SAaron Miller---------|--------
230edf1bca6SAaron Miller64 bits  | Memcached CAS value (unique cookie used by memcache to allow atomic writes)
231edf1bca6SAaron Miller32 bits  | Expiration time of a value
232edf1bca6SAaron Miller32 bits  | User flags as specified by Memcache protocol
233edf1bca6SAaron Miller
234edf1bca6SAaron Miller
235edf1bca6SAaron Miller### The By-Sequence Index
236edf1bca6SAaron Miller
237edf1bca6SAaron MillerThe keys in this B-tree are 48-bit numbers representing the sequence
238edf1bca6SAaron Millernumber of a change to the database. This number is strictly increasing.
239edf1bca6SAaron Miller
240edf1bca6SAaron MillerThe values are:
241edf1bca6SAaron Miller
242edf1bca6SAaron Millerlength  | content
243edf1bca6SAaron Miller--------|-------
244edf1bca6SAaron Miller12 bits | Size of the document ID
245edf1bca6SAaron Miller28 bits | Size of the document data
246edf1bca6SAaron Miller1 bit   | Deleted flag (this item in the by-sequence index represents a deletion action if this is set)
247edf1bca6SAaron Miller47 bits | Position of the document content on disk
248edf1bca6SAaron Miller1 bit   | 1 if the value is compressed using the Snappy algorithm
249edf1bca6SAaron Miller7 bits  | Content type code
250edf1bca6SAaron Miller48 bits | Document revision number (rev\_seq)
251edf1bca6SAaron Miller
252edf1bca6SAaron MillerFollowed by
253edf1bca6SAaron Miller
254edf1bca6SAaron Miller * Document ID
255edf1bca6SAaron Miller * Revision Metadata (variable size; extends to the end of the value
256edf1bca6SAaron Miller   record)
257edf1bca6SAaron Miller
258edf1bca6SAaron MillerThe reduce value in interior nodes of this B-tree is a 40 bit number of records
259edf1bca6SAaron Miller
260edf1bca6SAaron Miller### The Local Documents Index
261edf1bca6SAaron Miller
262edf1bca6SAaron MillerLocal documents are used for file-level configuration and metadata.
263edf1bca6SAaron Miller
264edf1bca6SAaron MillerThe keys in the local document b-tree are the document IDs, including
265edf1bca6SAaron Millerthe `_local/` prefix, and as in the ID tree sorted by `memcmp`.
266edf1bca6SAaron Miller
267edf1bca6SAaron MillerThe value in the local document b-tree is the raw JSON body of the local
268edf1bca6SAaron Millerdocument.
269edf1bca6SAaron Miller
270edf1bca6SAaron Miller[COUCHDB]: 		http://couchdb.apache.org
271edf1bca6SAaron Miller[COUCHSTORE]:	https://github.com/couchbaselabs/couchstore
272edf1bca6SAaron Miller[CRC32]:		https://en.wikipedia.org/wiki/Cyclic\_redundancy\_check
273edf1bca6SAaron Miller[SNAPPY]:		http://code.google.com/p/snappy/
274edf1bca6SAaron Miller