1# Couchstore Database File Format
3### Compiled by Aaron Miller and Jens Alfke, with help from Damien Katz
5This is the current database file format used by Couchbase Server 2.0.
6It's similar, but not identical, to the format used by [Apache
7CouchDB][COUCHDB] 1.2. It is implemented in a Couchbase 2.0 fork of
8CouchDB (in Erlang), and also by [Couchstore][COUCHSTORE] in C.
10## How A File Is Written
12The most important thing to understand about a CouchDB file is that it
13is **append-only**.  The only way the file is modified is to append new
14data at the end; bytes written are never overwritten.  As a consequence,
15the critical file "header" data actually lives at the tail end of the
16file, since it has to be re-appended every time the file is changed.
18This has several advantages:
20* The data format is extremely robust, since the file is never in an
21  inconsistent state. Even if the process crashes or the kernel panics
22  partway through writing an update, the software can recover simply by
23  scanning back from the end of the file to the last valid header.
24* Writers don't disturb readers at all, allowing concurrent access
25  without read locks.
26* By default, earlier versions of a record remain in the file, making it
27  easy to implement a version history and multi-version concurrency
28  control (as exposed in the CouchDB API.)
30The disadvantage, of course, is that the file grows without bound as
31it's modified. To work around this, the file is periodically compacted
32by writing the live data to a new file, then replacing the old file with
33the new one.  This can be done in the background without locking out
34either readers or writers.
36## Numbers And Alignment
38 * Numbers are in big-endian byte order (most significant byte first).
39 * All values are tightly packed; there is no padding or multi-byte
40   alignment.
41 * Some values occupy partial bytes and have lengths that are not a
42   multiple of 8 bits.  These are stored in most-significant to
43   least-significant bit order. So for example, if a 1-bit field is
44   followed by a 47-bit field, the first field occupies the MSB of the
45   first byte, and the second occupies the rest of the first byte and
46   the entirety of the next five bytes.
48## File Blocks
50For purposes of locating the current file header (q.v.), the file is
51organized into 4096-byte blocks. The first byte of every block is
52reserved and identifies whether it's a data block (zero) or a header
53block (nonzero). Therefore only 4095 of the bytes are available for
54storing data.
56Above the block level, these prefix bytes are invisible and simply
57skipped. So a data value that spans a block boundary will be written out
58with a zero byte inserted at the boundary, and this byte will be removed
59when reading the value.
61## Data Chunks
63The data in the file (above the block level) is grouped into
64variable-length **chunks**.  All chunks are prefixed with their length
65as a 32-bit big endian integer, followed by the [CRC32][CRC32] checksum
66of the data. The CRC32 is *not* included in the length.
68length   | content
70 32 bits | body length
71 32 bits | CRC32 checksum of body
73followed by the body data
75## File Header
77A file header always appears on a 4096-byte block boundary, and the
78first byte of the block is nonzero to signal that it contains a header.
79A file will contain many headers as it grows, since a new updated one is
80appended whenever data is saved; the current header is the _last_ one in
81the file. So the algorithm to find the header when opening the file is
82to seek to the last block boundary, read one byte, and keep skipping
83back 4096 bytes until the byte read is nonzero.
85The file header is prefixed with a 32 bit length and a checksum,
86similarly to other data chunks, but the length field **does** include
87the length of the hash.
89Values in the body of a file header:
91length  | content
938 bits  | File format version (Currently 10)
9448 bits | Sequence number of next update.
9548 bits | Purge counter.
9648 bits | Purged documents pointer. (unused)
9716 bits | Size of by-sequence B-tree root
9816 bits | Size of by-ID B-tree root
9916 bits | Size of local documents B-tree root
101 * The B-tree roots, in the order of the sizes, are B-tree node pointers as
102   described in the "Node Pointers" section.
104## B-Tree Format
106The B-trees used in CouchDB files are a bit different than in a typical
107implementation, because the file is append-only.  A tree node is never
108updated in place; instead, a new copy of the updated node is written at
109the end of the file.  Of course this means that the node's parent also
110has to be updated to point at the updated node, so the effect is that
111every modification has to rewrite all the nodes from the affected leaf
112up to the root. In practice this isn't too expensive, especially if
113multiple writes are batched together into one update.
115CouchDB's B-trees also have to forgo the sibling-node chaining that's
116typically used to speed up sequential access. The reason is that
117updating a node would invalidate the pointers to it in its siblings,
118forcing those nodes to be updated as well, resulting in a huge cascade
119of updates. Instead, the iteration algorithm remembers the path back to
120the root and periodically backtracks to find the next sibling node.
122### Nodes On Disk
124All B-tree nodes are compressed using the [Snappy][SNAPPY] algorithm.
125The descriptions following all refer to the uncompressed form.
127 * First byte -- 1 if a leaf (key/value) node, 0 if an interior
128   (key/pointer) node
129 * A list of key-value or key-pointer pairs, each of which is:
130	* 12 bits -- Key size
131	* 28 bits -- Value size
132	* Followed by the Key data,
133	* then the Value or Pointer data
135In interior nodes the Value parts of these pairs are pointers to another
136B-tree node, where keys less than or equal to that pair's Key will be.
138In leaf nodes the values are interpreted differently by each index; see
139the Indexes section below.
141### Node Pointers
143 * 48 bits -- Node position in file
144 * 48 bits -- Sub tree size, or disk size of B-tree data below this node. For
145   pointers to leaf nodes this is the size on disk of the pointed to node,
146   otherwise it is the sum of the size of the pointed to node and all
147   the values of this field on the pointers in that node.
148 * 16 bits -- Reduce value size (**NOT** present in the root pointers
149   embedded in the file header, as it can be inferred from the lengths
150   in the header)
151 * Reduce value -- Stores statistics about the subtree beneath this
152   node. The exact data format is different in each type of B-tree, and
153   is detailed below.
155## Indexes
157A CouchDB file contains three B-trees:
159 * The **by-ID** index, which maps document IDs to values.
160 * The **by-sequence** index, which maps sequence numbers (database
161   change numbers, which increase monotonically with every change) to
162   document IDs and values.
163 * The **local-documents** index, which is conceptually the same as the
164   by-ID index except that the documents in it do not appear in the
165   by-sequence index, and by CouchDB convention the document IDs all
166   begin with the ASCII sequence `_local/`.
168### The By-ID Index
170This B-tree is an index of documents by their IDs.  The keys are simply
171the document IDs, ordered lexicographically by raw bytes (as by the
172`memcmp` function.)
174The values are:
177length  | content
17948 bits | The sequence number this document was last modified at. If another revision of this document is later saved, this sequence number should be *removed* from the by-sequence index.
18032 bits | Size of the document data
1811 bit   | Deleted flag. If this is set this document should be ignored in indexing.
18247 bits | Position of the document content on disk
1831 bit   | 1 if the value is compressed using the Snappy algorithm
1847 bits  | Content type code (q.v.)
18548 bits | Document revision number (rev\_seq)
187Followed by
189 * Revision Metadata (q.v.)
191Couchstore does not react to the content type code, document revision
192number, or revision metadata.  These values are used by CouchDB and
193Couchbase Server.
195Couchstore can optionally handle the deleted flag when asked to (users
196can read the sequence index without deleted records, or run a "purge"
197compaction that drops deleted items.)
199The reduce value in interior nodes of this B-tree is:
201length  | content
20340 bits | Number of documents that are *not* flagged as deleted.
20440 bits | Number of documents that *are* flagged as deleted
20548 bits | Total disk size of document content
207#### Content Type Code
209CouchDB uses this to identify the type of the document data.
211 *  0 -- JSON data
212 *  1 -- BLOB data (Attempted to parse as JSON but failed)
213 *  2 -- BLOB data (Parsed as syntactically valid JSON, but illegally
214	contains reserved keys) There are no longer reserved keys in
215	Couchbase, so this content-type code should never be seen. It should
216	be okay to repurpose it at some point
218 *  3 -- BLOB data (No attempt to parse as JSON)
220#### Revision Metadata
222This is identifying information about a document revision that should be
223small enough to be kept in the indexes. This value is not interpreted by
226It's currently used by Couchbase as
228length   | content
23064 bits  | Memcached CAS value (unique cookie used by memcache to allow atomic writes)
23132 bits  | Expiration time of a value
23232 bits  | User flags as specified by Memcache protocol
235### The By-Sequence Index
237The keys in this B-tree are 48-bit numbers representing the sequence
238number of a change to the database. This number is strictly increasing.
240The values are:
242length  | content
24412 bits | Size of the document ID
24528 bits | Size of the document data
2461 bit   | Deleted flag (this item in the by-sequence index represents a deletion action if this is set)
24747 bits | Position of the document content on disk
2481 bit   | 1 if the value is compressed using the Snappy algorithm
2497 bits  | Content type code
25048 bits | Document revision number (rev\_seq)
252Followed by
254 * Document ID
255 * Revision Metadata (variable size; extends to the end of the value
256   record)
258The reduce value in interior nodes of this B-tree is a 40 bit number of records
260### The Local Documents Index
262Local documents are used for file-level configuration and metadata.
264The keys in the local document b-tree are the document IDs, including
265the `_local/` prefix, and as in the ID tree sorted by `memcmp`.
267The value in the local document b-tree is the raw JSON body of the local
270[COUCHDB]: 		http://couchdb.apache.org
271[COUCHSTORE]:	https://github.com/couchbaselabs/couchstore
272[CRC32]:		https://en.wikipedia.org/wiki/Cyclic\_redundancy\_check
273[SNAPPY]:		http://code.google.com/p/snappy/