1# Couchstore Database File Format
2
3### Compiled by Aaron Miller and Jens Alfke, with help from Damien Katz
4
5This is the current database file format used by Couchbase Server 2.0.
6It's similar, but not identical, to the format used by [Apache
7CouchDB][COUCHDB] 1.2. It is implemented in a Couchbase 2.0 fork of
8CouchDB (in Erlang), and also by [Couchstore][COUCHSTORE] in C.
9
10## How A File Is Written
11
12The most important thing to understand about a CouchDB file is that it
13is **append-only**.  The only way the file is modified is to append new
14data at the end; bytes written are never overwritten.  As a consequence,
15the critical file "header" data actually lives at the tail end of the
16file, since it has to be re-appended every time the file is changed.
17
18This has several advantages:
19
20* The data format is extremely robust, since the file is never in an
21  inconsistent state. Even if the process crashes or the kernel panics
22  partway through writing an update, the software can recover simply by
23  scanning back from the end of the file to the last valid header.
24* Writers don't disturb readers at all, allowing concurrent access
25  without read locks.
26* By default, earlier versions of a record remain in the file, making it
27  easy to implement a version history and multi-version concurrency
28  control (as exposed in the CouchDB API.)
29
30The disadvantage, of course, is that the file grows without bound as
31it's modified. To work around this, the file is periodically compacted
32by writing the live data to a new file, then replacing the old file with
33the new one.  This can be done in the background without locking out
34either readers or writers.
35
36## Numbers And Alignment
37
38 * Numbers are in big-endian byte order (most significant byte first).
39 * All values are tightly packed; there is no padding or multi-byte
40   alignment.
41 * Some values occupy partial bytes and have lengths that are not a
42   multiple of 8 bits.  These are stored in most-significant to
43   least-significant bit order. So for example, if a 1-bit field is
44   followed by a 47-bit field, the first field occupies the MSB of the
45   first byte, and the second occupies the rest of the first byte and
46   the entirety of the next five bytes.
47
48## File Blocks
49
50For purposes of locating the current file header (q.v.), the file is
51organized into 4096-byte blocks. The first byte of every block is
52reserved and identifies whether it's a data block (zero) or a header
53block (nonzero). Therefore only 4095 of the bytes are available for
54storing data.
55
56Above the block level, these prefix bytes are invisible and simply
57skipped. So a data value that spans a block boundary will be written out
58with a zero byte inserted at the boundary, and this byte will be removed
59when reading the value.
60
61## Data Chunks
62
63The data in the file (above the block level) is grouped into
64variable-length **chunks**.  All chunks are prefixed with their length
65as a 32-bit big endian integer, followed by the [CRC32][CRC32] checksum
66of the data. The CRC32 is *not* included in the length.
67
68length   | content
69---------|--------
70 32 bits | body length
71 32 bits | CRC32 checksum of body
72
73followed by the body data
74
75## File Header
76
77A file header always appears on a 4096-byte block boundary, and the
78first byte of the block is nonzero to signal that it contains a header.
79A file will contain many headers as it grows, since a new updated one is
80appended whenever data is saved; the current header is the _last_ one in
81the file. So the algorithm to find the header when opening the file is
82to seek to the last block boundary, read one byte, and keep skipping
83back 4096 bytes until the byte read is nonzero.
84
85The file header is prefixed with a 32 bit length and a checksum,
86similarly to other data chunks, but the length field **does** include
87the length of the hash.
88
89Values in the body of a file header:
90
91length  | content
92--------|--------
938 bits  | File format version (Currently 10)
9448 bits | Sequence number of next update.
9548 bits | Purge counter.
9648 bits | Purged documents pointer. (unused)
9716 bits | Size of by-sequence B-tree root
9816 bits | Size of by-ID B-tree root
9916 bits | Size of local documents B-tree root
100
101 * The B-tree roots, in the order of the sizes, are B-tree node pointers as
102   described in the "Node Pointers" section.
103
104## B-Tree Format
105
106The B-trees used in CouchDB files are a bit different than in a typical
107implementation, because the file is append-only.  A tree node is never
108updated in place; instead, a new copy of the updated node is written at
109the end of the file.  Of course this means that the node's parent also
110has to be updated to point at the updated node, so the effect is that
111every modification has to rewrite all the nodes from the affected leaf
112up to the root. In practice this isn't too expensive, especially if
113multiple writes are batched together into one update.
114
115CouchDB's B-trees also have to forgo the sibling-node chaining that's
116typically used to speed up sequential access. The reason is that
117updating a node would invalidate the pointers to it in its siblings,
118forcing those nodes to be updated as well, resulting in a huge cascade
119of updates. Instead, the iteration algorithm remembers the path back to
120the root and periodically backtracks to find the next sibling node.
121
122### Nodes On Disk
123
124All B-tree nodes are compressed using the [Snappy][SNAPPY] algorithm.
125The descriptions following all refer to the uncompressed form.
126
127 * First byte -- 1 if a leaf (key/value) node, 0 if an interior
128   (key/pointer) node
129 * A list of key-value or key-pointer pairs, each of which is:
130	* 12 bits -- Key size
131	* 28 bits -- Value size
132	* Followed by the Key data,
133	* then the Value or Pointer data
134
135In interior nodes the Value parts of these pairs are pointers to another
136B-tree node, where keys less than or equal to that pair's Key will be.
137
138In leaf nodes the values are interpreted differently by each index; see
139the Indexes section below.
140
141### Node Pointers
142
143 * 48 bits -- Node position in file
144 * 48 bits -- Sub tree size, or disk size of B-tree data below this node. For
145   pointers to leaf nodes this is the size on disk of the pointed to node,
146   otherwise it is the sum of the size of the pointed to node and all
147   the values of this field on the pointers in that node.
148 * 16 bits -- Reduce value size (**NOT** present in the root pointers
149   embedded in the file header, as it can be inferred from the lengths
150   in the header)
151 * Reduce value -- Stores statistics about the subtree beneath this
152   node. The exact data format is different in each type of B-tree, and
153   is detailed below.
154
155## Indexes
156
157A CouchDB file contains three B-trees:
158
159 * The **by-ID** index, which maps document IDs to values.
160 * The **by-sequence** index, which maps sequence numbers (database
161   change numbers, which increase monotonically with every change) to
162   document IDs and values.
163 * The **local-documents** index, which is conceptually the same as the
164   by-ID index except that the documents in it do not appear in the
165   by-sequence index, and by CouchDB convention the document IDs all
166   begin with the ASCII sequence `_local/`.
167
168### The By-ID Index
169
170This B-tree is an index of documents by their IDs.  The keys are simply
171the document IDs, ordered lexicographically by raw bytes (as by the
172`memcmp` function.)
173
174The values are:
175
176
177length  | content
178--------|--------
17948 bits | The sequence number this document was last modified at. If another revision of this document is later saved, this sequence number should be *removed* from the by-sequence index.
18032 bits | Size of the document data
1811 bit   | Deleted flag. If this is set this document should be ignored in indexing.
18247 bits | Position of the document content on disk
1831 bit   | 1 if the value is compressed using the Snappy algorithm
1847 bits  | Content type code (q.v.)
18548 bits | Document revision number (rev\_seq)
186
187Followed by
188
189 * Revision Metadata (q.v.)
190
191Couchstore does not react to the content type code, document revision
192number, or revision metadata.  These values are used by CouchDB and
193Couchbase Server.
194
195Couchstore can optionally handle the deleted flag when asked to (users
196can read the sequence index without deleted records, or run a "purge"
197compaction that drops deleted items.)
198
199The reduce value in interior nodes of this B-tree is:
200
201length  | content
202--------|--------
20340 bits | Number of documents that are *not* flagged as deleted.
20440 bits | Number of documents that *are* flagged as deleted
20548 bits | Total disk size of document content
206
207#### Content Type Code
208
209CouchDB uses this to identify the type of the document data.
210
211 *  0 -- JSON data
212 *  1 -- BLOB data (Attempted to parse as JSON but failed)
213 *  2 -- BLOB data (Parsed as syntactically valid JSON, but illegally
214	contains reserved keys) There are no longer reserved keys in
215	Couchbase, so this content-type code should never be seen. It should
216	be okay to repurpose it at some point
217
218 *  3 -- BLOB data (No attempt to parse as JSON)
219
220#### Revision Metadata
221
222This is identifying information about a document revision that should be
223small enough to be kept in the indexes. This value is not interpreted by
224Couchstore.
225
226It's currently used by Couchbase as
227
228length   | content
229---------|--------
23064 bits  | Memcached CAS value (unique cookie used by memcache to allow atomic writes)
23132 bits  | Expiration time of a value
23232 bits  | User flags as specified by Memcache protocol
233
234
235### The By-Sequence Index
236
237The keys in this B-tree are 48-bit numbers representing the sequence
238number of a change to the database. This number is strictly increasing.
239
240The values are:
241
242length  | content
243--------|-------
24412 bits | Size of the document ID
24528 bits | Size of the document data
2461 bit   | Deleted flag (this item in the by-sequence index represents a deletion action if this is set)
24747 bits | Position of the document content on disk
2481 bit   | 1 if the value is compressed using the Snappy algorithm
2497 bits  | Content type code
25048 bits | Document revision number (rev\_seq)
251
252Followed by
253
254 * Document ID
255 * Revision Metadata (variable size; extends to the end of the value
256   record)
257
258The reduce value in interior nodes of this B-tree is a 40 bit number of records
259
260### The Local Documents Index
261
262Local documents are used for file-level configuration and metadata.
263
264The keys in the local document b-tree are the document IDs, including
265the `_local/` prefix, and as in the ID tree sorted by `memcmp`.
266
267The value in the local document b-tree is the raw JSON body of the local
268document.
269
270[COUCHDB]: 		http://couchdb.apache.org
271[COUCHSTORE]:	https://github.com/couchbaselabs/couchstore
272[CRC32]:		https://en.wikipedia.org/wiki/Cyclic\_redundancy\_check
273[SNAPPY]:		http://code.google.com/p/snappy/
274
275