The first time you write a database from scratch, the part you don't expect to spend the most time on is the file. Not the parser, not the optimizer, not joins — the file. The shape of the bytes on disk constrains every other decision you make for the rest of the project's life. Get it wrong early and every later feature is a tax.
This post is a walkthrough of how SQLRite currently stores rows on disk. It covers the four things that, together, are the whole storage subsystem:
- The page — a fixed-size unit on disk.
- The cell — how a single row is encoded into a page.
- The B-tree — how cells across many pages are organized.
- The pager and the WAL — how changes get from RAM to disk without corrupting anything when the lights go out.
If you are coming from "I read about B-trees once" and you want to see what the implementation actually looks like in a working embedded database, this is the post for you. The user-level surface — the SQL this engine accepts, the SDKs that wrap it — lives over in the docs; we'll only refer to it where it matters.
Pages
A SQLRite database is a single file. The file is divided into contiguous 4 KiB pages, numbered starting at 1. Page 0 is a header and Page 1 is the schema page; everything else is allocated on demand to tables and indexes.
Why 4 KiB?
- It matches the page size of every common filesystem (ext4, APFS, NTFS), so a single page write is almost always a single physical write.
- It matches what
pread/pwritesyscalls cost least to issue. - It matches the sweet spot for memory-mapped page caches.
- It is exactly the size SQLite uses by default, which means comparing performance numbers later is easier.
A page in SQLRite has a header (page type, free byte count, slot directory size, and parent/right-sibling pointers if it's a B-tree node) and a body. The body grows from the bottom of the page upward as cells; a slot directory at the top of the body grows downward. This is the classic slotted page layout that most disk-resident databases settle on, for the same reason everyone settles on it: it makes variable-length rows easy to insert, delete, and reorder without rewriting the whole page.
Cells
A "cell" is what SQLRite calls one row encoded into bytes. The encoding is very simple by design:
- A small per-cell header carrying the row's encoded payload size and a key (the integer primary key for tables, or the indexed value for index pages).
- A sequence of typed values:
INTEGER,TEXT,REAL,BOOLEAN,NULL,VECTOR(N),JSON. Each is length-prefixed when it needs to be.
What we deliberately do not do (yet) is the SQLite varint trick where each integer is encoded with the smallest number of bytes that will hold it. SQLite gets a real space win out of that — most integers in a real database are small. SQLRite uses fixed-width encoding for now because the win is modest at our scale and the debugging cost of a malformed varint is not modest. We will likely add it once the page format hits v5.
A row that's larger than will fit in a page — for example, a long
TEXT blob or a 1024-dimensional VECTOR — spills into an
overflow chain. The first part stays on the page, with a pointer
to a continuation page. Continuations chain until the value ends.
This is also the SQLite playbook; it is roughly the only known good
way to keep a fixed page size while supporting arbitrarily large
values.
B-trees
Every SQLRite table is a B-tree keyed on its INTEGER PRIMARY KEY
(or an implicit row ID if the schema doesn't declare one). Every
SQLRite index is a B-tree keyed on the indexed columns. The B-trees
share the same on-disk node layout — leaf vs. internal — so the
pager doesn't need to know whether it's looking at a table page or
an index page. That's a small implementation win that pays off every
time you go to write a new feature.
Two implementation decisions are worth calling out, because they are not the obvious ones:
Bottom-up rebuild on commit
Most database tutorials describe B-tree mutations as in-place edits: insert a key, find the right leaf, split the leaf if it overflows, propagate the split up the parent chain. SQLRite does this in memory during a transaction, against a snapshot of the in-memory state — but at commit time, instead of writing back the edited tree page-by-page, the entire B-tree is rebuilt bottom-up.
That sounds expensive. It mostly isn't. The leaf pages already exist in memory; rebuilding is a serialization pass and a couple of slot directory tweaks. The reason to do it this way is that bottom-up rebuild is correct by construction. Every page that gets persisted is a fresh, valid page, with no "did we update the parent before the crash?" failure mode. It also gives us a much simpler diffing pass for the WAL (more on that next).
The cost we are paying for this simplicity is O(N) rather than O(log N) per commit on a hot table. That ceiling will eventually need to come down — probably with an in-place split path — but it is a ceiling I'd rather raise after I have the rest of the system, not before.
Slot directory, not a sorted array
A leaf page's cells are not stored in key order in the page body. They're stored in arrival order. A separate slot directory lists cell offsets in key order. This means an insert never has to memmove the page body — only update the slot directory. It also means lookups are still O(log n_cells) on the slot directory.
Index pages use the same trick. Same wins.
The pager
The pager is the thing in the middle: it sits between the in-memory B-trees and the file. Its job is to:
- Read pages from disk on demand into a small in-memory page cache.
- Track which pages are dirty (changed since the last commit).
- On commit, write only the dirty pages, in a way that survives a crash.
That last clause is what the WAL is for.
The diff is the trick
A naïve pager writes every page that changed during the transaction,
in order, and then fsyncs. That works. It is also a lot of
writes. SQLRite's pager only writes pages whose bytes actually
changed. During a write transaction, the pager hashes each
candidate dirty page against its on-disk copy. If the hash matches,
the page isn't dirty after all — somebody touched the in-memory copy
but didn't change anything that affects the persisted bytes.
This is the diff-based pager. In practice, on workloads with
read-modify-write cycles or UPDATE … WHERE queries that match
nothing, it cuts disk writes dramatically. On a clean
CREATE TABLE + INSERT path, it's a no-op — every page is
genuinely new.
The WAL
The write-ahead log is a sidecar file:
<db>.sqlrite-wal. It exists only while writes are in flight; it is
truncated on a clean checkpoint.
Each commit appends one frame per dirty page. A frame is the page's full 4 KiB body, plus a small header carrying the page number and a per-frame checksum. The very last frame in a commit carries the new page-0 contents — the database header — and is marked as a commit frame.
On crash recovery, the rules are:
- Walk the WAL forward.
- Validate each frame's checksum.
- If a frame is torn (incomplete trailing bytes), or if its checksum doesn't match, truncate the WAL at the last good frame and stop.
- For every commit frame found, replay all frames up through the commit into the page cache.
- Treat the most recent commit's page-0 as the live database header, even if the main file's page 0 is older.
The contract is: a commit either landed in the WAL completely, or it didn't. There is no half-committed transaction.
Auto-checkpointing
A WAL that grows forever is a problem. SQLRite auto-checkpoints once the WAL crosses 100 frames. A checkpoint is straightforward:
- Acquire the writer lock (shared with the rest of the engine via
fs2advisory locks). - For each dirty page in the WAL, in WAL order, write it to its home offset in the main file.
fsyncthe main file.- Truncate the WAL to zero bytes.
- Drop the lock.
Readers using shared locks can keep reading from the WAL while the checkpoint runs; they'll see the WAL's view of any pages that haven't been pushed to the main file yet.
Crash safety, in three sentences
Every page write is preceded by a WAL frame. Every commit terminates with a special frame that validates the entire batch. On reopen, the WAL is the source of truth until checkpointed.
That's it. That's the whole crash-safety story.
Putting it together
Here is what the path looks like when you do a single INSERT:
- Parse the SQL into an AST. Bind it to a typed
InsertQuery. - The executor takes the in-memory snapshot of the table (a
BTree), inserts the row into a leaf, splits if needed, walks parents up. - On
COMMIT(or auto-commit, for a bareINSERT):- The pager finds the dirty pages of the rebuilt B-tree.
- It diffs each one against its on-disk copy and discards no-op writes.
- It encodes the survivors as WAL frames, computing checksums.
- It writes all frames, then a final commit frame carrying the new page-0.
- It
fsyncs the WAL.
- The auto-checkpointer fires later if the WAL is long enough.
The contract that sits over all of this: a COMMIT returns
success only after the commit frame is durably on disk. That's the
hard part.
What I'd do differently
A few decisions I'd revisit:
- Fixed-width integer encoding. The space win from varints is real on most workloads. Worth doing in v5.
- No per-table page chains. Tables and indexes share a page pool but don't pre-allocate contiguous runs. This makes seek patterns slightly worse than they need to be on rotational media, which doesn't matter today and might matter later.
- The B-tree rebuild ceiling. Eventually this needs an in-place split path. The bottom-up rebuild is great until your hot table is several gigabytes.
But I'd keep the file format. The shape of bytes on disk is the hardest thing to change later, and the shape SQLRite settled on is boring in all the right ways.
If you want to read the actual code, the page and B-tree types live
in src/sql/pager/
and the executor's commit path lives in
src/sql/db/database.rs.
The full file format spec is at
docs/file-format.md.
The next post in the series is the most fun one to write so far: adding vector search to a SQLite-style database with HNSW. That's where embedded SQL stops looking like 1992 and starts looking like the next decade.
More on the philosophy behind these choices in the origin-story post; a head-to-head against rusqlite-bundled SQLite on storage-heavy workloads is in the benchmarks post. The transactions and persistence section of the docs covers the same ground from the user side.
If SQLRite has been useful to you, ⭐ the repo — visibility is the bottleneck.