Persistence Internals

Contributor-facing documentation for FrogDB’s persistence architecture: RocksDB topology, WAL design, key-value schema, and snapshot internals.

For operator-facing configuration and recovery procedures, see Operations: Persistence.


FrogDB uses a single shared RocksDB instance with one column family per shard:

+----------------------------------------------------------+
|                     RocksDB Instance                     |
+----------------------------------------------------------+
|  +-----------+ +-----------+ +-----------+ +---------+   |
|  | CF: s0    | | CF: s1    | | CF: s2    | | CF: sN  |   |
|  | (Shard 0) | | (Shard 1) | | (Shard 2) | |(Shard N)|   |
|  +-----------+ +-----------+ +-----------+ +---------+   |
|                                                          |
|  +----------------------------------------------------+  |
|  |                     Shared WAL                     |  |
|  +----------------------------------------------------+  |
+----------------------------------------------------------+

Benefits:

  • Single backup/restore operation for entire database
  • Shared WAL simplifies recovery
  • Atomic cross-shard operations possible via WriteBatch

Trade-off:

  • Potential lock contention on WAL writes (mitigated by batching)
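
The atomic cross-shard path can be sketched with std types only. `CrossShardBatch` and the in-memory `db` map below are illustrative stand-ins for RocksDB's `WriteBatch` and column-family handles, not FrogDB's actual API:

```rust
use std::collections::HashMap;

// Stand-in for one shard's column family contents.
type Shard = HashMap<Vec<u8>, Vec<u8>>;

struct CrossShardBatch {
    // (column family name, key, value)
    ops: Vec<(String, Vec<u8>, Vec<u8>)>,
}

impl CrossShardBatch {
    fn new() -> Self {
        Self { ops: Vec::new() }
    }

    fn put(&mut self, shard: usize, key: &[u8], value: &[u8]) {
        // Column families are named s0, s1, ... as in the diagram above.
        self.ops.push((format!("s{shard}"), key.to_vec(), value.to_vec()));
    }

    /// Apply every operation or none: validate all targets first, then
    /// mutate, so a bad shard name cannot leave a partial write behind.
    fn commit(self, db: &mut HashMap<String, Shard>) -> Result<(), String> {
        for (cf, _, _) in &self.ops {
            if !db.contains_key(cf) {
                return Err(format!("unknown column family {cf}"));
            }
        }
        for (cf, k, v) in self.ops {
            db.get_mut(&cf).unwrap().insert(k, v);
        }
        Ok(())
    }
}
```

In the real engine the all-or-nothing property comes from RocksDB committing a single WriteBatch atomically; the validate-then-mutate split here only mimics that guarantee.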

Each column family (shard) stores keys with this format:

Key Format:

[user_key_bytes]

Value Format:

+---------------------------------------------------------+
|                 Header (fixed 24 bytes)                 |
+------------------+--------------------------------------+
| type: u8         | Value type (0=String, 1=List, etc.)  |
| flags: u8        | Reserved for future use              |
| expires_at: i64  | Unix timestamp ms (0 = no expiry)    |
| lfu_counter: u8  | LFU access counter                   |
| padding: [u8; 5] | Alignment padding                    |
| value_len: u64   | Length of value data                 |
+------------------+--------------------------------------+
|                 Value Data (variable)                   |
+---------------------------------------------------------+
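
A sketch of packing and unpacking this header with std only, using the field order and sizes listed above and little-endian multi-byte integers (per the byte-order note below); `ValueHeader` and its method names are illustrative, not FrogDB's actual types:

```rust
/// On-disk value header: 24 bytes total.
/// Offsets: type=0, flags=1, expires_at=2..10, lfu_counter=10,
/// padding=11..16, value_len=16..24.
#[derive(Debug, PartialEq)]
struct ValueHeader {
    value_type: u8,  // 0 = String, 1 = List, etc.
    flags: u8,       // reserved for future use
    expires_at: i64, // Unix timestamp ms (0 = no expiry)
    lfu_counter: u8,
    value_len: u64,
}

impl ValueHeader {
    fn encode(&self) -> [u8; 24] {
        let mut buf = [0u8; 24];
        buf[0] = self.value_type;
        buf[1] = self.flags;
        buf[2..10].copy_from_slice(&self.expires_at.to_le_bytes());
        buf[10] = self.lfu_counter;
        // buf[11..16] stays zeroed: alignment padding.
        buf[16..24].copy_from_slice(&self.value_len.to_le_bytes());
        buf
    }

    fn decode(buf: &[u8; 24]) -> Self {
        Self {
            value_type: buf[0],
            flags: buf[1],
            expires_at: i64::from_le_bytes(buf[2..10].try_into().unwrap()),
            lfu_counter: buf[10],
            value_len: u64::from_le_bytes(buf[16..24].try_into().unwrap()),
        }
    }
}
```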

Type Encoding:

Type         Code  Serialization
-----------  ----  -----------------------------------------------
String       0     Raw bytes
List         1     [len:u32][elem1_len:u32][elem1]...
Set          2     [len:u32][member1_len:u32][member1]...
Hash         3     [len:u32][k1_len:u32][k1][v1_len:u32][v1]...
SortedSet    4     [len:u32][score:f64][member_len:u32][member]...
Stream       5     Full entry + consumer group state
HyperLogLog  6     Sparse/dense encoding
JSON         7     UTF-8 encoded JSON string (via serde_json)
Bloom        8     [num_bits:u64][num_hashes:u8][bits...]
TimeSeries   9     [len:u32][timestamp:i64][value:f64]...
Geo          10    Stored as SortedSet with geohash scores

Byte Order: All multi-byte integers are stored in little-endian format.
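
As a worked example of the table above, the List layout can be round-tripped like this (a std-only sketch; the real serializer may differ in error handling and naming):

```rust
/// Encode a List body: [len:u32][elem_len:u32][elem]..., little-endian.
fn encode_list(elems: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    out.extend_from_slice(&(elems.len() as u32).to_le_bytes());
    for e in elems {
        out.extend_from_slice(&(e.len() as u32).to_le_bytes());
        out.extend_from_slice(e);
    }
    out
}

/// Decode a List body; returns None if the buffer is truncated.
fn decode_list(buf: &[u8]) -> Option<Vec<Vec<u8>>> {
    let len = u32::from_le_bytes(buf.get(0..4)?.try_into().ok()?) as usize;
    let mut pos = 4;
    let mut elems = Vec::with_capacity(len);
    for _ in 0..len {
        let elen = u32::from_le_bytes(buf.get(pos..pos + 4)?.try_into().ok()?) as usize;
        pos += 4;
        elems.push(buf.get(pos..pos + elen)?.to_vec());
        pos += elen;
    }
    Some(elems)
}
```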

  • lfu_counter persisted with value
  • last_access (LRU) NOT persisted — reset to recovery time on startup

After recovery, all keys appear “fresh” for LRU purposes (idle time = 0). This matches Redis behavior. Eviction accuracy self-corrects within minutes as keys are accessed during normal operation.

Expiry Index: NOT persisted separately. Rebuilt during recovery from expires_at field in each value. Active expiry index is in-memory only.

Recovery Conversion: Unix timestamps (persisted as i64 milliseconds) are converted to std::time::Instant (monotonic clock) during recovery.
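
A minimal sketch of that conversion, assuming the recovery pass captures a single `Instant` as its reference point; `expiry_to_instant` is a hypothetical helper, not FrogDB's actual function:

```rust
use std::time::{Duration, Instant, SystemTime, UNIX_EPOCH};

/// Convert a persisted expiry (Unix ms; 0 = no expiry) into a monotonic
/// deadline relative to recovery time.
fn expiry_to_instant(expires_at_ms: i64, recovery: Instant) -> Option<Instant> {
    if expires_at_ms == 0 {
        return None; // no expiry set
    }
    let now_ms = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_millis() as i64;
    let remaining = expires_at_ms - now_ms;
    if remaining <= 0 {
        // Already expired while the server was down: deadline is "now",
        // so the key is eligible for expiry on first access.
        Some(recovery)
    } else {
        Some(recovery + Duration::from_millis(remaining as u64))
    }
}
```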


Every write operation is appended to RocksDB’s WAL before acknowledgment:

Client Write (SET key value)
        |
        v
   Shard Worker
        |
        +-- 1. Apply to in-memory store
        |
        +-- 2. Append to WAL (async batch)
        |       +-- RocksDB WriteBatch
        |
        +-- 3. Return OK to client

Failure Point                    In-Memory State  Client Response  Recovery
-------------------------------  ---------------  ---------------  ----------------------
Before in-memory apply           Unchanged        Error returned   None needed
After in-memory, WAL fails       Write visible    Error returned   May be lost on restart
After WAL, before fsync (Async)  Write visible    OK returned      May be lost on crash
After fsync (Sync)               Write visible    OK returned      Guaranteed durable

Design rationale: In-memory is the source of truth during operation (for low latency). WAL provides durability, not correctness during normal operation. This matches Redis AOF semantics. The alternative (rollback on WAL failure) would require undo logs and complex rollback logic with significant performance overhead.

FrogDB supports a configurable wal-failure-policy:

Policy    Behavior                                                 Default
--------  -------------------------------------------------------  -------
continue  Log error, return success (Redis/DragonflyDB semantics)  Yes
rollback  Undo in-memory state, return IOERR to client             No

Rollback mode details:

  • Before executing a write, affected keys’ current state is snapshotted (cheap Arc<Value> clones)
  • If WAL fails: snapshot is restored, IOERR returned
  • Single-shard write commands only; scatter-gather always uses continue
  • Performance impact: flush_async() forces synchronous disk I/O per command (~0.1-2ms vs ~1-10us)
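
The snapshot-and-restore mechanics of rollback mode can be sketched as follows; the types and `write_with_rollback` are illustrative stand-ins, but the cheap `Arc` clone mirrors the snapshot described above:

```rust
use std::collections::HashMap;
use std::sync::Arc;

type Value = Vec<u8>; // stand-in for the real value enum
type Store = HashMap<String, Arc<Value>>;

/// Snapshot the affected key (an Arc clone, not a deep copy), apply the
/// write, then restore the snapshot if the WAL append fails.
fn write_with_rollback(
    store: &mut Store,
    key: &str,
    new_value: Value,
    wal_append: impl FnOnce() -> Result<(), ()>,
) -> Result<(), &'static str> {
    let prior = store.get(key).cloned(); // cheap Arc<Value> clone
    store.insert(key.to_string(), Arc::new(new_value));
    if wal_append().is_err() {
        // Undo: restore the prior value, or remove the key if it was new.
        match prior {
            Some(v) => {
                store.insert(key.to_string(), v);
            }
            None => {
                store.remove(key);
            }
        }
        return Err("IOERR");
    }
    Ok(())
}
```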

Corruption Type      Detection                                  Default Recovery
-------------------  -----------------------------------------  --------------------------------
Truncated entry      Entry length exceeds remaining file bytes  Truncate WAL at corruption point
Checksum mismatch    CRC32 of entry data doesn't match header   Truncate WAL at corruption point
Invalid type marker  Unknown operation type byte                Truncate WAL at corruption point
Sequence gap         Expected sequence N, found N+k             Policy-dependent

Why truncation is the default: Crashes during write leave partial entries at WAL end. Truncation is safe, snapshots provide fallback, and this matches Redis AOF behavior.
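
Truncation on a partial entry can be illustrated with a simplified entry layout of `[len:u32][payload]`; checksum and type-marker checks are omitted, and `scan_wal` is a hypothetical helper, not the real recovery code:

```rust
/// Scan a WAL byte stream of [len:u32 LE][payload] entries and return
/// the offset at which to truncate when a partial entry is found.
fn scan_wal(buf: &[u8]) -> usize {
    let mut pos = 0;
    while let Some(hdr) = buf.get(pos..pos + 4) {
        let len = u32::from_le_bytes(hdr.try_into().unwrap()) as usize;
        if buf.len() - (pos + 4) < len {
            break; // entry claims more bytes than remain: truncate here
        }
        pos += 4 + len;
    }
    pos
}
```

Everything before the returned offset is a sequence of complete entries; everything after it is the crash residue that truncation discards.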


Mode              Durability                                              Latency
----------------  ------------------------------------------------------  -----------
Async             Best-effort (may lose data)                             ~1-10 us
Periodic(1000ms)  Bounded loss (~1s, matches Redis appendfsync everysec)  ~1-10 us
Sync              Guaranteed (fsync per write)                            ~100-500 us

The Periodic mode uses a wall-clock timer that fires on a fixed schedule (not reset-on-write). If previous fsync is still in progress when timer fires, that interval is skipped. This matches Redis appendfsync everysec behavior.
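
The skip-if-busy rule reduces to a single atomic claim per timer tick, sketched here with std only (`should_fsync` is a hypothetical helper, not FrogDB's actual flusher):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Decide whether this timer tick should trigger an fsync. If the
/// previous fsync is still in flight, the interval is skipped.
fn should_fsync(fsync_in_progress: &AtomicBool) -> bool {
    // compare_exchange atomically claims the "in progress" slot;
    // failure means another fsync holds it and this tick is skipped.
    fsync_in_progress
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}
```

The completing fsync clears the flag with a `store(false, ...)`, after which the next tick fires normally.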

In Async and Periodic modes, writes are visible to other clients BEFORE they are durably persisted. This is by design and matches Redis behavior.


FrogDB uses epoch-based forkless snapshots instead of Redis’s fork-based approach:

  1. Snapshot begins: current epoch is recorded
  2. All shards iterate keys, writing values from epoch start
  3. Concurrent writes during snapshot go to a COW buffer
  4. COW buffer memory is explicitly tracked in total_memory_used()

maxmemory enforcement uses total bytes (store + COW), preventing OOM during snapshots.

Eviction during snapshot:

Condition                        Behavior
-------------------------------  -------------------------------------------
Memory pressure during snapshot  Eviction proceeds normally
Key has pending COW entry        Skip; already captured for snapshot
No evictable keys remain         Abort snapshot (cow-memory-abort-threshold)
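
The COW capture and its memory accounting can be sketched as follows; the types are illustrative stand-ins, and only `total_memory_used()` is named in the text above:

```rust
use std::collections::HashMap;
use std::sync::Arc;

type Value = Vec<u8>; // stand-in for the real value enum

/// During a snapshot, the first overwrite of a key copies the old value
/// into the COW buffer; those bytes count toward total memory.
struct SnapshotCow {
    buffer: HashMap<String, Arc<Value>>,
    cow_bytes: usize,
}

impl SnapshotCow {
    fn on_overwrite(&mut self, key: &str, old: &Arc<Value>) {
        // Only the first overwrite per key during the snapshot is
        // captured; later writes already have their pre-image saved.
        if !self.buffer.contains_key(key) {
            self.cow_bytes += old.len();
            self.buffer.insert(key.to_string(), Arc::clone(old));
        }
    }

    /// maxmemory enforcement sees store bytes plus COW bytes.
    fn total_memory_used(&self, store_bytes: usize) -> usize {
        store_bytes + self.cow_bytes
    }
}
```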

Sequence numbers are assigned at WAL append time, not at command execution time. Consequences:

  • Sequences are monotonically increasing, fixing the replication ordering
  • Gaps are possible if batched writes fail partially
  • Replicas can request resumption from any sequence number

impl WalWriter {
    /// Append an operation to the WAL and return its sequence number.
    fn append(&mut self, operation: &Operation) -> Result<u64, rocksdb::Error> {
        let mut batch = WriteBatch::default();
        batch.put(/* key, value encoding */);
        self.db.write(batch)?;
        // RocksDB assigns the sequence number at write time; read it
        // back so replication sees a monotonically increasing stream.
        let seq = self.db.latest_sequence_number();
        self.replication_notify.send(ReplicationEntry {
            sequence: seq,
            operation: operation.clone(),
        });
        Ok(seq)
    }
}