
Persistence

FrogDB persists data using RocksDB for durability. This document covers durability modes, snapshot configuration, and recovery procedures for operators.

FrogDB supports three durability modes controlling the trade-off between write performance and data safety:

| Mode | Durability | Latency | Data at Risk on Crash |
| --- | --- | --- | --- |
| async | Best-effort | ~1-10 us | All unflushed writes (unbounded) |
| periodic | Bounded loss | ~1-10 us | Up to sync-interval-ms of writes (default 1 s) |
| sync | Guaranteed | ~100-500 us | None (acknowledged = durable) |
  • async: Use for caching workloads where data loss is acceptable. Highest throughput.
  • periodic (recommended): Balanced option. Bounded loss window of 1 second by default. Matches Redis appendfsync everysec behavior.
  • sync: Use for critical data that must not be lost. Every acknowledged write is fsynced to disk before the client receives OK.
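The three modes can be sketched as a small write-ahead-log wrapper. This is a minimal illustration in Python; the `Wal` class and its behavior are assumptions for exposition, not FrogDB internals:

```python
import os
import time

class Wal:
    """Hypothetical sketch of the three durability modes described above."""

    def __init__(self, path, mode="periodic", sync_interval_ms=1000):
        self.f = open(path, "ab")
        self.mode = mode
        self.sync_interval = sync_interval_ms / 1000.0
        self.last_sync = time.monotonic()

    def _fsync(self):
        self.f.flush()
        os.fsync(self.f.fileno())
        self.last_sync = time.monotonic()

    def append(self, record: bytes):
        self.f.write(record + b"\n")
        if self.mode == "sync":
            # Acknowledged = durable: fsync before returning OK.
            self._fsync()
        elif self.mode == "periodic":
            # Bounded loss: fsync at most once per interval.
            if time.monotonic() - self.last_sync >= self.sync_interval:
                self._fsync()
        # "async": leave flushing to the OS; unbounded loss on crash.
        return "OK"
```

In sync mode the fsync sits on the hot path of every write, which is where the higher per-write latency in the table above comes from.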

In async and periodic modes, writes are visible to other clients before they are fsynced to disk. A successful GET after SET does not guarantee the value will survive a crash. This matches standard Redis behavior.

```toml
[persistence]
enabled = true
data-dir = "/var/lib/frogdb"
durability-mode = "periodic"  # async, periodic, sync
sync-interval-ms = 1000       # Fsync every 1 second (periodic mode)

[snapshot]
snapshot-interval-secs = 3600 # Snapshot every hour
max-snapshots = 5
```

FrogDB creates periodic point-in-time snapshots for faster recovery. Snapshots use a forkless algorithm that does not cause memory spikes.

  • Each shard captures a logical point-in-time view using epoch-based versioning.
  • The server continues processing commands during the snapshot.
  • Copy-on-write semantics capture old values for keys modified during the snapshot.
  • No 2x memory spike (unlike Redis fork-based snapshots).
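The mechanics above can be compressed into a toy single-shard model. `Shard`, `begin_snapshot`, and `finish_snapshot` are illustrative names, and deletions and multi-epoch bookkeeping are omitted:

```python
import threading

class Shard:
    """Toy model of an epoch-based copy-on-write snapshot for one shard."""

    def __init__(self):
        self.data = {}
        self.lock = threading.Lock()
        self.in_snapshot = False
        self.cow = {}            # pre-images of keys overwritten mid-snapshot
        self.keys_at_epoch = []

    def set(self, key, value):
        with self.lock:
            # Preserve the pre-image the first time a key changes
            # while a snapshot is in progress (copy-on-write).
            if self.in_snapshot and key not in self.cow:
                self.cow[key] = self.data.get(key)
            self.data[key] = value

    def begin_snapshot(self):
        with self.lock:
            self.in_snapshot = True
            self.cow = {}
            # Record which keys exist at the epoch: no fork, no 2x copy.
            self.keys_at_epoch = list(self.data)

    def finish_snapshot(self):
        with self.lock:
            # For each key that existed at the epoch, prefer its preserved
            # pre-image over the (possibly newer) live value.
            view = {k: self.cow[k] if k in self.cow else self.data[k]
                    for k in self.keys_at_epoch}
            self.in_snapshot = False
            self.cow = {}
        return view
```

Only keys actually overwritten between begin and finish consume extra memory, which matches the overhead table below.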
```toml
[snapshot]
snapshot-dir = "/var/lib/frogdb/snapshots"
snapshot-interval-secs = 3600 # Snapshot every hour
max-snapshots = 5             # Retain up to 5 snapshots
```

Additional memory usage during a snapshot depends on the write rate:

| Scenario | Additional Memory |
| --- | --- |
| Low write rate | Minimal (~COW buffer size) |
| High write rate, many key overwrites | Up to COW buffer per shard |
| Pathological: every key overwritten | ~dataset size (worst case) |

On startup, if data exists, FrogDB recovers state automatically:

  1. Check for snapshots and find the latest by epoch number.
  2. Load snapshot (for each shard: load key-value pairs into memory, rebuild expiry index).
  3. Replay WAL entries after the snapshot’s sequence number.
  4. Verify integrity and log recovery statistics.
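The four steps above can be expressed as a short function. This is a simplified model: the tuple shapes are assumptions, and snapshot epochs and WAL sequence numbers are treated as one counter for brevity:

```python
def recover(snapshots, wal):
    """snapshots: list of (epoch, kv-dict); wal: list of (seq, key, value)."""
    store = {}
    start_seq = 0
    if snapshots:
        # Steps 1-2: find the latest snapshot by epoch and load it.
        start_seq, kv = max(snapshots, key=lambda s: s[0])
        store.update(kv)
    # Step 3: replay only WAL entries after the snapshot's sequence number.
    replayed = 0
    for seq, key, value in wal:
        if seq > start_seq:
            store[key] = value
            replayed += 1
    # Step 4: report recovery statistics (a real server would log them).
    return store, replayed
```

A recent snapshot shrinks the replay window, which is why recovery after a clean shutdown is much faster than a full WAL replay.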
Recovery behavior by scenario:

| Scenario | Behavior |
| --- | --- |
| Clean shutdown | Load snapshot + minimal WAL replay |
| Crash | Load snapshot + full WAL replay from snapshot point |
| No snapshot | Full WAL replay from beginning |
| No data | Fresh start |
| Corrupted WAL | Recover up to corruption point, log error |

Typical recovery times by dataset size:

| Dataset Size | Approximate Recovery Time |
| --- | --- |
| 1 GB | 10-30 seconds |
| 10 GB | 1-5 minutes |
| 100 GB | 10-30 minutes |
| 1 TB | 1-3 hours |

Times depend on disk speed, data structure complexity, and available CPU.

```toml
[persistence]
# Policy when WAL corruption is detected:
#   "truncate" - Discard corrupted entry and all subsequent entries (default)
#   "fail"     - Abort startup, require manual intervention
wal-corruption-policy = "truncate"
```
  • truncate (default): Prioritizes returning to service. Operators can inspect logs to assess data loss.
  • fail: Requires manual intervention. Use for critical data where any data loss must be investigated.
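A minimal model of the two policies, assuming each WAL record carries a CRC32 checksum. The record layout here is illustrative, not FrogDB's on-disk format:

```python
import binascii

def scan_wal(records, policy="truncate"):
    """records: list of (crc32, payload) pairs read from the log."""
    recovered = []
    for crc, payload in records:
        if binascii.crc32(payload) != crc:
            if policy == "fail":
                # Abort startup; an operator must inspect and truncate.
                raise IOError("WAL corruption detected")
            # "truncate": discard this entry and everything after it.
            break
        recovered.append(payload)
    return recovered
```

Note that truncate drops not just the corrupted entry but every later entry too, since ordering guarantees would otherwise be violated.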

Manual recovery when fail policy triggers startup abort:

```sh
# 1. Inspect WAL state
frogctl debug wal-inspect --data-dir /var/lib/frogdb/data/

# 2. Force truncation if acceptable
frogctl debug wal-truncate --data-dir /var/lib/frogdb/data/ --at-sequence <seq>

# 3. Restart server
systemctl start frogdb
```

The wal-failure-policy setting controls what happens when a WAL write fails after a command has already executed in memory:

| Policy | Behavior |
| --- | --- |
| continue (default) | Log error, return success to client. Write is visible but may be lost on restart. |
| rollback | Undo in-memory state, return IOERR to client. Adds latency due to synchronous disk I/O. |

```toml
[persistence]
wal-failure-policy = "continue" # or "rollback"
```

Runtime toggle: `CONFIG SET wal-failure-policy rollback`
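The difference between the two policies can be sketched as follows; `execute` and the stubbed WAL object are hypothetical names, not FrogDB's API:

```python
def execute(store, wal, key, value, policy="continue"):
    """Apply a SET in memory, then append to the WAL; on WAL failure,
    either acknowledge anyway ("continue") or roll back ("rollback")."""
    missing = key not in store
    old = store.get(key)
    store[key] = value               # command executes in memory first
    try:
        wal.append((key, value))
    except IOError:
        if policy == "rollback":
            # Undo the in-memory effect and surface the error.
            if missing:
                del store[key]
            else:
                store[key] = old
            return "IOERR"
        # "continue": log the error; the write stays visible
        # but may be lost on restart.
    return "OK"
```

The rollback path is what adds latency: the server must learn the fate of the disk write before it can answer the client.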


FrogDB stores three kinds of files in its data directory:

  • WAL files: Append-only log of all writes.
  • SST files: RocksDB sorted string table files (compacted data).
  • Snapshots: Point-in-time copies of the full dataset.

WAL files are retained to support replica reconnection via partial sync (PSYNC):

```toml
[rocksdb]
min-wal-retention-secs = 3600 # Keep WAL files for at least 1 hour
min-wal-files-to-keep = 10
```

Larger retention values allow replicas to recover from longer disconnections without requiring a full resync.
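The retention rule can be modeled as: a WAL file is deletable only when it is older than min-wal-retention-secs and more than min-wal-files-to-keep files would still remain. `prunable` is an illustrative helper under those assumptions, not a frogctl command:

```python
def prunable(wal_files, now, min_retention_secs=3600, min_files_to_keep=10):
    """wal_files: list of (mtime, name) pairs sorted oldest first."""
    deletable = []
    remaining = len(wal_files)
    for mtime, name in wal_files:
        # Stop as soon as deleting would violate either floor.
        if remaining <= min_files_to_keep:
            break
        if now - mtime < min_retention_secs:
            break
        deletable.append(name)
        remaining -= 1
    return deletable
```

Because both conditions must hold, raising either setting only ever keeps more WAL history around for replicas.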

Key metrics to watch:

| Metric | Description |
| --- | --- |
| frogdb_wal_bytes_total | WAL bytes written |
| frogdb_snapshot_size_bytes | Last snapshot size |
| frogdb_persistence_errors_total | Persistence errors (disk full, I/O) |

Alert on disk usage approaching capacity. WAL write failures in sync mode return errors to clients; in async/periodic modes, data may be silently lost.


See Metrics Reference for the full list. Key persistence metrics:

| Metric | Type | Description |
| --- | --- | --- |
| frogdb_wal_writes_total | Counter | WAL writes |
| frogdb_wal_bytes_total | Counter | WAL bytes written |
| frogdb_wal_flush_duration_seconds | Histogram | WAL flush latency |
| frogdb_wal_durability_lag_ms | Gauge | Durability lag |
| frogdb_persistence_errors_total | Counter | Persistence errors |
| frogdb_snapshot_in_progress | Gauge | 1 if snapshot running |
| frogdb_snapshot_duration_seconds | Histogram | Snapshot duration |
| frogdb_snapshot_size_bytes | Gauge | Last snapshot size |