Replication Internals
Contributor-facing documentation for FrogDB’s primary-replica replication system: WAL streaming internals, PSYNC protocol details, and the replication state machine.
For operator-facing setup and failover procedures, see Operations: Replication.
Architecture
Section titled “Architecture”Data Flow
Section titled “Data Flow”Primary: Write -> In-Memory -> WAL -> Replication Stream -> Replicas | WAL Archive (disk)Sync Types
Section titled “Sync Types”| Type | Trigger | Mechanism |
|---|---|---|
| FULLRESYNC | New replica, WAL gap too large | RocksDB checkpoint transfer |
| PSYNC | Reconnection within backlog | Resume from last sequence number |
Replication ID
Section titled “Replication ID”Each dataset history has a unique identifier:
- 40-character hex string (like Redis)
- Changes on failover: New primary generates new ID
- Secondary ID: Promoted replicas remember old primary’s ID
Primary A (repl_id: abc123...) | +-- Replica B (tracking abc123...) | [Primary A fails, B promoted] |Primary B (repl_id: def456..., secondary_id: abc123...)This allows replicas of A to connect to B and continue incrementally.
Replication ID Generation
Section titled “Replication ID Generation”fn generate_replication_id() -> String { let mut bytes = [0u8; 20]; getrandom::getrandom(&mut bytes).expect("random bytes"); hex::encode(bytes) // 20 random bytes -> 40 hex characters}Secondary ID Usage
Section titled “Secondary ID Usage”When a replica is promoted:
fn on_promotion(&mut self) { self.secondary_repl_id = Some(self.repl_id.clone()); self.secondary_repl_offset = self.repl_offset; self.repl_id = generate_replication_id();}Other replicas can PSYNC using either the new primary’s ID or the secondary ID.
Full Synchronization (FULLRESYNC)
Section titled “Full Synchronization (FULLRESYNC)”Primary Replica | | |<---- PSYNC ? -1 ------------------| "I have no data" | | | Create RocksDB checkpoint | | | |----- FULLRESYNC <id> <seq> ------->| | | |----- [checkpoint files] ---------->| Transfer checkpoint | | | | Load checkpoint | | |<---- PSYNC <id> <seq> ------------| Ready for incremental | | |----- [WAL stream] --------------->| Continue streamingConcurrent FULLRESYNC Requests
Section titled “Concurrent FULLRESYNC Requests”If multiple replicas request FULLRESYNC simultaneously, FrogDB creates a single RocksDB checkpoint and streams it to all requesting replicas, amortizing the cost.
Checkpoint Transfer Format
Section titled “Checkpoint Transfer Format”1. FULLRESYNC response: +FULLRESYNC <repl_id> <sequence>\r\n2. Checkpoint header: $<total_size>\r\n3. File entries (repeated): *3\r\n $<name_len>\r\n<filename>\r\n :<file_size>\r\n $<file_size>\r\n<file_data>\r\n4. End marker: *1\r\n $3\r\nEOF\r\n5. Checksum footer: $40\r\n<sha256_hex_of_all_files>\r\nCheckpoint Integrity: Primary computes SHA256 of all file contents. Replica verifies after receiving all files. Mismatch triggers retry.
PSYNC Handshake Sequence
Section titled “PSYNC Handshake Sequence”Replica Primary | | |----- AUTH [user] <password> ------>| (if primary_auth configured) |<---- +OK -------------------------| | | |----- REPLCONF listening-port 6379->| (optional, for INFO output) |<---- +OK -------------------------| | | |----- REPLCONF capa eof psync2 --->| (announce capabilities) |<---- +OK -------------------------| | | |----- PSYNC <repl_id> <seq> ------>| (initiate sync) |<---- +FULLRESYNC or +CONTINUE ----|WAL Streaming
Section titled “WAL Streaming”After initial sync, the primary continuously streams WAL entries to replicas. FrogDB uses RocksDB’s GetUpdatesSince() for incremental sync.
TCP Backpressure Design
Section titled “TCP Backpressure Design”FrogDB uses TCP backpressure (no buffer limits) to avoid the “full sync loop” problem that affects Redis:
- If a replica is slow, TCP send buffer fills naturally
- Primary blocks on write (does not buffer unboundedly)
- No risk of exhausting memory with replication buffers
- Slow replicas cause primary to slow down rather than trigger full resync
This is a deliberate departure from Redis, which uses bounded output buffers and disconnects slow replicas, sometimes causing repeated full resyncs.
Synchronous Replication
Section titled “Synchronous Replication”When min_replicas_to_write >= 1, writes wait for replica acknowledgment:
async fn execute_with_sync_replication( cmd: &ParsedCommand, handler: &dyn Command, ctx: &mut CommandContext<'_>,) -> Response { let result = handler.execute(ctx, &cmd.args); let seq = ctx.wal.append(&result.operation)?;
if ctx.config.min_replicas_to_write == 0 { return result.response; // Async mode }
let ack_future = ctx.replication_tracker.wait_for_acks( seq, ctx.config.min_replicas_to_write, );
match tokio::time::timeout(ctx.config.replica_ack_timeout, ack_future).await { Ok(Ok(_)) => result.response, Ok(Err(_)) | Err(_) => { // Write ALREADY committed on primary Response::Error(Bytes::from_static(b"NOREPL Not enough replicas")) } }}When -NOREPL is returned, the write has already succeeded on the primary and is in the WAL. The error indicates replication durability was not confirmed, not that the write failed.
Replication State Machine
Section titled “Replication State Machine”Node Roles
Section titled “Node Roles”pub enum NodeRole { Primary, Replica { primary_addr: SocketAddr }, Standalone,}Replica Command Handling
Section titled “Replica Command Handling”| Command Type | Replica Behavior |
|---|---|
Read (READONLY flag) | Execute locally, may return stale data |
Write (WRITE flag) | Return -READONLY error |
| Replication commands | Execute (PSYNC, REPLCONF, etc.) |
| Admin commands | Depends on command |
Replication Metrics
Section titled “Replication Metrics”| Metric | Description |
|---|---|
last_write_seq | Sequence number of last write |
pending_sync_writes | Writes waiting for replica ACK |
norepl_errors | Writes that failed due to replica timeout |
replica_ack_wait_time | Time spent waiting for replica ACKs |