
Clustering

FrogDB supports slot-based sharding across multiple nodes for horizontal scaling.

  • 16384 hash slots for Redis Cluster client compatibility.
  • Orchestrated control plane (external orchestrator, not gossip).
  • Full dataset replication — replicas copy all data from their primary.

| Term | Definition |
| --- | --- |
| Node | A FrogDB server instance |
| Internal Shard | Thread-per-core partition within a node |
| Slot | Hash slot 0-16383 (the cluster distribution unit) |
| Primary | Node owning slots for writes |
| Replica | Node replicating from a primary |
| Orchestrator | External service managing cluster topology |

FrogDB uses 16384 hash slots:

slot = CRC16(key) % 16384

Slots are assigned to nodes. Each node owns one or more contiguous slot ranges.
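
The slot computation can be sketched in Python. This assumes the Redis-compatible CRC16-CCITT (XModem) variant, which FrogDB's stated Redis Cluster client compatibility implies:

```python
def crc16(data: bytes) -> int:
    """CRC16-CCITT (XModem): polynomial 0x1021, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc

def key_slot(key: bytes) -> int:
    """Map a key to one of the 16384 cluster hash slots."""
    return crc16(key) % 16384
```

With this variant, `key_slot(b"foo")` returns 12182, matching what Redis Cluster reports for the same key.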

Keys containing {tag} use only the tag content for hashing, ensuring colocation:

# These keys hash to the same slot
SET {user:123}:profile "..."
SET {user:123}:settings "..."
# This MGET is guaranteed to be same-slot (fast, atomic)
MGET {user:123}:profile {user:123}:settings
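
The tag-extraction rule can be sketched as follows; this assumes the Redis Cluster convention (the substring between the first `{` and the first `}` after it, used only when non-empty):

```python
def hash_tag(key: bytes) -> bytes:
    """Return the bytes actually hashed for slot assignment."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end > start + 1:  # tag must be non-empty
            return key[start + 1:end]
    return key  # no usable tag: hash the whole key
```

Both example keys reduce to the same tag bytes (`user:123`), so they necessarily hash to the same slot.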

FrogDB has two levels of sharding:

  1. Cluster routing: slot = CRC16(key) % 16384 determines which node.
  2. Internal routing: internal_shard = hash(key) % N determines which thread within a node.

Hash tags guarantee colocation at both levels.
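
Because both routing levels hash the same tag bytes, tagged keys also land on the same thread within the node. A sketch of the internal level, using zlib.crc32 as a stand-in for FrogDB's internal hash function, which is not specified here:

```python
import zlib

def routing_bytes(key: bytes) -> bytes:
    """Hash-tag rule: use the content of the first {...} when non-empty."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end != -1 and end > start + 1:
            return key[start + 1:end]
    return key

def internal_shard(key: bytes, num_shards: int) -> int:
    # Stand-in hash; the real internal hash function is not documented here.
    return zlib.crc32(routing_bytes(key)) % num_shards
```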


FrogDB uses an external orchestrator rather than Redis-style gossip:

| Aspect | Orchestrated (FrogDB) | Gossip (Redis) |
| --- | --- | --- |
| Topology source | External orchestrator | Node consensus |
| Failure detection | Orchestrator monitors | Nodes vote on failures |
| Topology changes | Deterministic, immediate | Eventually consistent |
| Debugging | Explicit state | Derived from gossip |

The orchestrator is the control plane for cluster operations. Always deploy 3+ orchestrator instances in production.

If all orchestrators are unavailable:

  • No automatic failover can occur.
  • No slot migrations can happen.
  • No new nodes can join.
  • Data plane continues — existing topology serves traffic normally.

Recommended orchestrator backends: a Kubernetes operator with etcd, Consul, or a custom Raft-based service.

The orchestrator pushes topology as JSON to each node’s admin API:

{
  "cluster_id": "frogdb-prod-1",
  "epoch": 42,
  "nodes": [
    {
      "id": "node-abc123",
      "address": "10.0.0.1:6379",
      "admin_address": "10.0.0.1:6380",
      "role": "primary",
      "slots": [{"start": 0, "end": 5460}]
    },
    {
      "id": "node-def456",
      "address": "10.0.0.2:6379",
      "admin_address": "10.0.0.2:6380",
      "role": "primary",
      "slots": [{"start": 5461, "end": 10922}]
    },
    {
      "id": "node-ghi789",
      "address": "10.0.0.3:6379",
      "admin_address": "10.0.0.3:6380",
      "role": "replica",
      "replicates": "node-abc123"
    }
  ]
}
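
A node (or operator tooling) can sanity-check a pushed topology before acting on it. A minimal sketch that flags unassigned or doubly-assigned slots (the function name is illustrative, not part of any FrogDB API):

```python
def check_slot_coverage(topology: dict) -> tuple[list[int], list[int]]:
    """Return (unassigned_slots, conflicting_slots) for a topology document."""
    owners = [0] * 16384
    for node in topology["nodes"]:
        if node.get("role") != "primary":
            continue  # replicas do not own slots directly
        for r in node.get("slots", []):
            for slot in range(r["start"], r["end"] + 1):
                owners[slot] += 1
    unassigned = [s for s, n in enumerate(owners) if n == 0]
    conflicting = [s for s, n in enumerate(owners) if n > 1]
    return unassigned, conflicting
```

Applied to the example above, slots 10923-16383 come back unassigned, which a complete production topology would not allow.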

A primary node:

  • Owns one or more slot ranges.
  • Accepts writes for its owned slots.
  • Streams changes to replicas via the WAL.
  • Responds to client requests or sends -MOVED redirects.

A replica node:

  • Holds a full copy of its primary’s dataset.
  • Is read-only; clients enable reads on it with the READONLY command.
  • Is a candidate for failover promotion.
  • Can serve stale reads to scale read throughput.
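
When a request reaches a node that does not own the key's slot, the node answers with a -MOVED redirect, which mirrors Redis Cluster's `MOVED <slot> <host>:<port>` error format. A client-side parsing sketch (a hypothetical helper, not part of any official client library):

```python
def parse_moved(err: str) -> tuple[int, str, int]:
    """Parse 'MOVED <slot> <host>:<port>' into (slot, host, port)."""
    word, slot, addr = err.split()
    if word != "MOVED":
        raise ValueError(f"not a MOVED redirect: {err!r}")
    host, _, port = addr.rpartition(":")
    return int(slot), host, int(port)
```

A cluster-aware client uses the parsed slot and address to update its slot map and retry the command against the owning node.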

Each node exposes an admin API on a separate port:

| Endpoint | Method | Description |
| --- | --- | --- |
| /admin/cluster | POST | Receive topology update |
| /admin/cluster | GET | Return current topology |
| /admin/health | GET | Health check |
| /admin/replication | GET | Replication status |

The admin API must be secured. Options (in order of recommendation):

| Method | Description | Use Case |
| --- | --- | --- |
| Network isolation | Bind admin port to localhost or a private network | Default, simplest |
| mTLS | Mutual TLS with client certificates | Production, external access |
| Bearer token | Shared secret in the Authorization header | Simple auth for trusted networks |

[cluster.admin]
bind = "127.0.0.1" # Bind to localhost only (default)
port = 6380
# For external access with auth:
# bind = "0.0.0.0"
# auth-token = "your-secret-token"

The /admin/health endpoint returns 200 OK when healthy or 503 Service Unavailable when unhealthy.

Health criteria:

  • Process is running and accepting connections.
  • All shard workers are responsive.
  • Memory is below critical threshold (default 95%).
  • Persistence is not blocked.
  • Cluster topology is valid.
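
The criteria above combine into a single readiness decision. A sketch of how a health handler might map them to the two HTTP status codes (the status fields are illustrative, not FrogDB's actual internals):

```python
from dataclasses import dataclass

@dataclass
class NodeStatus:
    accepting_connections: bool
    shards_responsive: bool
    memory_used_percent: float
    persistence_blocked: bool
    topology_valid: bool

def health_status(s: NodeStatus, memory_critical_percent: float = 95.0) -> int:
    """Return 200 when all health criteria pass, else 503."""
    healthy = (
        s.accepting_connections
        and s.shards_responsive
        and s.memory_used_percent < memory_critical_percent
        and not s.persistence_blocked
        and s.topology_valid
    )
    return 200 if healthy else 503
```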

All nodes in a cluster should have the same num-shards configuration for predictable behavior.

[cluster]
orchestrator-contact-timeout-ms = 60000 # Max time without orchestrator contact
topology-refresh-interval-ms = 30000 # Periodic topology refresh
[cluster.admin]
bind = "127.0.0.1"
port = 6380
[health]
memory-critical-percent = 95
shard-ping-timeout-ms = 100

See Replication for replica authentication setup.


| Scenario | Cluster Behavior |
| --- | --- |
| Single orchestrator down | Others continue, no impact |
| Orchestrator minority partitioned | Majority continues making decisions |
| Orchestrator majority down | Cluster frozen (no topology changes, no failover) |
| All orchestrators down | Nodes continue with last known topology |

What works without orchestrator: All read/write operations, client connections, established replication streams, health endpoint responses.

What requires orchestrator: Automatic failover, slot migration, adding/removing nodes, topology updates.


| Metric | Description |
| --- | --- |
| frogdb_cluster_epoch | Current local epoch |
| frogdb_cluster_epoch_stale_detections | Times a stale epoch was detected |
| frogdb_cluster_last_orchestrator_contact | Timestamp of the last orchestrator message |
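
The epoch metrics above imply how nodes guard against out-of-order topology pushes: an update whose epoch is not strictly newer than the local epoch is rejected and counted. A sketch under that assumption (class and field names are illustrative):

```python
class TopologyState:
    """Tracks the locally applied topology epoch and stale-update count."""

    def __init__(self):
        self.epoch = 0             # reported as frogdb_cluster_epoch
        self.stale_detections = 0  # reported as frogdb_cluster_epoch_stale_detections

    def apply(self, update: dict) -> bool:
        """Apply a topology update; reject anything not strictly newer."""
        if update["epoch"] <= self.epoch:
            self.stale_detections += 1
            return False
        self.epoch = update["epoch"]
        return True
```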