Executive Summary
This document specifies the architecture, design decisions, and implementation details for Meridian, a production-grade system that transfers terabyte-scale datasets between nodes with zero data loss.
The system provides resumable transfers, content-defined deduplication, multi-path parallelism, end-to-end cryptographic integrity verification, and automatic failover — all while saturating available network bandwidth.
Requirements
Functional Requirements
- Bulk Transfer: Transfer datasets from 1 GB to 50+ TB between two nodes (source and destination).
- Resumability: If interrupted at any point (crash, network failure, reboot), resume from the point of interruption rather than restarting from scratch.
- Incremental Sync: When transferring a previously-transferred dataset, only changed portions are re-sent.
- Integrity Guarantee: The receiver cryptographically verifies that received data is bit-for-bit identical to the source. Silent corruption is detected.
- Atomic Commit: A transfer is either fully committed or not committed at all. No partial states are visible to consumers.
- Multi-File Support: A single transfer job can include thousands of files in a directory tree, preserving structure on the destination.
Performance Targets
| Metric | Target | Rationale |
|---|---|---|
| Throughput | ≥80% of link capacity | On a 10 Gbps link, achieve ≥8 Gbps sustained |
| Chunk Latency (p99) | <50 ms | End-to-end time for one chunk: send, verify, ACK |
| Resume Time | <5 seconds | Time from process restart to transfer resumption |
| Dedup Ratio (incremental) | ≥40% | For typical edit patterns on previously-transferred data |
Reliability Targets
| Metric | Target | Rationale |
|---|---|---|
| Data Loss | Zero | Not one byte lost, ever. Cryptographic verification end-to-end. |
| Transfer Success Rate | 99.999% | Measured over 30-day rolling window |
| Recovery from Node Crash | Automatic | WAL-based recovery, no manual intervention |
| Recovery from Network Partition | Automatic | Reconnect with exponential backoff, delta sync on reconnect |
Security Requirements
- Mutual TLS (TLS 1.3) for all data in transit, with certificates signed by the internal CA
- Encryption at rest with per-transfer data encryption keys
- RBAC-gated transfer requests and an immutable audit log
Full details appear in the Security section.
Scalability
- Datasets up to 50 TB in a single transfer job
- Up to 10 million chunks per transfer (~40 TB at 4 MB average)
- Up to 100 concurrent transfer jobs per node
- Horizontal scaling via multiple sender/receiver pairs
Observability
- Metrics: Prometheus-compatible on :9090 — throughput, chunk latency histogram, dedup ratio, error rate, WAL backlog
- Tracing: OpenTelemetry distributed tracing with per-chunk spans. 100% sampling on failures, 10% on success
- Alerting: Pre-configured rules for throughput drops, integrity failures, stalled transfers, disk space, certificate expiry
Assumptions & Constraints
- Both nodes run Linux (kernel 5.6+ for io_uring support)
- Network link at least 1 Gbps. Optimized for 10–100 Gbps
- NVMe SSD storage on both nodes
- All ordering uses logical clocks, not wall clocks (no time sync required)
- Designed for batch/bulk transfers, not real-time streaming
High-Level Architecture
The system consists of three planes: the data plane (chunking, transfer, reassembly), the control plane (coordination, state management, health monitoring), and the observability plane (metrics, tracing, alerting).
End-to-End Data Flow
Files are scanned and CDC-chunked at the source, checked against the destination's dedup index, queued and sent over the transport pool, then verified, reassembled, and atomically committed at the destination, with every state transition recorded in the WAL on both sides.
Transfer State Machine
Failure at any state triggers RETRYING (exponential backoff). After max retries: FAILED with diagnostic snapshot. Operators can PAUSE or CANCEL at any point.
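The states in play can be sketched as a small enum. A minimal sketch in Rust: the running-phase names mirror the tracing span names under Observability (scan, negotiate, transfer, verify, commit), the failure path follows the retry rules above, and the type itself is illustrative rather than the production definition.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum TransferState {
    Scanning,                  // enumerate files, build manifest
    Negotiating,               // exchange dedup state with the receiver
    Transferring,              // chunks in flight
    Verifying,                 // Merkle root comparison
    Committed,                 // atomic commit complete
    Retrying { attempt: u32 }, // exponential backoff in progress
    Failed,                    // max retries exhausted; diagnostic snapshot taken
    Paused,
    Cancelled,
}

const MAX_ATTEMPTS: u32 = 5;

// Any failure sends the job to Retrying; exhausting retries ends in Failed.
// Operator-driven states (Paused/Cancelled) and Committed are terminal here.
fn on_failure(state: TransferState) -> TransferState {
    match state {
        TransferState::Retrying { attempt } if attempt + 1 >= MAX_ATTEMPTS => TransferState::Failed,
        TransferState::Retrying { attempt } => TransferState::Retrying { attempt: attempt + 1 },
        TransferState::Paused | TransferState::Cancelled | TransferState::Committed => state,
        _ => TransferState::Retrying { attempt: 0 },
    }
}

fn main() {
    let mut s = TransferState::Transferring;
    for _ in 0..=MAX_ATTEMPTS {
        s = on_failure(s);
    }
    assert_eq!(s, TransferState::Failed);
    println!("final state: {:?}", s);
}
```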
Component Responsibilities
| Component | Location | Responsibility |
|---|---|---|
| File Scanner | Source | Discover files, detect changes, feed chunker |
| CDC Chunker | Source | Split files into content-defined variable-size chunks |
| Dedup Index | Both | Track which chunks exist at each node (Bloom filter + RocksDB) |
| Transfer Queue | Source | Priority queue of chunks to send, backed by WAL |
| Transport Pool | Both | 8 gRPC connections, 16 streams each, TLS 1.3, flow control |
| Integrity Verifier | Destination | CRC32C, SHA-256, Merkle tree verification |
| Reassembler | Destination | Reconstruct files from chunks using manifest |
| WAL | Both | Crash-safe state tracking (append-only, fsync'd) |
| Coordinator | Both | State machine management, retry logic, job lifecycle |
| Metrics Exporter | Both | Prometheus metrics, OpenTelemetry traces |
Content-Defined Chunking
Fixed-size chunking (e.g., splitting every 4 MB) breaks catastrophically on insertions. Consider a 1 TB file where a 50 KB row is inserted near the beginning:
- Fixed chunks: Every chunk boundary after the insertion shifts by 50 KB. Every chunk has a different SHA-256 hash. You must re-transfer the entire 1 TB file.
- CDC chunks: Boundaries are data-dependent via rolling hash. An insertion only affects 2–3 chunks near the edit point. All other boundaries remain stable.
CDC reduces re-transfer volume by 90–99% compared to fixed-size chunking.
Algorithm: Buzhash
We use Buzhash (bitwise rolling hash) rather than Rabin fingerprinting:
| Property | Buzhash | Rabin |
|---|---|---|
| Speed | ~3.5 GB/s | ~2.8 GB/s |
| Distribution Quality | Good | Excellent |
| Implementation | XOR + rotate (simple) | Polynomial arithmetic (moderate) |
We chose Buzhash because chunking is CPU-bound and runs inline with I/O. The 25% speed advantage keeps pace with NVMe read speeds (~3.5 GB/s). Rabin's better distribution doesn't matter because we rely on SHA-256 for content addressing, not the rolling hash.
Chunk Size Parameters
| Parameter | Value | Rationale |
|---|---|---|
| Minimum Chunk | 256 KB | Prevents pathological tiny chunks from repetitive data patterns |
| Average Chunk | 4 MB | Sweet spot: 2.5M chunks per 10 TB fits in ~640 MB RAM index |
| Maximum Chunk | 16 MB | Caps worst-case for uniform data (zeros, compressed streams) |
| Window Size | 48 bytes | Sliding window for rolling hash computation |
The 4 MB average is a three-way tension: smaller chunks (1 MB) give better dedup granularity but produce 10M chunks per 10 TB, requiring ~2.5 GB for the index. Larger chunks (16 MB) reduce index size but miss small edits. 4 MB is tunable per-environment.
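A minimal Buzhash CDC sketch using the parameters above (48-byte window; 256 KB minimum, 4 MB average via a 22-bit boundary mask, 16 MB maximum). The derived byte table and the test data are illustrative; a real implementation ships a fixed precomputed table so boundaries stay stable across versions.

```rust
const WIN: usize = 48;                     // rolling-hash window
const MIN_CHUNK: usize = 256 * 1024;       // 256 KB floor
const MAX_CHUNK: usize = 16 * 1024 * 1024; // 16 MB ceiling
const MASK: u32 = (1 << 22) - 1;           // 22-bit mask => ~4 MB average chunk

// Per-byte random table, derived deterministically here for the sketch.
fn byte_table() -> [u32; 256] {
    let mut t = [0u32; 256];
    let mut x: u32 = 0x9E37_79B9;
    for e in t.iter_mut() {
        x ^= x << 13; x ^= x >> 17; x ^= x << 5; // xorshift32
        *e = x;
    }
    t
}

/// Return chunk end offsets for `data` using Buzhash (XOR + rotate).
fn cdc_boundaries(data: &[u8], t: &[u32; 256]) -> Vec<usize> {
    let mut cuts = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let end = (start + MAX_CHUNK).min(data.len());
        let mut cut = end; // fall back to a max-size (or EOF) cut
        if end - start > MIN_CHUNK + WIN {
            // Seed the hash over the window ending at start + MIN_CHUNK.
            let mut h: u32 = 0;
            let base = start + MIN_CHUNK - WIN;
            for i in 0..WIN {
                h = h.rotate_left(1) ^ t[data[base + i] as usize];
            }
            let mut i = start + MIN_CHUNK;
            loop {
                if h & MASK == 0 { cut = i; break; } // data-dependent boundary
                if i >= end { break; }
                // Slide by one byte: expire data[i - WIN], admit data[i].
                h = h.rotate_left(1)
                    ^ t[data[i - WIN] as usize].rotate_left(WIN as u32)
                    ^ t[data[i] as usize];
                i += 1;
            }
        }
        cuts.push(cut);
        start = cut;
    }
    cuts
}

fn main() {
    let t = byte_table();
    // Pseudorandom data so boundaries actually appear.
    let mut x = 1u32;
    let data: Vec<u8> = (0..(32 << 20))
        .map(|_| { x ^= x << 13; x ^= x >> 17; x ^= x << 5; x as u8 })
        .collect();
    println!("chunk ends: {:?}", cdc_boundaries(&data, &t));
}
```

Note how the minimum is enforced for free: the scan simply starts MIN_CHUNK bytes past the previous cut, and the maximum is the fallback when no boundary fires.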
The Chunk Manifest
```rust
struct ChunkMeta {
    chunk_index: u64,
    file_path: String,
    offset: u64,
    length: u32,
    sha256: [u8; 32],
}
```

The manifest records each chunk's position (byte offset within the file), size, and content hash. It's deterministic: given the same file content and CDC parameters, the same boundaries are always produced. For multi-file transfers, chunk indices are global across all files. The sender can transmit chunks in any order; the receiver uses the manifest to place each chunk correctly.
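A sketch of that placement on the receiver side: chunks arriving in any order are written at their manifest offsets. The SHA-256 check happens before this point in the real pipeline, and the file handle would be cached rather than reopened per chunk.

```rust
use std::fs::OpenOptions;
use std::io::{Seek, SeekFrom, Write};

#[allow(dead_code)]
struct ChunkMeta {
    chunk_index: u64,
    file_path: String,
    offset: u64,
    length: u32,
    sha256: [u8; 32],
}

// Write one verified chunk at its manifest offset; arrival order is irrelevant.
fn place_chunk(meta: &ChunkMeta, data: &[u8]) -> std::io::Result<()> {
    assert_eq!(data.len() as u32, meta.length);
    let mut f = OpenOptions::new().create(true).write(true).open(&meta.file_path)?;
    f.seek(SeekFrom::Start(meta.offset))?; // file stays sparse until earlier chunks land
    f.write_all(data)
}

fn main() -> std::io::Result<()> {
    let meta = ChunkMeta {
        chunk_index: 1,
        file_path: "out.bin".into(),
        offset: 4,
        length: 4,
        sha256: [0; 32],
    };
    place_chunk(&meta, b"data") // chunk 0 can arrive later and fill bytes 0..4
}
```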
Deduplication Index
Each node tracks which chunk hashes it already holds (see Component Responsibilities): an in-memory Bloom filter answers "definitely not present" without touching disk, and RocksDB holds the authoritative hash-to-location mapping. During dedup negotiation the sender offers the manifest's hashes and the receiver replies with the subset it is missing; only those chunks cross the wire.
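A sketch of the lookup path, with a minimal hand-rolled Bloom filter standing in for a library implementation; the RocksDB confirmation step is elided.

```rust
struct Bloom { bits: Vec<u64>, k: u32 }

impl Bloom {
    fn new(nbits: usize, k: u32) -> Self {
        Bloom { bits: vec![0; (nbits + 63) / 64], k }
    }
    // Derive the i-th probe from the chunk's SHA-256 itself (double hashing).
    fn probe(&self, h: &[u8; 32], i: u32) -> usize {
        let h1 = u64::from_le_bytes(h[0..8].try_into().unwrap());
        let h2 = u64::from_le_bytes(h[8..16].try_into().unwrap()) | 1;
        h1.wrapping_add((i as u64).wrapping_mul(h2)) as usize % (self.bits.len() * 64)
    }
    fn insert(&mut self, h: &[u8; 32]) {
        for i in 0..self.k {
            let b = self.probe(h, i);
            self.bits[b / 64] |= 1 << (b % 64);
        }
    }
    // false => definitely absent: skip the disk lookup, send the chunk.
    // true  => probably present: confirm against RocksDB before deduping.
    fn maybe_contains(&self, h: &[u8; 32]) -> bool {
        (0..self.k).all(|i| {
            let b = self.probe(h, i);
            self.bits[b / 64] & (1 << (b % 64)) != 0
        })
    }
}

fn main() {
    let mut index = Bloom::new(8 * 1024 * 1024, 7); // ~1 MB of bits, 7 probes
    let chunk_hash = [0xAB_u8; 32];
    index.insert(&chunk_hash);
    assert!(index.maybe_contains(&chunk_hash));
    println!("unknown hash hit? {}", index.maybe_contains(&[0x01_u8; 32]));
}
```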
Transport Layer
Protocol Stack: gRPC over HTTP/2 over TCP
Why 8 Connections and 16 Streams?
A lost packet stalls only the TCP connection it belongs to, so the connection count bounds the blast radius of head-of-line blocking: with 8 connections, a loss stalls at most 1/8 of in-flight data. Streams within a connection are cheap, so 16 streams per connection keep enough chunks in flight to fill the pipe without the kernel state and slower congestion-control convergence of 128 separate connections (see the tradeoffs table).
What is Multiplexing?
Without multiplexing, sending 16 chunks simultaneously requires 16 TCP connections. With multiplexing, one TCP connection carries all 16 by interleaving. HTTP/2 chops each chunk's data into small frames (16 KB default), stamps each frame with a stream ID, and interleaves them on the wire. The receiver reads frames, sorts by stream ID, and reassembles each stream independently. This is purely a framing-layer concern — invisible to application code.
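A toy illustration of that framing: 16 KB frames tagged with a stream ID, interleaved on one byte stream and sorted back out on the far side. Real HTTP/2 framing carries more header fields plus flow-control rules.

```rust
use std::collections::HashMap;

const FRAME_MAX: usize = 16 * 1024; // HTTP/2 default max frame size

struct Frame { stream_id: u32, payload: Vec<u8> }

// Sender: round-robin 16 KB frames from each in-progress chunk onto one wire.
fn interleave(chunks: Vec<(u32, Vec<u8>)>) -> Vec<Frame> {
    let mut out = Vec::new();
    let mut cursors: Vec<(u32, Vec<u8>, usize)> =
        chunks.into_iter().map(|(id, data)| (id, data, 0)).collect();
    let mut progress = true;
    while progress {
        progress = false;
        for (id, data, pos) in cursors.iter_mut() {
            if *pos < data.len() {
                let end = (*pos + FRAME_MAX).min(data.len());
                out.push(Frame { stream_id: *id, payload: data[*pos..end].to_vec() });
                *pos = end;
                progress = true;
            }
        }
    }
    out
}

// Receiver: sort frames by stream ID and reassemble each stream independently.
fn demux(frames: Vec<Frame>) -> HashMap<u32, Vec<u8>> {
    let mut streams: HashMap<u32, Vec<u8>> = HashMap::new();
    for f in frames {
        streams.entry(f.stream_id).or_default().extend_from_slice(&f.payload);
    }
    streams
}

fn main() {
    let a = vec![0xAA; 40 * 1024]; // two "chunks" on streams 1 and 2
    let b = vec![0xBB; 40 * 1024];
    let streams = demux(interleave(vec![(1, a.clone()), (2, b.clone())]));
    assert_eq!(streams[&1], a);
    assert_eq!(streams[&2], b);
}
```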
Why TCP and Not UDP?
TCP operates at the packet level (1.5 KB). Our application operates at the chunk level (4 MB). A 4 MB chunk is ~2,700 packets. If one packet is lost:
- With TCP: Kernel retransmits just 1.5 KB. Application receives a complete 4 MB chunk. Cost: 1.5 KB re-sent, ~50 ms delay.
- With UDP (chunk-level retry): SHA-256 verification fails. Discard all 4 MB and re-request. Cost: 4 MB re-sent — a 2,700× amplification. At 0.1% loss, ~70% of all data is re-transmitted.
- With UDP (packet-level reliability): Track individual packets, detect losses, retransmit. But this requires sequence numbers, retransmit timers, duplicate detection, congestion control — you've reimplemented TCP.
TCP's ordering guarantee is a small tax (occasional HOL blocking). Its packet-level reliability saves us from 2,700× retransmit amplification.
QUIC Fallback
QUIC runs over UDP but builds its own per-stream reliability, eliminating TCP's head-of-line blocking. Used as a fallback when measured packet loss exceeds 1% (typically WAN/internet paths). Not the default because: (1) NIC hardware offloading only works for TCP; (2) BBR over TCP is more mature; (3) kernel bypass (io_uring) integrates better with TCP sockets.
Adaptive Compression
| Algorithm | Ratio | Speed | When Used |
|---|---|---|---|
| LZ4 | 2.1× | 4.5 GB/s | CPU bottleneck (fast network, slow CPU) |
| Zstd level 3 | 3.2× | 1.2 GB/s | Balanced default |
| Zstd level 9 | 3.8× | 200 MB/s | Network bottleneck (slow WAN link) |
| None | 1.0× | ∞ | Already compressed/encrypted data (entropy >7.5 bits/byte) |
The first 64 KB of each chunk is sampled to estimate entropy. High-entropy data is sent uncompressed. Algorithm choice between LZ4/Zstd is based on the ratio of measured network throughput to available CPU cycles.
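A sketch of that decision, assuming the 7.5 bits/byte threshold from the table. The throughput inputs and the 100 MB/s WAN cutoff are illustrative placeholders for the real measurements.

```rust
// Shannon entropy (bits/byte) of a sample: 8.0 means incompressible noise.
fn entropy_bits_per_byte(sample: &[u8]) -> f64 {
    let mut hist = [0u64; 256];
    for &b in sample { hist[b as usize] += 1; }
    let n = sample.len() as f64;
    hist.iter().filter(|&&c| c > 0)
        .map(|&c| { let p = c as f64 / n; -p * p.log2() })
        .sum()
}

// Pick a codec per the table: skip compression for high-entropy data, otherwise
// trade ratio against CPU based on measured throughputs (both in bytes/sec).
fn choose_codec(chunk: &[u8], net_bytes_per_sec: f64, cpu_compress_bytes_per_sec: f64) -> &'static str {
    let sample = &chunk[..chunk.len().min(64 * 1024)];
    if entropy_bits_per_byte(sample) > 7.5 {
        "none"     // already compressed/encrypted
    } else if cpu_compress_bytes_per_sec < net_bytes_per_sec {
        "lz4"      // CPU is the bottleneck: use the cheapest codec
    } else if net_bytes_per_sec < 100e6 {
        "zstd-9"   // slow WAN (illustrative cutoff): spend CPU to shrink bytes
    } else {
        "zstd-3"   // balanced default
    }
}

fn main() {
    let noise: Vec<u8> = (0..64 * 1024u32)
        .map(|i| (i.wrapping_mul(2654435761) >> 13) as u8)
        .collect();
    let zeros = vec![0u8; 64 * 1024];
    // 10 Gbps link = 1.25e9 B/s; Zstd-3 at ~1.2e9 B/s (from the table above).
    println!("{}", choose_codec(&noise, 1.25e9, 1.2e9)); // likely "none": high entropy
    println!("{}", choose_codec(&zeros, 1.25e9, 1.2e9)); // "lz4": CPU-bound on a fast net
}
```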
Zero-Copy I/O & Kernel Optimizations
A conventional read()/write() loop copies each chunk twice between kernel and user space. The hot path avoids this with sendfile(), batched asynchronous I/O via io_uring (kernel 5.6+), and huge pages (the "2× memory copies" row in the table below).
Why a Single TCP Connection Doesn't Fill the Pipe
A 10 Gbps NIC can push 1.25 GB/s onto the wire. A single “normal” TCP connection achieves a fraction of this. Five independent bottlenecks compound, summarized in the table below.
Combined Effect
| Bottleneck | Effect on 10 Gbps | Mitigation |
|---|---|---|
| Default 4 MB buffers | 640 Mbps (6%) | 128 MB per-socket buffers |
| CUBIC cwnd ramp | ~7 Gbps (70%) | BBR congestion control |
| HOL blocking (0.1% loss) | ~4 Gbps (40%) | 8 independent TCP connections |
| Single-threaded receiver | ~2.8 Gbps (28%) | 64 parallel processing workers |
| 2× memory copies | ~2.2 Gbps (22%) | sendfile() + io_uring + huge pages |
| All mitigations applied | ~8 Gbps (80%) | Remaining 20%: TLS, checksums, fsync |
Data Integrity
Three layers of verification, each catching a different class of corruption at a different point in the pipeline: CRC32C on each frame catches wire errors cheaply, per-chunk SHA-256 verifies the decompressed content end-to-end, and a Merkle tree over all chunk hashes proves the dataset is complete and correctly ordered.
SHA-256 alone is technically sufficient. CRC32C adds value because it is 20× faster and catches wire errors before wasting CPU on decompression and SHA-256. The Merkle tree adds value because per-chunk SHA-256 cannot detect missing or misordered chunks. A single Merkle root comparison proves the entire dataset is correct; without it, you'd compare 2.5 million hashes individually.
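A sketch of the Merkle layer over the per-chunk hashes, assuming the sha2 crate as a dependency; promoting odd nodes unchanged is one of several valid conventions and an assumption here.

```rust
use sha2::{Digest, Sha256};

// Fold a level of hashes pairwise until one root remains.
fn merkle_root(mut level: Vec<[u8; 32]>) -> [u8; 32] {
    assert!(!level.is_empty());
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| {
                if pair.len() == 2 {
                    let mut h = Sha256::new();
                    h.update(pair[0]);
                    h.update(pair[1]);
                    h.finalize().into()
                } else {
                    pair[0] // odd node promoted unchanged
                }
            })
            .collect();
    }
    level[0]
}

fn main() {
    // Leaves are the per-chunk SHA-256 values from the manifest, in order.
    let leaves: Vec<[u8; 32]> = (0u8..5).map(|i| Sha256::digest([i]).into()).collect();
    println!("root: {:02x?}", merkle_root(leaves));
}
```

Because leaves enter in manifest order, a missing or swapped chunk changes the root even when every individual chunk hash is valid.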
Reliability & Fault Tolerance
Write-Ahead Log (WAL)
The WAL is an append-only file on disk — not a service, not a database, not a distributed system. It stores a sequence of binary records, each describing a state transition:
| Record Type | Meaning | Fields |
|---|---|---|
| CHUNK_QUEUED | Chunk added to send queue | transfer_id, chunk_index, sha256 |
| CHUNK_SENT | Chunk data written to socket | transfer_id, chunk_index |
| CHUNK_ACKED | Receiver confirmed receipt | transfer_id, chunk_index, ack_status |
| CHUNK_COMMITTED | Written to final location + fsync'd | transfer_id, chunk_index |
| STATE_CHANGE | Transfer phase transition | transfer_id, old_state, new_state |
| CHECKPOINT | Snapshot of all in-flight state | full state map |
Why a WAL and Not a Database?
The WAL is append-only: every write is sequential I/O at the end of the file — the fastest possible disk operation. NVMe SSDs can do 500K+ sequential small writes per second. A database doing random I/O, maintaining B-tree indexes, and journaling is 10–50× slower for this access pattern. With 64 parallel workers recording state, a database's row-level locking creates contention. An append-only file has zero lock contention.
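A sketch of the append path. The record layout is simplified (the real format would include CRCs), and the handle is opened once in practice, not per record.

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Append one length-prefixed record and fsync. A state transition is durable
// only after sync_data() returns; recovery replays these records in order.
fn wal_append(path: &str, rec_type: u8, payload: &[u8]) -> std::io::Result<()> {
    let mut wal = OpenOptions::new().create(true).append(true).open(path)?;
    wal.write_all(&[rec_type])?;
    wal.write_all(&(payload.len() as u32).to_le_bytes())?;
    wal.write_all(payload)?;
    wal.sync_data()
}

fn main() -> std::io::Result<()> {
    const CHUNK_ACKED: u8 = 3; // record-type tag values are illustrative
    wal_append("transfer.wal", CHUNK_ACKED, b"transfer-42:chunk-1337")
}
```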
Crash Recovery Scenarios
- Sender crash (e.g., SIGKILL): on restart, replay the WAL from the last CHECKPOINT; chunks recorded SENT but never ACKED are re-queued, and the transfer resumes within the 5-second target.
- Receiver crash mid-write: chunks without a CHUNK_COMMITTED record are discarded on restart and re-requested as gaps.
- Network partition: reconnect with exponential backoff, then delta-sync so only un-ACKED chunks are re-sent.
Retry Strategy
```
delay = min(base × 2^attempt + random(0, jitter), max_delay)
base: 100ms, jitter: 50ms, max: 30s, max_attempts: 5
```

Per-connection retries: if a gRPC stream breaks, reconnect with fresh TLS handshake. All in-flight chunks on that stream are re-queued.
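The formula transcribed directly; the jitter source here is an illustrative stand-in for a real RNG.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

fn retry_delay(attempt: u32) -> Duration {
    let (base_ms, jitter_ms, max_ms) = (100u64, 50u64, 30_000u64);
    // Illustrative jitter source; real code would use a proper RNG.
    let nanos = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().subsec_nanos();
    let jitter = nanos as u64 % jitter_ms;
    let exp = base_ms.saturating_mul(1u64 << attempt.min(20)); // base × 2^attempt, overflow-safe
    Duration::from_millis(exp.saturating_add(jitter).min(max_ms))
}

fn main() {
    for attempt in 0..5 {
        // ~100ms, ~200ms, ~400ms, ~800ms, ~1.6s (plus jitter, capped at 30s)
        println!("attempt {attempt}: {:?}", retry_delay(attempt));
    }
}
```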
Chaos Engineering Tests
| Test | Mechanism | Validates |
|---|---|---|
| Kill sender mid-transfer | SIGKILL + restart | WAL replay, resume from checkpoint |
| Kill receiver mid-write | SIGKILL + restart | Partial chunk cleanup, gap filling |
| Network partition (60s) | iptables DROP | Reconnect + delta sync |
| Flip random bits on wire | tc netem corrupt 1% | CRC32C + SHA-256 catch all corruption |
| Disk full on receiver | fallocate fill disk | Backpressure, THROTTLE ack, retry |
| Slow disk (10ms latency) | dm-delay | I/O queue depth adaptation |
| OOM pressure | cgroups memory limit | Graceful degradation, fewer streams |
Security
Transport Security
All data in transit uses TLS 1.3 with the TLS_AES_256_GCM_SHA384 and TLS_CHACHA20_POLY1305_SHA256 cipher suites. Mutual TLS (mTLS) requires both sender and receiver to present X.509 certificates signed by the internal CA. Certificates rotate every 90 days.
Encryption at Rest — Key Hierarchy
Rather than one volume-wide key (LUKS), each transfer is encrypted under its own data encryption key (DEK):
- Per-transfer DEKs limit blast radius of key compromise to one transfer, not all data
- Destroying a DEK cryptographically erases that transfer instantly, without re-encrypting the volume
- Chunks can live on any storage backend (local disk, object store, archive) carrying their own encryption
When LUKS is acceptable: single-tenant systems where per-transfer key isolation isn't required.
Access Control & Audit
RBAC with OPA policy engine evaluates every transfer request against policies: allowed source/dest pairs, max transfer size, time-of-day windows, rate limits per-user. Immutable audit log records every API call, state transition, and access with who, what, when, from_ip, transfer_id, result. Retained for 2 years.
Observability
Key Metrics (Prometheus)
| Metric | Type | Description |
|---|---|---|
| transfer_bytes_total | Counter | Total bytes transferred by direction and compression type |
| transfer_chunks_total | Counter | Chunks sent, received, deduped, or failed |
| transfer_throughput_bytes | Gauge | Current throughput per connection |
| chunk_latency_seconds | Histogram | End-to-end chunk delivery time (p50/p95/p99) |
| wal_entries_pending | Gauge | Unprocessed WAL entries (backlog indicator) |
| dedup_ratio | Gauge | Fraction of chunks skipped via deduplication |
| integrity_failures_total | Counter | CRC or SHA mismatches (should be ~0) |
| connection_resets_total | Counter | gRPC stream resets (network health indicator) |
Distributed Tracing
Every transfer gets a root span. Child spans for each phase: scan_files, chunk_file, negotiate_dedup, transfer_chunks (with per-chunk sub-spans for compress/encrypt/send), verify_merkle, commit. Sampling: 100% for failed transfers, 10% for successful (adaptive based on volume).
Alerting Rules
| Alert | Condition | Severity |
|---|---|---|
| Throughput Drop | <50% of baseline for 5 min | WARN |
| Integrity Failure | Any checksum mismatch | CRITICAL |
| Transfer Stalled | No progress for 2 min | WARN |
| WAL Backlog | >10K pending entries | WARN |
| Connection Failures | >5 resets in 1 min | CRITICAL |
| Disk Space Low | <10% free on receiver | CRITICAL |
| Certificate Expiry | <7 days to expiration | WARN |
Design Tradeoffs Summary
Every major decision involved rejecting viable alternatives:
| Decision | Chosen | Runner-Up | Key Tradeoff |
|---|---|---|---|
| Transport | gRPC streaming | Raw TCP | 5–8% overhead vs. months of protocol engineering |
| Chunking | CDC (Buzhash) | Fixed-size | Code complexity vs. 90–99% savings on incremental |
| Chunk Size | 4 MB average | 1 MB / 16 MB | RAM vs. dedup granularity vs. round-trips |
| Integrity | 3-layer | SHA-256 only | Complexity vs. early detection + failure localization |
| Reliability | WAL | Database | Sequential I/O speed vs. query flexibility |
| Dedup | Chunk-level | File-level / None | CPU+RAM overhead vs. 40–80% bandwidth savings |
| Compression | Adaptive | Fixed Zstd-3 | Decision logic vs. optimal per-environment |
| Flow Control | Receiver-driven | Sender rate-limit | ACK overhead vs. perfect receiver knowledge |
| Encryption | Chunk-level AES | Volume LUKS | Code complexity vs. key granularity |
| Coordination | etcd + RocksDB | Postgres | Operational overhead vs. multi-node needs |
| Connections | 8 TCP × 16 streams | 1 TCP / 128 TCP | HOL blast radius vs. connection overhead |
| Primary Protocol | TCP + QUIC fallback | QUIC-only | NIC offload + maturity vs. loss handling |
Operations & Capacity Planning
Kernel Tuning
```bash
# Socket buffers
net.core.rmem_max = 134217728   # 128 MB receive buffer max
net.core.wmem_max = 134217728   # 128 MB send buffer max
net.ipv4.tcp_rmem = 4096 1048576 134217728
net.ipv4.tcp_wmem = 4096 1048576 134217728

# Congestion control
net.ipv4.tcp_congestion_control = bbr

# Network queue
net.core.netdev_max_backlog = 250000

# VM tuning
vm.dirty_ratio = 40
vm.swappiness = 10

# File descriptors
ulimit -n 1048576   # 1M open fds
```

Deployment Modes
Hardware Recommendations
| Component | Specification | Rationale |
|---|---|---|
| CPU | 16+ cores (AMD EPYC / Xeon) | SHA-256, Zstd, CDC hashing are CPU-bound |
| RAM | 64 GB+ | Chunk buffers (64×16MB=1GB), dedup index, RocksDB block cache |
| Network | 25 Gbps+ NIC (dual-port for HA) | 10TB @ 25Gbps = ~53 minutes |
| Storage | NVMe SSD, 2× transfer size | Chunk store + WAL. 3+ GB/s sequential writes |
| NIC Offload | TSO, GRO, TCP checksum offload | Reduces CPU load for TCP processing |
Transfer Time Estimates
| Data Size | 10 Gbps | 25 Gbps | 100 Gbps |
|---|---|---|---|
| 100 GB | ~1.3 min | ~32 sec | ~8 sec |
| 1 TB | ~13 min | ~5.3 min | ~1.3 min |
| 10 TB | ~2.2 hrs | ~53 min | ~13 min |
| 50 TB | ~11 hrs | ~4.4 hrs | ~66 min |
Times are computed at the nominal link rate (data size ÷ link speed), with 0% dedup and no compression credit. At the 80% utilization target, add roughly 25%; Zstd level 3 (3.2× on compressible data) and dedup on incremental transfers reduce actual times substantially.
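How the rows are derived, as a quick check against two cells of the table:

```rust
// time = bytes × 8 / link rate; matches the table at nominal link speed.
fn transfer_secs(bytes: f64, link_gbps: f64) -> f64 {
    bytes * 8.0 / (link_gbps * 1e9)
}

fn main() {
    let tb = 1e12;
    println!("10 TB @ 25 Gbps:  {:.0} min", transfer_secs(10.0 * tb, 25.0) / 60.0); // ~53 min
    println!("50 TB @ 100 Gbps: {:.0} min", transfer_secs(50.0 * tb, 100.0) / 60.0); // ~67 min
}
```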
When This System is Wrong
- Sub-100 GB, one-time transfer: Use rsync or scp. Engineering investment unjustified.
- Streaming/real-time data: Use Kafka, Pulsar, or Kinesis. This is a batch system.
- Cloud-to-cloud with managed services: Use AWS DataSync, GCP Transfer Service, or Azure Data Box.
- Cross-region over public internet: Consider Aspera, Signiant with proprietary WAN optimization.
- Immutable write-once data: Simple PUT to object storage with file-level checksums. CDC overhead adds no value.
Conclusion
Meridian is designed for organizations that need to move terabytes of data between nodes reliably, efficiently, and securely. Every design choice reflects a conscious tradeoff — gRPC over raw TCP for engineering velocity, CDC over fixed-size for incremental efficiency, WAL over database for I/O speed, multiple TCP connections over one for fault isolation.
The system is not the right tool for every job. For small, one-time transfers, rsync is simpler. For real-time streaming, Kafka is better. For cloud-managed environments, first-party transfer services may suffice. Meridian's value proposition is the combination of TB-scale throughput, zero-data-loss integrity, crash-safe resumability, and production-grade observability — all in a single, self-contained system.