Executive Summary
This document specifies the architecture, design decisions, and implementation details for Meridian, a production-grade system that transfers terabyte-scale datasets between nodes with zero data loss.
The system provides resumable transfers, content-defined deduplication, multi-path parallelism, end-to-end cryptographic integrity verification, and automatic failover — all while saturating available network bandwidth.
Requirements
Functional Requirements
- Bulk Transfer: Transfer datasets from 1 GB to 50+ TB between two nodes (source and destination).
- Resumability: If interrupted at any point (crash, network failure, reboot), resume from the point of interruption rather than restarting from scratch.
- Incremental Sync: When transferring a previously-transferred dataset, only changed portions are re-sent.
- Integrity Guarantee: The receiver cryptographically verifies that received data is bit-for-bit identical to the source. Silent corruption is detected.
- Atomic Commit: A transfer is either fully committed or not committed at all. No partial states are visible to consumers.
- Multi-File Support: A single transfer job can include thousands of files in a directory tree, preserving structure on the destination.
Performance Targets
| Metric | Target | Rationale |
|---|---|---|
| Throughput | ≥80% of link capacity | On a 10 Gbps link, achieve ≥8 Gbps sustained |
| Chunk Latency (p99) | <50 ms | End-to-end time for one chunk: send, verify, ACK |
| Resume Time | <5 seconds | Time from process restart to transfer resumption |
| Dedup Ratio (incremental) | ≥40% | For typical edit patterns on previously-transferred data |
Reliability Targets
| Metric | Target | Rationale |
|---|---|---|
| Data Loss | Zero | Not one byte lost, ever. Cryptographic verification end-to-end. |
| Transfer Success Rate | 99.999% | Measured over 30-day rolling window |
| Recovery from Node Crash | Automatic | WAL-based recovery, no manual intervention |
| Recovery from Network Partition | Automatic | Reconnect with exponential backoff, delta sync on reconnect |
Security Requirements
- Mutual TLS (TLS 1.3) for all data in transit, with certificates signed by the internal CA
- Encryption at rest with per-transfer data encryption keys
- RBAC-gated transfer requests and an immutable audit log
Full details appear in the Security section.
Scalability
- Datasets up to 50 TB in a single transfer job
- Up to 10 million chunks per transfer (~40 TB at 4 MB average)
- Up to 100 concurrent transfer jobs per node
- Horizontal scaling via multiple sender/receiver pairs
Observability
- Metrics: Prometheus-compatible on :9090 — throughput, chunk latency histogram, dedup ratio, error rate, WAL backlog
- Tracing: OpenTelemetry distributed tracing with per-chunk spans. 100% sampling on failures, 10% on success
- Alerting: Pre-configured rules for throughput drops, integrity failures, stalled transfers, disk space, certificate expiry
Assumptions & Constraints
- Both nodes run Linux (kernel 5.6+ for io_uring support)
- Network link at least 1 Gbps. Optimized for 10–100 Gbps
- NVMe SSD storage on both nodes
- All ordering uses logical clocks, not wall clocks (no time sync required)
- Designed for batch/bulk transfers, not real-time streaming
High-Level Architecture
The system consists of three planes: the data plane (chunking, transfer, reassembly), the control plane (coordination, state management, health monitoring), and the observability plane (metrics, tracing, alerting).
End-to-End Data Flow
Files are scanned and CDC-chunked at the source, checked against the destination's dedup index, queued and sent over the transport pool, then verified, reassembled, and atomically committed at the destination, with every state transition recorded in the WAL on both sides.
Transfer State Machine
Failure at any state triggers RETRYING (exponential backoff). After max retries: FAILED with diagnostic snapshot. Operators can PAUSE or CANCEL at any point.
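The states in play can be sketched as a small enum. A minimal sketch in Rust: the running-phase names mirror the tracing span names under Observability (scan, negotiate, transfer, verify, commit), the failure path follows the retry rules above, and the type itself is illustrative rather than the production definition.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
enum TransferState {
    Scanning,                  // enumerate files, build manifest
    Negotiating,               // exchange dedup state with the receiver
    Transferring,              // chunks in flight
    Verifying,                 // Merkle root comparison
    Committed,                 // atomic commit complete
    Retrying { attempt: u32 }, // exponential backoff in progress
    Failed,                    // max retries exhausted; diagnostic snapshot taken
    Paused,
    Cancelled,
}

const MAX_ATTEMPTS: u32 = 5;

// Any failure sends the job to Retrying; exhausting retries ends in Failed.
// Operator-driven states (Paused/Cancelled) and Committed are terminal here.
fn on_failure(state: TransferState) -> TransferState {
    match state {
        TransferState::Retrying { attempt } if attempt + 1 >= MAX_ATTEMPTS => TransferState::Failed,
        TransferState::Retrying { attempt } => TransferState::Retrying { attempt: attempt + 1 },
        TransferState::Paused | TransferState::Cancelled | TransferState::Committed => state,
        _ => TransferState::Retrying { attempt: 0 },
    }
}

fn main() {
    let mut s = TransferState::Transferring;
    for _ in 0..=MAX_ATTEMPTS {
        s = on_failure(s);
    }
    assert_eq!(s, TransferState::Failed);
    println!("final state: {:?}", s);
}
```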
Component Responsibilities
| Component | Location | Responsibility |
|---|---|---|
| File Scanner | Source | Discover files, detect changes, feed chunker |
| CDC Chunker | Source | Split files into content-defined variable-size chunks |
| Dedup Index | Both | Track which chunks exist at each node (Bloom filter + RocksDB) |
| Transfer Queue | Source | Priority queue of chunks to send, backed by WAL |
| Transport Pool | Both | 8 gRPC connections, 16 streams each, TLS 1.3, flow control |
| Integrity Verifier | Destination | CRC32C, SHA-256, Merkle tree verification |
| Reassembler | Destination | Reconstruct files from chunks using manifest |
| WAL | Both | Crash-safe state tracking (append-only, fsync'd) |
| Coordinator | Both | State machine management, retry logic, job lifecycle |
| Metrics Exporter | Both | Prometheus metrics, OpenTelemetry traces |
Content-Defined Chunking
Fixed-size chunking (e.g., splitting every 4 MB) breaks catastrophically on insertions. Consider a 1 TB file where a 50 KB row is inserted near the beginning:
- Fixed chunks: Every chunk boundary after the insertion shifts by 50 KB. Every chunk has a different SHA-256 hash. You must re-transfer the entire 1 TB file.
- CDC chunks: Boundaries are data-dependent via rolling hash. An insertion only affects 2–3 chunks near the edit point. All other boundaries remain stable.
CDC reduces re-transfer volume by 90–99% compared to fixed-size chunking.
Algorithm: Buzhash
We use Buzhash (bitwise rolling hash) rather than Rabin fingerprinting:
| Property | Buzhash | Rabin |
|---|---|---|
| Speed | ~3.5 GB/s | ~2.8 GB/s |
| Distribution Quality | Good | Excellent |
| Implementation | XOR + rotate (simple) | Polynomial arithmetic (moderate) |
We chose Buzhash because chunking is CPU-bound and runs inline with I/O. The 25% speed advantage keeps pace with NVMe read speeds (~3.5 GB/s). Rabin's better distribution doesn't matter because we rely on SHA-256 for content addressing, not the rolling hash.
Chunk Size Parameters
| Parameter | Value | Rationale |
|---|---|---|
| Minimum Chunk | 256 KB | Prevents pathological tiny chunks from repetitive data patterns |
| Average Chunk | 4 MB | Sweet spot: 2.5M chunks per 10 TB fits in ~640 MB RAM index |
| Maximum Chunk | 16 MB | Caps worst-case for uniform data (zeros, compressed streams) |
| Window Size | 48 bytes | Sliding window for rolling hash computation |
The 4 MB average is a three-way tension: smaller chunks (1 MB) give better dedup granularity but produce 10M chunks per 10 TB, requiring ~2.5 GB for the index. Larger chunks (16 MB) reduce index size but miss small edits. 4 MB is tunable per-environment.
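A minimal Buzhash CDC sketch using the parameters above (48-byte window; 256 KB minimum, 4 MB average via a 22-bit boundary mask, 16 MB maximum). The derived byte table and the test data are illustrative; a real implementation ships a fixed precomputed table so boundaries stay stable across versions.

```rust
const WIN: usize = 48;                     // rolling-hash window
const MIN_CHUNK: usize = 256 * 1024;       // 256 KB floor
const MAX_CHUNK: usize = 16 * 1024 * 1024; // 16 MB ceiling
const MASK: u32 = (1 << 22) - 1;           // 22-bit mask => ~4 MB average chunk

// Per-byte random table, derived deterministically here for the sketch.
fn byte_table() -> [u32; 256] {
    let mut t = [0u32; 256];
    let mut x: u32 = 0x9E37_79B9;
    for e in t.iter_mut() {
        x ^= x << 13; x ^= x >> 17; x ^= x << 5; // xorshift32
        *e = x;
    }
    t
}

/// Return chunk end offsets for `data` using Buzhash (XOR + rotate).
fn cdc_boundaries(data: &[u8], t: &[u32; 256]) -> Vec<usize> {
    let mut cuts = Vec::new();
    let mut start = 0;
    while start < data.len() {
        let end = (start + MAX_CHUNK).min(data.len());
        let mut cut = end; // fall back to a max-size (or EOF) cut
        if end - start > MIN_CHUNK + WIN {
            // Seed the hash over the window ending at start + MIN_CHUNK.
            let mut h: u32 = 0;
            let base = start + MIN_CHUNK - WIN;
            for i in 0..WIN {
                h = h.rotate_left(1) ^ t[data[base + i] as usize];
            }
            let mut i = start + MIN_CHUNK;
            loop {
                if h & MASK == 0 { cut = i; break; } // data-dependent boundary
                if i >= end { break; }
                // Slide by one byte: expire data[i - WIN], admit data[i].
                h = h.rotate_left(1)
                    ^ t[data[i - WIN] as usize].rotate_left(WIN as u32)
                    ^ t[data[i] as usize];
                i += 1;
            }
        }
        cuts.push(cut);
        start = cut;
    }
    cuts
}

fn main() {
    let t = byte_table();
    // Pseudorandom data so boundaries actually appear.
    let mut x = 1u32;
    let data: Vec<u8> = (0..(32 << 20))
        .map(|_| { x ^= x << 13; x ^= x >> 17; x ^= x << 5; x as u8 })
        .collect();
    println!("chunk ends: {:?}", cdc_boundaries(&data, &t));
}
```

Note how the minimum is enforced for free: the scan simply starts MIN_CHUNK bytes past the previous cut, and the maximum is the fallback when no boundary fires.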
The Chunk Manifest
```rust
struct ChunkMeta {
    chunk_index: u64,
    file_path: String,
    offset: u64,
    length: u32,
    sha256: [u8; 32],
}
```

The manifest records each chunk's position (byte offset within the file), size, and content hash. It's deterministic: given the same file content and CDC parameters, the same boundaries are always produced. For multi-file transfers, chunk indices are global across all files. The sender can transmit chunks in any order; the receiver uses the manifest to place each chunk correctly.
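A sketch of that placement on the receiver side: chunks arriving in any order are written at their manifest offsets. The SHA-256 check happens before this point in the real pipeline, and the file handle would be cached rather than reopened per chunk.

```rust
use std::fs::OpenOptions;
use std::io::{Seek, SeekFrom, Write};

#[allow(dead_code)]
struct ChunkMeta {
    chunk_index: u64,
    file_path: String,
    offset: u64,
    length: u32,
    sha256: [u8; 32],
}

// Write one verified chunk at its manifest offset; arrival order is irrelevant.
fn place_chunk(meta: &ChunkMeta, data: &[u8]) -> std::io::Result<()> {
    assert_eq!(data.len() as u32, meta.length);
    let mut f = OpenOptions::new().create(true).write(true).open(&meta.file_path)?;
    f.seek(SeekFrom::Start(meta.offset))?; // file stays sparse until earlier chunks land
    f.write_all(data)
}

fn main() -> std::io::Result<()> {
    let meta = ChunkMeta {
        chunk_index: 1,
        file_path: "out.bin".into(),
        offset: 4,
        length: 4,
        sha256: [0; 32],
    };
    place_chunk(&meta, b"data") // chunk 0 can arrive later and fill bytes 0..4
}
```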
Deduplication Index
Each node tracks which chunk hashes it already holds (see Component Responsibilities): an in-memory Bloom filter answers "definitely not present" without touching disk, and RocksDB holds the authoritative hash-to-location mapping. During dedup negotiation the sender offers the manifest's hashes and the receiver replies with the subset it is missing; only those chunks cross the wire.
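A sketch of the lookup path, with a minimal hand-rolled Bloom filter standing in for a library implementation; the RocksDB confirmation step is elided.

```rust
struct Bloom { bits: Vec<u64>, k: u32 }

impl Bloom {
    fn new(nbits: usize, k: u32) -> Self {
        Bloom { bits: vec![0; (nbits + 63) / 64], k }
    }
    // Derive the i-th probe from the chunk's SHA-256 itself (double hashing).
    fn probe(&self, h: &[u8; 32], i: u32) -> usize {
        let h1 = u64::from_le_bytes(h[0..8].try_into().unwrap());
        let h2 = u64::from_le_bytes(h[8..16].try_into().unwrap()) | 1;
        h1.wrapping_add((i as u64).wrapping_mul(h2)) as usize % (self.bits.len() * 64)
    }
    fn insert(&mut self, h: &[u8; 32]) {
        for i in 0..self.k {
            let b = self.probe(h, i);
            self.bits[b / 64] |= 1 << (b % 64);
        }
    }
    // false => definitely absent: skip the disk lookup, send the chunk.
    // true  => probably present: confirm against RocksDB before deduping.
    fn maybe_contains(&self, h: &[u8; 32]) -> bool {
        (0..self.k).all(|i| {
            let b = self.probe(h, i);
            self.bits[b / 64] & (1 << (b % 64)) != 0
        })
    }
}

fn main() {
    let mut index = Bloom::new(8 * 1024 * 1024, 7); // ~1 MB of bits, 7 probes
    let chunk_hash = [0xAB_u8; 32];
    index.insert(&chunk_hash);
    assert!(index.maybe_contains(&chunk_hash));
    println!("unknown hash hit? {}", index.maybe_contains(&[0x01_u8; 32]));
}
```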
Transport Layer
Protocol Stack: gRPC over HTTP/2 over TCP
Why 8 Connections and 16 Streams?
A lost packet stalls only the TCP connection it belongs to, so the connection count bounds the blast radius of head-of-line blocking: with 8 connections, a loss stalls at most 1/8 of in-flight data. Streams within a connection are cheap, so 16 streams per connection keep enough chunks in flight to fill the pipe without the kernel state and slower congestion-control convergence of 128 separate connections (see the tradeoffs table).
What is Multiplexing?
Without multiplexing, sending 16 chunks simultaneously requires 16 TCP connections. With multiplexing, one TCP connection carries all 16 by interleaving. HTTP/2 chops each chunk's data into small frames (16 KB default), stamps each frame with a stream ID, and interleaves them on the wire. The receiver reads frames, sorts by stream ID, and reassembles each stream independently. This is purely a framing-layer concern — invisible to application code.
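A toy illustration of that framing: 16 KB frames tagged with a stream ID, interleaved on one byte stream and sorted back out on the far side. Real HTTP/2 framing carries more header fields plus flow-control rules.

```rust
use std::collections::HashMap;

const FRAME_MAX: usize = 16 * 1024; // HTTP/2 default max frame size

struct Frame { stream_id: u32, payload: Vec<u8> }

// Sender: round-robin 16 KB frames from each in-progress chunk onto one wire.
fn interleave(chunks: Vec<(u32, Vec<u8>)>) -> Vec<Frame> {
    let mut out = Vec::new();
    let mut cursors: Vec<(u32, Vec<u8>, usize)> =
        chunks.into_iter().map(|(id, data)| (id, data, 0)).collect();
    let mut progress = true;
    while progress {
        progress = false;
        for (id, data, pos) in cursors.iter_mut() {
            if *pos < data.len() {
                let end = (*pos + FRAME_MAX).min(data.len());
                out.push(Frame { stream_id: *id, payload: data[*pos..end].to_vec() });
                *pos = end;
                progress = true;
            }
        }
    }
    out
}

// Receiver: sort frames by stream ID and reassemble each stream independently.
fn demux(frames: Vec<Frame>) -> HashMap<u32, Vec<u8>> {
    let mut streams: HashMap<u32, Vec<u8>> = HashMap::new();
    for f in frames {
        streams.entry(f.stream_id).or_default().extend_from_slice(&f.payload);
    }
    streams
}

fn main() {
    let a = vec![0xAA; 40 * 1024]; // two "chunks" on streams 1 and 2
    let b = vec![0xBB; 40 * 1024];
    let streams = demux(interleave(vec![(1, a.clone()), (2, b.clone())]));
    assert_eq!(streams[&1], a);
    assert_eq!(streams[&2], b);
}
```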
Why TCP and Not UDP?
TCP operates at the packet level (1.5 KB). Our application operates at the chunk level (4 MB). A 4 MB chunk is ~2,700 packets. If one packet is lost:
- With TCP: Kernel retransmits just 1.5 KB. Application receives a complete 4 MB chunk. Cost: 1.5 KB re-sent, ~50 ms delay.
- With UDP (chunk-level retry): SHA-256 verification fails. Discard all 4 MB and re-request. Cost: 4 MB re-sent — a 2,700× amplification. At 0.1% loss, ~70% of all data is re-transmitted.
- With UDP (packet-level reliability): Track individual packets, detect losses, retransmit. But this requires sequence numbers, retransmit timers, duplicate detection, congestion control — you've reimplemented TCP.
TCP's ordering guarantee is a small tax (occasional HOL blocking). Its packet-level reliability saves us from 2,700× retransmit amplification.
QUIC Fallback
QUIC runs over UDP but builds its own per-stream reliability, eliminating TCP's head-of-line blocking. Used as a fallback when measured packet loss exceeds 1% (typically WAN/internet paths). Not the default because: (1) NIC hardware offloading only works for TCP; (2) BBR over TCP is more mature; (3) kernel bypass (io_uring) integrates better with TCP sockets.
Adaptive Compression
| Algorithm | Ratio | Speed | When Used |
|---|---|---|---|
| LZ4 | 2.1× | 4.5 GB/s | CPU bottleneck (fast network, slow CPU) |
| Zstd level 3 | 3.2× | 1.2 GB/s | Balanced default |
| Zstd level 9 | 3.8× | 200 MB/s | Network bottleneck (slow WAN link) |
| None | 1.0× | ∞ | Already compressed/encrypted data (entropy >7.5 bits/byte) |
The first 64 KB of each chunk is sampled to estimate entropy. High-entropy data is sent uncompressed. Algorithm choice between LZ4/Zstd is based on the ratio of measured network throughput to available CPU cycles.
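A sketch of that decision, assuming the 7.5 bits/byte threshold from the table. The throughput inputs and the 100 MB/s WAN cutoff are illustrative placeholders for the real measurements.

```rust
// Shannon entropy (bits/byte) of a sample: 8.0 means incompressible noise.
fn entropy_bits_per_byte(sample: &[u8]) -> f64 {
    let mut hist = [0u64; 256];
    for &b in sample { hist[b as usize] += 1; }
    let n = sample.len() as f64;
    hist.iter().filter(|&&c| c > 0)
        .map(|&c| { let p = c as f64 / n; -p * p.log2() })
        .sum()
}

// Pick a codec per the table: skip compression for high-entropy data, otherwise
// trade ratio against CPU based on measured throughputs (both in bytes/sec).
fn choose_codec(chunk: &[u8], net_bytes_per_sec: f64, cpu_compress_bytes_per_sec: f64) -> &'static str {
    let sample = &chunk[..chunk.len().min(64 * 1024)];
    if entropy_bits_per_byte(sample) > 7.5 {
        "none"     // already compressed/encrypted
    } else if cpu_compress_bytes_per_sec < net_bytes_per_sec {
        "lz4"      // CPU is the bottleneck: use the cheapest codec
    } else if net_bytes_per_sec < 100e6 {
        "zstd-9"   // slow WAN (illustrative cutoff): spend CPU to shrink bytes
    } else {
        "zstd-3"   // balanced default
    }
}

fn main() {
    let noise: Vec<u8> = (0..64 * 1024u32)
        .map(|i| (i.wrapping_mul(2654435761) >> 13) as u8)
        .collect();
    let zeros = vec![0u8; 64 * 1024];
    // 10 Gbps link = 1.25e9 B/s; Zstd-3 at ~1.2e9 B/s (from the table above).
    println!("{}", choose_codec(&noise, 1.25e9, 1.2e9)); // likely "none": high entropy
    println!("{}", choose_codec(&zeros, 1.25e9, 1.2e9)); // "lz4": CPU-bound on a fast net
}
```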
Zero-Copy I/O & Kernel Optimizations
A conventional read()/write() loop copies each chunk twice between kernel and user space. The hot path avoids this with sendfile(), batched asynchronous I/O via io_uring (kernel 5.6+), and huge pages (the "2× memory copies" row in the table below).
Why a Single TCP Connection Doesn't Fill the Pipe
A 10 Gbps NIC can push 1.25 GB/s onto the wire. A single “normal” TCP connection achieves a fraction of this. Five independent bottlenecks compound, summarized in the table below.
Combined Effect
| Bottleneck | Effect on 10 Gbps | Mitigation |
|---|---|---|
| Default 4 MB buffers | 640 Mbps (6%) | 128 MB per-socket buffers |
| CUBIC cwnd ramp | ~7 Gbps (70%) | BBR congestion control |
| HOL blocking (0.1% loss) | ~4 Gbps (40%) | 8 independent TCP connections |
| Single-threaded receiver | ~2.8 Gbps (28%) | 64 parallel processing workers |
| 2× memory copies | ~2.2 Gbps (22%) | sendfile() + io_uring + huge pages |
| All mitigations applied | ~8 Gbps (80%) | Remaining 20%: TLS, checksums, fsync |
Data Integrity
Three layers of verification, each catching a different class of corruption at a different point in the pipeline: CRC32C on each frame catches wire errors cheaply, per-chunk SHA-256 verifies the decompressed content end-to-end, and a Merkle tree over all chunk hashes proves the dataset is complete and correctly ordered.
SHA-256 alone is technically sufficient. CRC32C adds value because it is 20× faster and catches wire errors before wasting CPU on decompression and SHA-256. The Merkle tree adds value because per-chunk SHA-256 cannot detect missing or misordered chunks. A single Merkle root comparison proves the entire dataset is correct; without it, you'd compare 2.5 million hashes individually.
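A sketch of the Merkle layer over the per-chunk hashes, assuming the sha2 crate as a dependency; promoting odd nodes unchanged is one of several valid conventions and an assumption here.

```rust
use sha2::{Digest, Sha256};

// Fold a level of hashes pairwise until one root remains.
fn merkle_root(mut level: Vec<[u8; 32]>) -> [u8; 32] {
    assert!(!level.is_empty());
    while level.len() > 1 {
        level = level
            .chunks(2)
            .map(|pair| {
                if pair.len() == 2 {
                    let mut h = Sha256::new();
                    h.update(pair[0]);
                    h.update(pair[1]);
                    h.finalize().into()
                } else {
                    pair[0] // odd node promoted unchanged
                }
            })
            .collect();
    }
    level[0]
}

fn main() {
    // Leaves are the per-chunk SHA-256 values from the manifest, in order.
    let leaves: Vec<[u8; 32]> = (0u8..5).map(|i| Sha256::digest([i]).into()).collect();
    println!("root: {:02x?}", merkle_root(leaves));
}
```

Because leaves enter in manifest order, a missing or swapped chunk changes the root even when every individual chunk hash is valid.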
Reliability & Fault Tolerance
Write-Ahead Log (WAL)
The WAL is an append-only file on disk — not a service, not a database, not a distributed system. It stores a sequence of binary records, each describing a state transition:
| Record Type | Meaning | Fields |
|---|---|---|
| CHUNK_QUEUED | Chunk added to send queue | transfer_id, chunk_index, sha256 |
| CHUNK_SENT | Chunk data written to socket | transfer_id, chunk_index |
| CHUNK_ACKED | Receiver confirmed receipt | transfer_id, chunk_index, ack_status |
| CHUNK_COMMITTED | Written to final location + fsync'd | transfer_id, chunk_index |
| STATE_CHANGE | Transfer phase transition | transfer_id, old_state, new_state |
| CHECKPOINT | Snapshot of all in-flight state | full state map |
Why a WAL and Not a Database?
The WAL is append-only: every write is sequential I/O at the end of the file — the fastest possible disk operation. NVMe SSDs can do 500K+ sequential small writes per second. A database doing random I/O, maintaining B-tree indexes, and journaling is 10–50× slower for this access pattern. With 64 parallel workers recording state, a database's row-level locking creates contention. An append-only file has zero lock contention.
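A sketch of the append path. The record layout is simplified (the real format would include CRCs), and the handle is opened once in practice, not per record.

```rust
use std::fs::OpenOptions;
use std::io::Write;

// Append one length-prefixed record and fsync. A state transition is durable
// only after sync_data() returns; recovery replays these records in order.
fn wal_append(path: &str, rec_type: u8, payload: &[u8]) -> std::io::Result<()> {
    let mut wal = OpenOptions::new().create(true).append(true).open(path)?;
    wal.write_all(&[rec_type])?;
    wal.write_all(&(payload.len() as u32).to_le_bytes())?;
    wal.write_all(payload)?;
    wal.sync_data()
}

fn main() -> std::io::Result<()> {
    const CHUNK_ACKED: u8 = 3; // record-type tag values are illustrative
    wal_append("transfer.wal", CHUNK_ACKED, b"transfer-42:chunk-1337")
}
```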
Crash Recovery Scenarios
- Sender crash (e.g., SIGKILL): on restart, replay the WAL from the last CHECKPOINT; chunks recorded SENT but never ACKED are re-queued, and the transfer resumes within the 5-second target.
- Receiver crash mid-write: chunks without a CHUNK_COMMITTED record are discarded on restart and re-requested as gaps.
- Network partition: reconnect with exponential backoff, then delta-sync so only un-ACKED chunks are re-sent.
Retry Strategy
```
delay = min(base × 2^attempt + random(0, jitter), max_delay)
base: 100ms, jitter: 50ms, max: 30s, max_attempts: 5
```

Per-connection retries: if a gRPC stream breaks, reconnect with fresh TLS handshake. All in-flight chunks on that stream are re-queued.
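The formula transcribed directly; the jitter source here is an illustrative stand-in for a real RNG.

```rust
use std::time::{Duration, SystemTime, UNIX_EPOCH};

fn retry_delay(attempt: u32) -> Duration {
    let (base_ms, jitter_ms, max_ms) = (100u64, 50u64, 30_000u64);
    // Illustrative jitter source; real code would use a proper RNG.
    let nanos = SystemTime::now().duration_since(UNIX_EPOCH).unwrap().subsec_nanos();
    let jitter = nanos as u64 % jitter_ms;
    let exp = base_ms.saturating_mul(1u64 << attempt.min(20)); // base × 2^attempt, overflow-safe
    Duration::from_millis(exp.saturating_add(jitter).min(max_ms))
}

fn main() {
    for attempt in 0..5 {
        // ~100ms, ~200ms, ~400ms, ~800ms, ~1.6s (plus jitter, capped at 30s)
        println!("attempt {attempt}: {:?}", retry_delay(attempt));
    }
}
```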
Chaos Engineering Tests
| Test | Mechanism | Validates |
|---|---|---|
| Kill sender mid-transfer | SIGKILL + restart | WAL replay, resume from checkpoint |
| Kill receiver mid-write | SIGKILL + restart | Partial chunk cleanup, gap filling |
| Network partition (60s) | iptables DROP | Reconnect + delta sync |
| Flip random bits on wire | tc netem corrupt 1% | CRC32C + SHA-256 catch all corruption |
| Disk full on receiver | fallocate fill disk | Backpressure, THROTTLE ack, retry |
| Slow disk (10ms latency) | dm-delay | I/O queue depth adaptation |
| OOM pressure | cgroups memory limit | Graceful degradation, fewer streams |
Security
Transport Security
All data in transit uses TLS 1.3 with the TLS_AES_256_GCM_SHA384 and TLS_CHACHA20_POLY1305_SHA256 cipher suites. Mutual TLS (mTLS) requires both sender and receiver to present X.509 certificates signed by the internal CA. Certificates rotate every 90 days.
Encryption at Rest — Key Hierarchy
Rather than one volume-wide key (LUKS), each transfer is encrypted under its own data encryption key (DEK):
- Per-transfer DEKs limit blast radius of key compromise to one transfer, not all data
- Destroying a DEK cryptographically erases that transfer instantly, without re-encrypting the volume
- Chunks can live on any storage backend (local disk, object store, archive) carrying their own encryption
When LUKS is acceptable: single-tenant systems where per-transfer key isolation isn't required.
Access Control & Audit
RBAC with OPA policy engine evaluates every transfer request against policies: allowed source/dest pairs, max transfer size, time-of-day windows, rate limits per-user. Immutable audit log records every API call, state transition, and access with who, what, when, from_ip, transfer_id, result. Retained for 2 years.
Observability
Key Metrics (Prometheus)
| Metric | Type | Description |
|---|---|---|
| transfer_bytes_total | Counter | Total bytes transferred by direction and compression type |
| transfer_chunks_total | Counter | Chunks sent, received, deduped, or failed |
| transfer_throughput_bytes | Gauge | Current throughput per connection |
| chunk_latency_seconds | Histogram | End-to-end chunk delivery time (p50/p95/p99) |
| wal_entries_pending | Gauge | Unprocessed WAL entries (backlog indicator) |
| dedup_ratio | Gauge | Fraction of chunks skipped via deduplication |
| integrity_failures_total | Counter | CRC or SHA mismatches (should be ~0) |
| connection_resets_total | Counter | gRPC stream resets (network health indicator) |
Distributed Tracing
Every transfer gets a root span. Child spans for each phase: scan_files, chunk_file, negotiate_dedup, transfer_chunks (with per-chunk sub-spans for compress/encrypt/send), verify_merkle, commit. Sampling: 100% for failed transfers, 10% for successful (adaptive based on volume).
Alerting Rules
| Alert | Condition | Severity |
|---|---|---|
| Throughput Drop | <50% of baseline for 5 min | WARN |
| Integrity Failure | Any checksum mismatch | CRITICAL |
| Transfer Stalled | No progress for 2 min | WARN |
| WAL Backlog | >10K pending entries | WARN |
| Connection Failures | >5 resets in 1 min | CRITICAL |
| Disk Space Low | <10% free on receiver | CRITICAL |
| Certificate Expiry | <7 days to expiration | WARN |
Design Tradeoffs Summary
Every major decision involved rejecting viable alternatives:
| Decision | Chosen | Runner-Up | Key Tradeoff |
|---|---|---|---|
| Transport | gRPC streaming | Raw TCP | 5–8% overhead vs. months of protocol engineering |
| Chunking | CDC (Buzhash) | Fixed-size | Code complexity vs. 90–99% savings on incremental |
| Chunk Size | 4 MB average | 1 MB / 16 MB | RAM vs. dedup granularity vs. round-trips |
| Integrity | 3-layer | SHA-256 only | Complexity vs. early detection + failure localization |
| Reliability | WAL | Database | Sequential I/O speed vs. query flexibility |
| Dedup | Chunk-level | File-level / None | CPU+RAM overhead vs. 40–80% bandwidth savings |
| Compression | Adaptive | Fixed Zstd-3 | Decision logic vs. optimal per-environment |
| Flow Control | Receiver-driven | Sender rate-limit | ACK overhead vs. perfect receiver knowledge |
| Encryption | Chunk-level AES | Volume LUKS | Code complexity vs. key granularity |
| Coordination | etcd + RocksDB | Postgres | Operational overhead vs. multi-node needs |
| Connections | 8 TCP × 16 streams | 1 TCP / 128 TCP | HOL blast radius vs. connection overhead |
| Primary Protocol | TCP + QUIC fallback | QUIC-only | NIC offload + maturity vs. loss handling |
Operations & Capacity Planning
Kernel Tuning
```bash
# Socket buffers
net.core.rmem_max = 134217728   # 128 MB receive buffer max
net.core.wmem_max = 134217728   # 128 MB send buffer max
net.ipv4.tcp_rmem = 4096 1048576 134217728
net.ipv4.tcp_wmem = 4096 1048576 134217728

# Congestion control
net.ipv4.tcp_congestion_control = bbr

# Network queue
net.core.netdev_max_backlog = 250000

# VM tuning
vm.dirty_ratio = 40
vm.swappiness = 10

# File descriptors
ulimit -n 1048576   # 1M open fds
```

Deployment Modes
Hardware Recommendations
| Component | Specification | Rationale |
|---|---|---|
| CPU | 16+ cores (AMD EPYC / Xeon) | SHA-256, Zstd, CDC hashing are CPU-bound |
| RAM | 64 GB+ | Chunk buffers (64×16MB=1GB), dedup index, RocksDB block cache |
| Network | 25 Gbps+ NIC (dual-port for HA) | 10TB @ 25Gbps = ~53 minutes |
| Storage | NVMe SSD, 2× transfer size | Chunk store + WAL. 3+ GB/s sequential writes |
| NIC Offload | TSO, GRO, TCP checksum offload | Reduces CPU load for TCP processing |
Transfer Time Estimates
| Data Size | 10 Gbps | 25 Gbps | 100 Gbps |
|---|---|---|---|
| 100 GB | ~1.3 min | ~32 sec | ~8 sec |
| 1 TB | ~13 min | ~5.3 min | ~1.3 min |
| 10 TB | ~2.2 hrs | ~53 min | ~13 min |
| 50 TB | ~11 hrs | ~4.4 hrs | ~66 min |
Times are computed at the nominal link rate (data size ÷ link speed), with 0% dedup and no compression credit. At the 80% utilization target, add roughly 25%; Zstd level 3 (3.2× on compressible data) and dedup on incremental transfers reduce actual times substantially.
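How the rows are derived, as a quick check against two cells of the table:

```rust
// time = bytes × 8 / link rate; matches the table at nominal link speed.
fn transfer_secs(bytes: f64, link_gbps: f64) -> f64 {
    bytes * 8.0 / (link_gbps * 1e9)
}

fn main() {
    let tb = 1e12;
    println!("10 TB @ 25 Gbps:  {:.0} min", transfer_secs(10.0 * tb, 25.0) / 60.0); // ~53 min
    println!("50 TB @ 100 Gbps: {:.0} min", transfer_secs(50.0 * tb, 100.0) / 60.0); // ~67 min
}
```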
When This System is Wrong
- Sub-100 GB, one-time transfer: Use rsync or scp. Engineering investment unjustified.
- Streaming/real-time data: Use Kafka, Pulsar, or Kinesis. This is a batch system.
- Cloud-to-cloud with managed services: Use AWS DataSync, GCP Transfer Service, or Azure Data Box.
- Cross-region over public internet: Consider Aspera, Signiant with proprietary WAN optimization.
- Immutable write-once data: Simple PUT to object storage with file-level checksums. CDC overhead adds no value.
Conclusion
Meridian is designed for organizations that need to move terabytes of data between nodes reliably, efficiently, and securely. Every design choice reflects a conscious tradeoff — gRPC over raw TCP for engineering velocity, CDC over fixed-size for incremental efficiency, WAL over database for I/O speed, multiple TCP connections over one for fault isolation.
The system is not the right tool for every job. For small, one-time transfers, rsync is simpler. For real-time streaming, Kafka is better. For cloud-managed environments, first-party transfer services may suffice. Meridian's value proposition is the combination of TB-scale throughput, zero-data-loss integrity, crash-safe resumability, and production-grade observability — all in a single, self-contained system.