Pidgin -- Runtime-learned compression that beats Protobuf

Learns schemas at runtime from sample JSON, strips keys entirely, encodes values optimally. No .proto files, no codegen, zero config.

pip install pidgin-codec
Python 3.10+ / MIT License / C extension auto-detected

Benchmarks

Compression ratio: % of original JSON size (lower is better). Verified on identical data, single-threaded.

Dataset                gzip     brotli   Proto+zstd   Pidgin
Users x1000            21.4%    19.5%    20.3%        18.9%
Orders x500 (nested)   19.1%    17.5%    17.4%        16.1%
Events x5000           13.6%    12.3%    12.3%        10.8%

Speed (ms, with C extension -- lower is better)

Dataset        brotli   zstd     Proto+zstd   Pidgin
Users x1000    7.78     3.67     5.76         6.41
Orders x500    10.52    6.41     6.97         7.96
Events x5000   30.49    14.85    24.92        25.62

Best compression across all datasets. On speed, Pidgin trails zstd and Proto+zstd slightly but is faster than brotli on all three datasets.

Features

Two layers, each useful independently.

core

SchemaCodec

Runtime-learned binary compression. Feed it sample JSON, it infers the schema -- field names, types, enum domains, nesting. Then it strips keys entirely and encodes values with type-optimal encoders. C extension for native speed.

optional

RatchetCipher

Forward-secrecy encryption for persistent channels (WebSocket, SSE, IoT). Keys ratchet after every message -- compromising a future key cannot decrypt past messages. Install with pip install pidgin-codec[crypto].
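The forward-secrecy property can be illustrated with a toy hash-chain ratchet built only on the standard library. This is a sketch of the concept, not pidgin's actual cipher: the XOR keystream here is for demonstration, and ToyRatchetCipher is a hypothetical name.

```python
import hashlib
import hmac

def ratchet(chain_key: bytes) -> tuple[bytes, bytes]:
    """Derive (message_key, next_chain_key) from the current chain key.

    The chain key only moves forward; recovering a past message key from a
    later chain key would require inverting HMAC-SHA256.
    """
    message_key = hmac.new(chain_key, b"msg", hashlib.sha256).digest()
    next_chain = hmac.new(chain_key, b"chain", hashlib.sha256).digest()
    return message_key, next_chain

def xor_stream(key: bytes, data: bytes) -> bytes:
    """Toy symmetric cipher: XOR data with SHA-256-expanded key blocks."""
    out = bytearray()
    counter = 0
    while len(out) < len(data):
        out.extend(hashlib.sha256(key + counter.to_bytes(4, "big")).digest())
        counter += 1
    return bytes(b ^ k for b, k in zip(data, out))

class ToyRatchetCipher:
    def __init__(self, shared_secret: bytes):
        self.chain_key = hashlib.sha256(shared_secret).digest()

    def encrypt(self, data: bytes) -> bytes:
        # Every call advances the chain: one fresh key per message.
        message_key, self.chain_key = ratchet(self.chain_key)
        return xor_stream(message_key, data)

    decrypt = encrypt  # XOR is its own inverse; both ends ratchet in lockstep

alice = ToyRatchetCipher(b"shared secret")
bob = ToyRatchetCipher(b"shared secret")
ct1 = alice.encrypt(b"hello")
ct2 = alice.encrypt(b"hello")       # same plaintext, different key, different bytes
assert bob.decrypt(ct1) == b"hello"
assert bob.decrypt(ct2) == b"hello"
assert ct1 != ct2
```

Both sides must process messages in the same order, since each call consumes one step of the shared key chain.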

SchemaCodec

from pidgin import SchemaCodec

# Sample records to learn from
records = [
    {"id": 1, "email": "a@example.com", "plan": "free"},
    {"id": 2, "email": "b@example.com", "plan": "pro"},
]

# Learn schema from sample records
codec = SchemaCodec.learn(records)

# Compress: ~10-19% of original JSON on realistic payloads
compressed = codec.compress(records)

# Decompress: lossless round-trip
original = codec.decompress(compressed)
assert original == records

Schema Evolution

# v1 schema
codec_v1 = SchemaCodec.learn(users_v1)

# Data gains new fields, types change
profile_v2 = codec_v1.profile.evolve(users_v2)
codec_v2 = SchemaCodec(profile_v2)

# v2 handles both old and new data shapes
codec_v2.compress(old_data)  # missing new fields → absent marker
codec_v2.compress(new_data)  # new fields encoded, removed fields skipped

How It Works

Two-stage pipeline. Each stage is independent and composable.

JSON Input
Raw structured data (list of dicts)
|
SchemaCodec
Learn schema from samples. Strip all keys.
Encode values by inferred type (int, float, str, bool, enum, nested).
Apply brotli on top. Output: compact binary.
|
RatchetCipher (optional)
Forward-secrecy encryption for persistent channels.
Key ratchets after every message.
Past messages safe even if future key leaks.
|
Wire Output
Compressed, optionally encrypted binary
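The first stage of the pipeline can be sketched with the standard library alone. Here zlib stands in for brotli, and a sorted field list plays the role of the learned profile; all names are illustrative, not pidgin's internals.

```python
import json
import zlib

records = [{"id": i, "name": f"user{i}", "active": i % 2 == 0} for i in range(100)]

# Stage 1: "learn" the schema once (a shared key list), then strip keys and
# emit each record as a value row in fixed field order.
fields = sorted(records[0])
rows = [[r[f] for f in fields] for r in records]

# Stage 2: generic compression on top (zlib standing in for brotli).
payload = zlib.compress(json.dumps(rows).encode())

# Round-trip: decompress, then zip the shared field list back onto each row.
restored = [dict(zip(fields, row)) for row in json.loads(zlib.decompress(payload))]
assert restored == records  # lossless

baseline = zlib.compress(json.dumps(records).encode())
print(f"keys kept: {len(baseline)} B, keys stripped: {len(payload)} B")
```

The real codec replaces the JSON value rows with type-specific binary encoders, but the division of labor is the same: structure removed first, byte-level redundancy removed second.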

Key insight: why Pidgin beats generic compressors

Generic compressors (gzip, brotli, zstd) treat data as opaque byte streams. They find repeated byte patterns but cannot exploit structural knowledge. In a JSON array of 1000 objects, the key "email" appears 1000 times -- generic compressors reduce this to a few bytes via backreferences, but Pidgin eliminates it entirely. It knows the schema, so the key is encoded once in the profile header, and each record contains only values in a fixed order. Combined with type-specific encoding (varints for integers, enum indices for categorical strings, nested sub-profiles for objects), this produces consistently smaller output.
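Two of the type-specific encoders mentioned above are easy to sketch without assuming anything about pidgin's actual wire format: base-128 varints for integers and single-byte enum indices for categorical strings.

```python
def encode_varint(n: int) -> bytes:
    """LEB128-style varint: 7 data bits per byte, high bit = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(data: bytes) -> int:
    n = shift = 0
    for byte in data:
        n |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return n

assert encode_varint(300) == b"\xac\x02"   # 2 bytes instead of a fixed 8-byte int
assert decode_varint(encode_varint(12345)) == 12345

# Enum index: once the domain is learned, a categorical string costs 1 byte.
domain = ["active", "suspended", "deleted"]   # learned from samples
assert bytes([domain.index("suspended")]) == b"\x01"
```

Small integers dominate real payloads (IDs, counts, timestamps as deltas), so varints plus one-byte enum indices account for much of the gap over byte-oriented compressors.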

API Reference

Complete public interface. All classes importable from pidgin.

Class / Method                     Description
SchemaCodec.learn(samples)         Learn schema from list of dicts, return codec
codec.compress(data)               Compress dict or list[dict] to binary bytes
codec.decompress(data)             Decompress binary back to dict or list[dict]
codec.profile                      Access learned SchemaProfile
profile.evolve(new_samples)        Evolve schema -- backward + forward compatible
profile.diff(other)                Show changes between profile versions
profile.to_json() / from_json()    Serialize / deserialize profile for sharing
RatchetCipher(shared_secret)       Init encryption with shared secret
cipher.encrypt(data)               Encrypt bytes + ratchet key forward
cipher.decrypt(data)               Decrypt bytes + ratchet key forward
SecureChannel.create(name)         E2E encrypted channel (X25519 + SchemaCodec + Ratchet)

Server Integration

One line to enable. Zero backend changes.

Server     Config                              Integration
Kong       pidgin = true                       Lua FFI plugin
nginx      pidgin on;                          Dynamic C module
Apache     PidginEnable On                     Output filter
Caddy      pidgin                              Go cgo middleware
Traefik    middleware config                   Pure Go plugin
HAProxy    SPOE filter                         External agent
FastAPI    add_middleware(PidginMiddleware)    Python
Django     MIDDLEWARE = [...]                  Python

All modules use libpidgin (portable C library). Auto-learns schemas, auto-compresses, auto-evolves on API changes. Profiles served at /.well-known/pidgin/.

Schema Evolution

API changes handled automatically. Zero downtime.

API Change                    What Happens
New field added               JSON fallback, then auto-evolve incorporates it as typed field
Field removed                 ABSENT marker (1 byte), old clients unaffected
Field returns                 Already in schema as nullable, encodes typed immediately
New enum value                Appended to list (old indices preserved)
Type widened (int to float)   Auto-widened safely
Schema drift detected         Auto-evolve, profile version bumped, new ETag

Old clients with v1 profiles decode v2 data (unknown fields in JSON fallback). New clients with v2 profiles decode v1 data (missing fields as absent). Bidirectional compatibility guaranteed.
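The merge rules above can be sketched in a few lines. infer_schema and evolve here are hypothetical stand-ins for pidgin's internals, covering only field union, int-to-float widening, and a nullable marker for removed fields.

```python
def infer_schema(samples: list[dict]) -> dict:
    """Map each field name to an inferred type name ('int', 'float', 'str', ...)."""
    schema: dict[str, str] = {}
    for record in samples:
        for key, value in record.items():
            t = type(value).__name__
            prev = schema.get(key)
            if prev in (None, t):
                schema[key] = t
            elif {prev, t} == {"int", "float"}:
                schema[key] = "float"        # widen int -> float safely
            else:
                schema[key] = "json"         # conflicting types: generic fallback
    return schema

def evolve(old: dict, new_samples: list[dict]) -> dict:
    """Merge an old schema with one inferred from new samples."""
    new = infer_schema(new_samples)
    merged = {}
    for key in old.keys() | new.keys():
        if key not in new:
            merged[key] = old[key] + "?"     # removed field: keep as nullable
        elif key not in old:
            merged[key] = new[key]           # added field: now typed
        elif {old[key], new[key]} == {"int", "float"}:
            merged[key] = "float"
        elif old[key] == new[key]:
            merged[key] = old[key]
        else:
            merged[key] = "json"
    return merged

v1 = infer_schema([{"id": 1, "score": 2}])
v2 = evolve(v1, [{"id": 1, "score": 2.5, "email": "a@b.example"}])
assert v2 == {"id": "int", "score": "float", "email": "str"}
```

Because the merged schema is a superset of both versions, a v2 profile can decode v1 records (missing fields absent) and a v1 profile can skip fields it never learned, which is the bidirectional guarantee the section describes.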