File Format Specification#

This document describes the barecat file format.

Overview#

A barecat archive consists of:

Shard files - Binary files containing concatenated file data
Index file - SQLite database with metadata

File Naming#

Given a base path myarchive.barecat:

myarchive.barecat                 # SQLite index database
myarchive.barecat-shard-00000     # First data shard
myarchive.barecat-shard-00001     # Second data shard (if needed)
myarchive.barecat-shard-00002     # ...

Shard numbers are zero-padded to 5 digits.

Shard Files#

Shard files are simple binary files containing file data concatenated sequentially:

+----------------+----------------+----------------+-----+
| File 1 data    | File 2 data    | File 3 data    | ... |
+----------------+----------------+----------------+-----+
^                ^                ^
offset=0         offset=1000      offset=2500

There is no header, footer, or delimiter between files. The index stores the offset and size of each file.

When files are deleted, gaps may appear:

+----------------+--------+----------------+-----+
| File 1 data    | [gap]  | File 3 data    | ... |
+----------------+--------+----------------+-----+

Gaps are reclaimed by barecat defrag.

SQLite Index#

The index is a standard SQLite 3 database with the following schema:

Tables#

files - File metadata

CREATE TABLE files (
    path     TEXT NOT NULL,
    parent   TEXT GENERATED ALWAYS AS (
        rtrim(rtrim(path, replace(path, '/', '')), '/')
    ) VIRTUAL NOT NULL REFERENCES dirs(path),
    shard    INTEGER NOT NULL,
    offset   INTEGER NOT NULL,
    size     INTEGER DEFAULT 0,
    crc32c   INTEGER DEFAULT NULL,
    mode     INTEGER DEFAULT NULL,
    uid      INTEGER DEFAULT NULL,
    gid      INTEGER DEFAULT NULL,
    mtime_ns INTEGER DEFAULT NULL
);

path: Full path within the archive (e.g., “dir/subdir/file.txt”)
parent: Computed parent directory path
shard: Shard number (0, 1, 2, …)
offset: Byte offset within the shard
size: File size in bytes
crc32c: CRC32C checksum of file contents
mode: Unix file mode (permissions)
uid, gid: Owner user/group ID
mtime_ns: Modification time in nanoseconds since Unix epoch

dirs - Directory metadata and statistics

CREATE TABLE dirs (
    path           TEXT NOT NULL,
    parent         TEXT GENERATED ALWAYS AS (
        CASE WHEN path = '' THEN NULL
        ELSE rtrim(rtrim(path, replace(path, '/', '')), '/') END
    ) VIRTUAL REFERENCES dirs(path),
    num_subdirs    INTEGER DEFAULT 0,
    num_files      INTEGER DEFAULT 0,
    num_files_tree INTEGER DEFAULT 0,
    size_tree      INTEGER DEFAULT 0,
    mode           INTEGER DEFAULT NULL,
    uid            INTEGER DEFAULT NULL,
    gid            INTEGER DEFAULT NULL,
    mtime_ns       INTEGER DEFAULT NULL
);

num_subdirs: Immediate subdirectory count
num_files: Immediate file count
num_files_tree: Recursive file count
size_tree: Recursive total size

config - Archive configuration

CREATE TABLE config (
    key        TEXT PRIMARY KEY,
    value_text TEXT DEFAULT NULL,
    value_int  INTEGER DEFAULT NULL
) WITHOUT ROWID;

Standard config entries:

use_triggers: 1 if triggers are active
shard_size_limit: Maximum shard size in bytes
schema_version_major: Schema major version (currently 0)
schema_version_minor: Schema minor version (currently 3)

Indexes#

CREATE UNIQUE INDEX idx_files_path ON files(path);
CREATE UNIQUE INDEX idx_dirs_path ON dirs(path);
CREATE INDEX idx_files_parent ON files(parent);
CREATE INDEX idx_dirs_parent ON dirs(parent);
CREATE INDEX idx_files_shard_offset ON files(shard, offset);

Triggers#

The database uses triggers to maintain directory statistics. When a file is added, the parent directory’s counters are automatically updated, propagating up the tree.

Triggers can be disabled for bulk operations via the use_triggers config flag.

Checksum#

CRC32C (Castagnoli) is used for file checksums:

Polynomial: 0x1EDC6F41
Hardware accelerated on modern CPUs
Compatible with Google’s CRC32C implementation

Data Integrity#

Barecat provides:

File checksums - CRC32C for each file
SQLite integrity - ACID transactions, journaling
Verification - barecat verify checks all checksums

It does NOT provide:

Archive-level signatures
Encryption
Compression (files stored as-is)

Compatibility#

The format is designed for simplicity and long-term compatibility:

SQLite - Universally supported, stable format
Shards - Plain binary, no proprietary encoding
Schema versioning - Allows forward-compatible changes

Reading a barecat archive requires:

SQLite library
Ability to read binary files
Understanding of this specification

No special decompression or decryption is needed.

Writing a barecat archive is also straightforward with standard SQLite using the schema.sql file and regular file I/O.

Example: Reading Without Library#

Using standard tools:

# List all files
sqlite3 myarchive.barecat "SELECT path, shard, offset, size FROM files"

# Extract a specific file
sqlite3 myarchive.barecat \
    "SELECT shard, offset, size FROM files WHERE path='dir/file.txt'"
# Returns: 0|1234|5678

# Read the data
dd if=myarchive.barecat-shard-00000 bs=1 skip=1234 count=5678

Using Python without barecat:

import sqlite3

conn = sqlite3.connect('myarchive.barecat')
cursor = conn.execute(
    "SELECT shard, offset, size FROM files WHERE path=?",
    ('dir/file.txt',)
)
shard, offset, size = cursor.fetchone()

with open(f'myarchive.barecat-shard-{shard:05d}', 'rb') as f:
    f.seek(offset)
    data = f.read(size)

Version History#

Schema 0.3 (unreleased)

Fixed trigger bug: num_files no longer incorrectly propagated on directory move/delete (num_files counts direct children only, not recursive)

Schema 0.2 (v0.2.5, January 2025)

First released schema version.

config table with schema versioning (schema_version_major, schema_version_minor)
crc32c column in files for checksums
mode, uid, gid, mtime_ns columns for Unix metadata
dirs table with num_subdirs, num_files, num_files_tree, size_tree
SQLite triggers for automatic stats propagation
config table uses WITHOUT ROWID

Internal development note: During development, an intermediate version briefly used WITHOUT ROWID for all tables (files, dirs, config). This was reverted before release because rowid tables are more space-efficient for this use case. The script upgrade_database2.py exists to convert databases from this intermediate format, but since it was never released, this script is unlikely to be needed.

Pre-versioned format (original, June 2023)

Never formally released. Incompatible with current barecat.

Simple schema without config table or versioning
files: path, parent, shard, offset, size
directories: path, parent, total_size, total_file_count
No checksums or Unix metadata

Upgrading#

Run barecat upgrade <archive> to upgrade an archive to the current schema version. The upgrade process detects the source version automatically.

Pre-versioned → 0.3

Heavy migration that:

Renames old index to .old
Creates new index with current schema
Copies directory and file metadata
Calculates CRC32C checksums for all files (uses --workers)

0.2 → 0.3

Lightweight in-place fix:

Drops buggy del_subdir and move_subdir triggers
Recreates triggers with fixed logic
Rebuilds directory tree statistics to fix any corruption
Updates schema version

This is fast even for large archives since it doesn’t touch file data.