Performance Characteristics#

This document explains barecat’s performance characteristics and optimization strategies.

Lookup Performance#

File lookup by path is O(1) via SQLite index:

# Constant time regardless of archive size
data = bc['path/to/file.jpg']

The SQLite B-tree index on files.path provides logarithmic lookup, but with SQLite’s page caching and memory-mapping, this approaches constant time for typical workloads.

Benchmark (1M files):

First lookup: ~1ms (cold cache)
Subsequent lookups: ~0.1ms (warm cache)

Directory Listing#

Directory operations use the parent column index:

# O(k) where k = number of entries in directory
entries = bc.listdir('some/directory')

The idx_files_parent and idx_dirs_parent indexes make this efficient.

Walking the entire archive:

# O(n) where n = total files
for root, dirs, files in bc.walk(''):
    pass

Sequential vs Random Access#

Sequential access (reading in physical order):

from barecat import Order

# Optimal for HDDs, good for SSDs
for f in bc.index.iter_all_fileinfos(order=Order.ADDRESS):
    data = bc[f.path]

This minimizes disk seeks by reading files in the order they’re stored.

Random access (shuffled order):

import random

paths = list(bc.index.iter_all_paths())
random.shuffle(paths)
for path in paths:
    data = bc[path]

Performance depends heavily on storage:

SSD: Random access is nearly as fast as sequential
HDD: Random access can be 100x slower than sequential

Memory Usage#

Barecat uses memory-mapping for large reads:

PRAGMA mmap_size = 30000000000  # ~30GB mmap window

This allows the OS to manage file caching efficiently.

Memory overhead per open archive:

SQLite connection: ~1MB
File handles: ~1KB per open shard
Index cache: Managed by SQLite

For threadsafe=True, each thread/process gets its own connections, multiplying memory usage.

Write Performance#

Single file writes:

bc['file.txt'] = data  # Append to shard + index insert

Write performance is limited by:

SQLite transaction overhead
fsync for durability

Bulk writes (disable triggers for speed):

with bc.index.no_triggers():
    for path, data in items:
        bc[path] = data
bc.index.update_treestats()  # Rebuild after bulk insert

This can be 10-100x faster for large imports.

Sharding Impact#

Shards affect performance in several ways:

Too many shards (small shard size):

More file handles to manage
More metadata in index
Good for distribution

Too few shards (large shard size):

Single points of failure
Harder to move around
Better sequential read performance

Recommended shard sizes:

10-50GB for most use cases
1-10GB if distribution is important
Unlimited for small archives

Multi-Process Performance#

With threadsafe=True, each DataLoader worker gets isolated resources:

bc = barecat.Barecat('data.barecat', threadsafe=True)

# Each worker has:
# - Own SQLite connection
# - Own file handles
# - No locking overhead

This scales linearly with workers up to I/O saturation.

Benchmark (8 workers, SSD):

Workers  Throughput (images/sec)
      2,500
      5,000
      9,500
      15,000 (I/O limited)

Network Filesystem Considerations#

Barecat was designed for network filesystems (NFS, Lustre, GPFS).

Why it helps:

Fewer files = less metadata overhead
Sequential reads within shards
SQLite index can be cached locally

Optimization tips:

Copy index to local SSD if possible:

cp archive.barecat /local/tmp/
ln -sf /network/archive.barecat-shard-* /local/tmp/
# Use /local/tmp/archive.barecat

Use read-ahead for sequential access
Consider readonly_is_immutable=True for better caching

Comparison with Raw Files#

Reading 1M small files (1KB each):

Method              Time (sec)    Notes
Raw files (SSD)     120           1M stat + open + read
Raw files (NFS)     3600+         Network metadata overhead
Barecat (SSD)       30            Single index + sequential read
Barecat (NFS)       60            Index cached, shard streaming

The improvement comes from:

Single SQLite lookup vs filesystem metadata
Sequential shard read vs many small reads
No directory traversal overhead

Profiling Tips#

Identify I/O bottlenecks:

import cProfile
import pstats

with cProfile.Profile() as pr:
    for i in range(1000):
        data = bc[paths[i]]

stats = pstats.Stats(pr)
stats.sort_stats('cumulative')
stats.print_stats(20)

Monitor with iostat:

iostat -x 1  # Watch disk utilization during training

PyTorch profiler:

with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    with_stack=True,
) as prof:
    for batch in loader:
        pass

print(prof.key_averages().table(sort_by="cpu_time_total"))

Optimization Checklist#

Use SSDs if possible
Set threadsafe=True for multi-worker DataLoader
Use Order.ADDRESS for sequential workloads on HDDs
Disable triggers for bulk imports
Size shards appropriately (10-50GB typical)
Cache index locally on network filesystems
Profile before optimizing - measure, don’t guess