Performance Characteristics
This document explains barecat’s performance characteristics and optimization strategies.
Lookup Performance
File lookup by path goes through the SQLite index and is effectively constant time:
# Near-constant time in practice, even for very large archives
data = bc['path/to/file.jpg']
Strictly speaking, the SQLite B-tree index on files.path gives O(log n) lookup,
but with SQLite’s page caching and memory-mapping, this approaches constant
time for typical workloads.
Benchmark (1M files):
First lookup: ~1ms (cold cache)
Subsequent lookups: ~0.1ms (warm cache)
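The lookup path can be illustrated with plain sqlite3. This is a minimal sketch, not barecat's actual schema: the files table layout, the addr column, and the lookup helper are assumptions made for illustration. It shows that a query on an indexed path column is answered by an index search rather than a full table scan.

```python
import sqlite3

# Hypothetical, simplified index schema (not barecat's real one).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE files (path TEXT PRIMARY KEY, shard INTEGER, addr INTEGER, size INTEGER)"
)
conn.executemany(
    "INSERT INTO files VALUES (?, ?, ?, ?)",
    ((f"dir{i % 100}/file{i}.jpg", 0, i * 1024, 1024) for i in range(10_000)),
)
conn.commit()


def lookup(path):
    """Return (shard, addr, size) for a path, or None if it is absent."""
    return conn.execute(
        "SELECT shard, addr, size FROM files WHERE path = ?", (path,)
    ).fetchone()


# Ask SQLite how it would execute the lookup; the last column is the description.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT shard, addr, size FROM files WHERE path = ?",
    ("dir1/file1.jpg",),
).fetchall()
plan_detail = " ".join(row[-1] for row in plan)
print(plan_detail)
print(lookup("dir5/file5.jpg"))
```

The exact wording of the plan differs between SQLite versions, but it should mention an index search on path.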
Directory Listing
Directory operations use the parent column index:
# O(k) where k = number of entries in directory
entries = bc.listdir('some/directory')
The idx_files_parent and idx_dirs_parent indexes make this efficient.
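To see why a parent-column index keeps listing cost proportional to the directory size, here is a small sqlite3 sketch. The index name idx_files_parent comes from the text above; the two-column table and the listdir helper are simplified assumptions, not barecat's real schema or API.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: each file row stores its parent directory.
conn.execute("CREATE TABLE files (path TEXT PRIMARY KEY, parent TEXT)")
conn.execute("CREATE INDEX idx_files_parent ON files (parent)")

rows = [(f"some/directory/img{i}.jpg", "some/directory") for i in range(3)]
rows += [(f"other/img{i}.jpg", "other") for i in range(1000)]
conn.executemany("INSERT INTO files VALUES (?, ?)", rows)
conn.commit()


def listdir(directory):
    """List file names directly inside `directory` via the parent index."""
    cur = conn.execute(
        "SELECT path FROM files WHERE parent = ? ORDER BY path", (directory,)
    )
    return [path.rsplit("/", 1)[-1] for (path,) in cur]


entries = listdir("some/directory")
print(entries)
```

Only the rows whose parent matches are visited, so the 1000 files under other/ do not affect the cost of listing some/directory.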
Walking the entire archive:
# O(n) where n = total files
for root, dirs, files in bc.walk(''):
    pass
Sequential vs Random Access
Sequential access (reading in physical order):
from barecat import Order
# Optimal for HDDs, good for SSDs
for f in bc.index.iter_all_fileinfos(order=Order.ADDRESS):
    data = bc[f.path]
This minimizes disk seeks by reading files in the order they’re stored.
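The same idea applies when only a subset of files is needed: sorting the subset by physical location before reading keeps the access pattern mostly forward-moving. The sketch below is illustrative only; the Entry record and its shard and offset fields are hypothetical stand-ins, not barecat's file info type.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    # Hypothetical record; the field names are illustrative assumptions.
    path: str
    shard: int
    offset: int


def in_address_order(entries):
    """Return entries sorted by physical location: shard first, then offset."""
    return sorted(entries, key=lambda e: (e.shard, e.offset))


entries = [Entry(f"f{i}", shard=i % 2, offset=(i // 2) * 4096) for i in range(8)]
subset = random.Random(0).sample(entries, 4)  # arbitrary subset in arbitrary order
ordered = in_address_order(subset)            # same subset, seek-friendly order
for e in ordered:
    print(e.shard, e.offset, e.path)
```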
Random access (shuffled order):
import random
paths = list(bc.index.iter_all_paths())
random.shuffle(paths)
for path in paths:
    data = bc[path]
Performance depends heavily on storage:
SSD: Random access is nearly as fast as sequential
HDD: Random access can be 100x slower than sequential
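A rough way to check how a particular disk behaves is to time the same set of block reads in sequential and in shuffled order. The following is a minimal sketch using a temporary file and hypothetical helper names; because the file was just written, it will usually be served from the OS page cache, so the two timings may be close. For a meaningful measurement, point it at a large existing file on the storage in question.

```python
import os
import random
import tempfile
import time


def timed_reads(path, offsets, size):
    """Read `size` bytes at each offset; return (seconds, total_bytes_read)."""
    total = 0
    start = time.perf_counter()
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            total += len(f.read(size))
    return time.perf_counter() - start, total


block = 4096
n_blocks = 2048
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(os.urandom(block * n_blocks))
    tmp_path = tmp.name

sequential = [i * block for i in range(n_blocks)]
shuffled = sequential[:]
random.Random(0).shuffle(shuffled)

seq_time, seq_bytes = timed_reads(tmp_path, sequential, block)
rand_time, rand_bytes = timed_reads(tmp_path, shuffled, block)
os.remove(tmp_path)
print(f"sequential: {seq_time:.4f}s, shuffled: {rand_time:.4f}s")
```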
Memory Usage
Barecat uses memory-mapping for large reads:
PRAGMA mmap_size = 30000000000; -- ~30GB mmap window
This allows the OS to manage file caching efficiently.
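The value requested with PRAGMA mmap_size is an upper bound, and SQLite may clamp it to a compile-time maximum. The sketch below, using only the standard sqlite3 module and a throwaway database file, shows how to read back the effective value; the 256 MiB request and the file name are arbitrary choices for the demo.

```python
import os
import sqlite3
import tempfile

requested = 256 * 1024 * 1024  # 256 MiB, chosen only for this demo

with tempfile.TemporaryDirectory() as tmpdir:
    conn = sqlite3.connect(os.path.join(tmpdir, "index.sqlite"))
    # The pragma returns the limit actually in effect; some builds
    # return no row at all when memory-mapped I/O is compiled out.
    row = conn.execute(f"PRAGMA mmap_size = {requested}").fetchone()
    effective = row[0] if row else None
    conn.close()

print("requested:", requested, "effective:", effective)
```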
Memory overhead per open archive:
SQLite connection: ~1MB
File handles: ~1KB per open shard
Index cache: Managed by SQLite
For threadsafe=True, each thread/process gets its own connections,
multiplying memory usage.
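As a back-of-the-envelope check, the per-archive figures above can be multiplied out. The function below is a rough sketch with a hypothetical name; it uses the approximate numbers from this section and ignores the SQLite page cache, which SQLite manages on its own.

```python
SQLITE_CONNECTION_BYTES = 1_000_000  # ~1MB per connection (figure from above)
SHARD_HANDLE_BYTES = 1_000           # ~1KB per open shard (figure from above)


def estimated_overhead_bytes(n_workers, n_open_shards):
    """Estimate fixed overhead when every worker holds its own resources."""
    per_worker = SQLITE_CONNECTION_BYTES + n_open_shards * SHARD_HANDLE_BYTES
    return n_workers * per_worker


# 8 workers, each with 20 open shards
print(estimated_overhead_bytes(8, 20))
```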
Write Performance
Single file writes:
bc['file.txt'] = data # Append to shard + index insert
Write performance is limited by:
SQLite transaction overhead
fsync for durability
Bulk writes (disable triggers for speed):
with bc.index.no_triggers():
    for path, data in items:
        bc[path] = data
bc.index.update_treestats()  # Rebuild after bulk insert
This can be 10-100x faster for large imports.
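The trade-off behind this can be shown with a toy sqlite3 example: a trigger updates directory statistics once per inserted row, whereas a bulk import can skip the trigger and recompute the statistics with a single aggregate pass afterwards. The schema, trigger, and rebuild query below are hypothetical and much simpler than barecat's; the point is only that both routes arrive at the same statistics.

```python
import sqlite3

SCHEMA = """
CREATE TABLE files (path TEXT PRIMARY KEY, parent TEXT, size INTEGER);
CREATE TABLE dirs (path TEXT PRIMARY KEY,
                   num_files INTEGER DEFAULT 0,
                   total_size INTEGER DEFAULT 0);
"""
# Per-row bookkeeping: runs once for every inserted file.
TRIGGER = """
CREATE TRIGGER add_file AFTER INSERT ON files BEGIN
    UPDATE dirs SET num_files = num_files + 1,
                    total_size = total_size + NEW.size
    WHERE path = NEW.parent;
END;
"""
# One-shot rebuild: recompute every directory's stats from the files table.
REBUILD = """
UPDATE dirs SET
    num_files = (SELECT COUNT(*) FROM files WHERE parent = dirs.path),
    total_size = (SELECT COALESCE(SUM(size), 0) FROM files WHERE parent = dirs.path);
"""


def build(items, use_trigger):
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO dirs (path) VALUES (?)", [("a",), ("b",)])
    if use_trigger:
        conn.execute(TRIGGER)
    conn.executemany("INSERT INTO files VALUES (?, ?, ?)", items)
    if not use_trigger:
        conn.execute(REBUILD)
    conn.commit()
    return conn.execute(
        "SELECT path, num_files, total_size FROM dirs ORDER BY path"
    ).fetchall()


items = [(f"{d}/f{i}", d, 10 * (i + 1)) for d in "ab" for i in range(3)]
with_triggers = build(items, use_trigger=True)
bulk_then_rebuild = build(items, use_trigger=False)
print(with_triggers, bulk_then_rebuild)
```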
Multi-Process Performance
With threadsafe=True, each DataLoader worker gets isolated resources:
bc = barecat.Barecat('data.barecat', threadsafe=True)
# Each worker has:
# - Own SQLite connection
# - Own file handles
# - No locking overhead
Throughput scales close to linearly with the number of workers until I/O saturates.
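One common way to give each thread its own connection is thread-local storage. The sketch below shows that general pattern with sqlite3; it is not a description of how barecat implements threadsafe=True, and the PerThreadConnection class is hypothetical.

```python
import sqlite3
import threading


class PerThreadConnection:
    """Lazily open one SQLite connection per thread (illustrative pattern)."""

    def __init__(self, db_path):
        self._db_path = db_path
        self._local = threading.local()

    def get(self):
        conn = getattr(self._local, "conn", None)
        if conn is None:
            # First use in this thread: open a private connection.
            conn = sqlite3.connect(self._db_path)
            self._local.conn = conn
        return conn


holder = PerThreadConnection(":memory:")
seen = []  # keeps the connections alive so they can be compared afterwards


def worker():
    seen.append(holder.get())  # list.append is atomic in CPython


threads = [threading.Thread(target=worker) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len({id(c) for c in seen}), "distinct connections")
```

Because no connection is shared, threads never wait on each other for a connection lock, at the cost of one connection's worth of memory per thread.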
Benchmark (1 to 8 workers, SSD):
Workers  Throughput (images/sec)
1        2,500
2        5,000
4        9,500
8        15,000 (I/O limited)
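Scaling efficiency, meaning measured throughput divided by ideal linear throughput, makes the saturation point easier to see. The short calculation below uses the benchmark figures above; the function name is just for illustration.

```python
# Throughput in images/sec, keyed by worker count (figures from the benchmark above).
benchmark = {1: 2500, 2: 5000, 4: 9500, 8: 15000}


def scaling_efficiency(results):
    """Return throughput relative to perfect linear scaling from one worker."""
    base = results[1]
    return {workers: tput / (workers * base) for workers, tput in results.items()}


efficiency = scaling_efficiency(benchmark)
for workers, eff in efficiency.items():
    print(f"{workers} workers: {eff:.0%} of linear")
```

Efficiency stays at or near 100% through 4 workers and drops to 75% at 8, which matches the I/O-limited note.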
Network Filesystem Considerations
Barecat was designed for network filesystems (NFS, Lustre, GPFS).
Why it helps:
Fewer files = less metadata overhead
Sequential reads within shards
SQLite index can be cached locally
Optimization tips:
Copy index to local SSD if possible:
cp archive.barecat /local/tmp/
ln -sf /network/archive.barecat-shard-* /local/tmp/
# Use /local/tmp/archive.barecat
Use read-ahead for sequential access
Consider readonly_is_immutable=True for better caching
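The copy-and-symlink step from the first tip can also be scripted in Python, for example at the start of a training job. This is a sketch under assumptions: stage_locally is a hypothetical helper, and the demo uses temporary directories and made-up shard suffixes in place of the real network and local paths.

```python
import glob
import os
import shutil
import tempfile


def stage_locally(network_index, local_dir):
    """Copy the index file into local_dir and symlink the shards next to it."""
    os.makedirs(local_dir, exist_ok=True)
    local_index = os.path.join(local_dir, os.path.basename(network_index))
    shutil.copy2(network_index, local_index)
    for shard in sorted(glob.glob(glob.escape(network_index) + "-shard-*")):
        link = os.path.join(local_dir, os.path.basename(shard))
        if os.path.lexists(link):
            os.remove(link)  # replace an existing link, like `ln -sf`
        os.symlink(os.path.abspath(shard), link)
    return local_index


# Demo: temporary directories stand in for the network and local filesystems.
with tempfile.TemporaryDirectory() as root:
    network = os.path.join(root, "network")
    local = os.path.join(root, "local")
    os.makedirs(network)
    index_path = os.path.join(network, "archive.barecat")
    with open(index_path, "wb") as f:
        f.write(b"index")
    for i in range(2):  # shard suffixes here are made up for the demo
        with open(f"{index_path}-shard-{i:05d}", "wb") as f:
            f.write(b"shard")
    local_index = stage_locally(index_path, local)
    staged = sorted(os.listdir(local))
    links_ok = all(
        os.path.islink(os.path.join(local, name)) for name in staged if "-shard-" in name
    )
    index_is_copy = not os.path.islink(local_index)

print(staged)
```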
Comparison with Raw Files
Reading 1M small files (1KB each):
Method           Time (sec)  Notes
Raw files (SSD)  120         1M stat + open + read
Raw files (NFS)  3600+       Network metadata overhead
Barecat (SSD)    30          Single index + sequential read
Barecat (NFS)    60          Index cached, shard streaming
The improvement comes from:
Single SQLite lookup vs filesystem metadata
Sequential shard read vs many small reads
No directory traversal overhead
Profiling Tips
Identify I/O bottlenecks:
import cProfile
import pstats
with cProfile.Profile() as pr:
    for i in range(1000):
        data = bc[paths[i]]
stats = pstats.Stats(pr)
stats.sort_stats('cumulative')
stats.print_stats(20)
Monitor with iostat:
iostat -x 1 # Watch disk utilization during training
PyTorch profiler:
import torch
# `loader` is an existing DataLoader reading from the archive
with torch.profiler.profile(
    activities=[torch.profiler.ProfilerActivity.CPU],
    with_stack=True,
) as prof:
    for batch in loader:
        pass
print(prof.key_averages().table(sort_by="cpu_time_total"))
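Profilers add overhead to every call, so for per-read latency numbers (such as cold versus warm lookups) plain wall-clock timing is a lighter alternative. The helpers below are a hypothetical sketch; a dict stands in for the archive so the example runs on its own, and in practice `read` would be something like a function that returns bc[path].

```python
import statistics
import time


def measure_latencies(read, keys):
    """Time read(key) for each key; return per-call latencies in seconds."""
    latencies = []
    for key in keys:
        start = time.perf_counter()
        read(key)
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Report the first (possibly cold) call, the median, and the 99th percentile."""
    ordered = sorted(latencies)
    p99_index = min(len(ordered) - 1, int(0.99 * len(ordered)))
    return {
        "first": latencies[0],
        "median": statistics.median(ordered),
        "p99": ordered[p99_index],
    }


store = {f"file{i}": bytes(1024) for i in range(1000)}  # stand-in for an archive
stats = summarize(measure_latencies(store.__getitem__, list(store)))
print({name: f"{value * 1e6:.2f} us" for name, value in stats.items()})
```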
Optimization Checklist
Use SSDs if possible
Set threadsafe=True for multi-worker DataLoader
Use Order.ADDRESS for sequential workloads on HDDs
Disable triggers for bulk imports
Size shards appropriately (10-50GB typical)
Cache index locally on network filesystems
Profile before optimizing - measure, don’t guess