Comparison with Other Formats

This document compares barecat with other storage formats for large file collections.

Quick Comparison

Feature	Barecat	tar	zip	HDF5
Random access	O(1)	O(n)	O(n)*	O(1)
Append files	Yes	Slow**	No	Yes
Delete files	Yes	No	No	Partial
Millions of files	Yes	Slow	Slow	Slow
Browse	Yes	Stream only	Yes	Yes
Simple format	Yes	Yes	Yes	No
Compression	External	Yes	Yes	Yes
Encryption	No	No	Yes	No

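* zip must read its central directory when the archive is opened, which is O(n) in the number of entries.
** Appending to a compressed tar (e.g. .tar.gz) requires rewriting the archive.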
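
The O(1) versus O(n) distinction in the table comes down to whether a format keeps an index that maps each path to a byte position. The following is a toy sketch of that general idea, using an in-memory dict as the index. The names `build_archive` and `read_file` are hypothetical, and this is not barecat's actual API or on-disk format.

```python
import io

def build_archive(files):
    """Concatenate file contents and record (offset, size) for each path."""
    data = io.BytesIO()
    index = {}
    for path, content in files.items():
        index[path] = (data.tell(), len(content))
        data.write(content)
    return data.getvalue(), index

def read_file(blob, index, path):
    """Fetch one file through the index, without scanning other entries."""
    offset, size = index[path]
    return blob[offset:offset + size]

files = {
    "img/0001.txt": b"first",
    "img/0002.txt": b"second",
    "labels/a.txt": b"cat",
}
blob, index = build_archive(files)
print(read_file(blob, index, "img/0002.txt"))
```

With an index like this, the cost of a lookup does not depend on how many other files the archive holds. A format without one has to walk entry headers until it reaches the requested file.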

tar

tar (tape archive) is the classic Unix archiver.

Pros:

Universal support
Simple streaming format
Supports compression (gzip, bzip2, xz)

Cons:

No random access - Must scan sequentially to find a file
No index - File lookup is O(n)
No in-place modification - Files can be appended, but deleting or changing one means rewriting the archive

Use tar when:

Distributing software releases
Creating backups for sequential restore
Maximum compatibility is needed

Use barecat instead when:

You need random access to individual files
You have millions of files
You’ll be reading files non-sequentially (e.g. shuffled ML training)
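
The sequential-scan cost can be seen with Python's standard `tarfile` module. The sketch below builds a small in-memory archive and counts how many member headers have to be visited before a given file is reached. The helper names `make_tar` and `find_member` are made up for this illustration.

```python
import io
import tarfile

def make_tar(n):
    """Build an uncompressed in-memory tar with n small members."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for i in range(n):
            payload = f"file {i}".encode()
            info = tarfile.TarInfo(name=f"data/{i:04d}.txt")
            info.size = len(payload)
            tf.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf

def find_member(buf, wanted):
    """Walk member headers in order until `wanted` is found.

    Returns the file content and the number of headers visited.
    """
    visited = 0
    with tarfile.open(fileobj=buf, mode="r") as tf:
        for member in tf:
            visited += 1
            if member.name == wanted:
                return tf.extractfile(member).read(), visited
    raise KeyError(wanted)

content, visited = find_member(make_tar(100), "data/0099.txt")
print(content, visited)
```

Reaching the last of 100 members means visiting all 100 headers, so the lookup cost grows linearly with the size of the archive.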

zip

zip is widely used for compressed archives.

Pros:

Per-file compression
Central directory for file listing
Windows-native support
Optional encryption

Cons:

Central directory must be read on open - O(n) in the number of entries to load, plus memory overhead
No in-place modification - Deleting or replacing an entry requires rewriting the archive
Slow for huge archives - Millions of entries = slow startup

Use zip when:

Distributing files to end users
Need Windows compatibility
Need per-file compression

Use barecat instead when:

Archive has millions of files
Need fast startup (no central directory to load)
Need to add/delete files
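
The startup cost mentioned above can be observed with Python's standard `zipfile` module, which parses the whole central directory when an archive is opened. This is a minimal sketch on a small in-memory archive, and `make_zip` is a hypothetical helper name.

```python
import io
import zipfile

def make_zip(n):
    """Build an in-memory zip with n small, individually compressed entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        for i in range(n):
            zf.writestr(f"data/{i:04d}.txt", f"file {i}")
    buf.seek(0)
    return buf

buf = make_zip(1000)
with zipfile.ZipFile(buf) as zf:
    # Every entry's metadata is already in memory right after opening.
    entries_loaded = len(zf.infolist())
    # Reading one entry afterwards goes straight to its recorded offset.
    content = zf.read("data/0999.txt")
print(entries_loaded, content)
```

At 1000 entries the load is negligible. The concern in this section is archives with millions of entries, where the same step has to finish before the first file can be read.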

HDF5

HDF5 is a hierarchical data format popular in scientific computing.

Pros:

Complex data types (arrays, tables)
Chunking and compression
Parallel I/O support

Cons:

Complex format - The large feature set makes the format and its libraries harder to work with
Single-file limitation - All data in one file
Slow with many small items - Designed for large arrays
Hard to browse - Needs HDF5-specific tools (e.g. h5ls, HDFView)

Use HDF5 when:

Storing numerical arrays
Need compression of large blocks
Using scientific Python stack

Use barecat instead when:

Storing many small files (images, text)
Need simple filesystem-like browsing
Want to avoid HDF5 complexity

LMDB

LMDB (Lightning Memory-Mapped Database) is a key-value store.

Pros:

Very fast reads
Memory-mapped access
ACID transactions

Cons:

Single-file - Can grow very large
Key-value only - No directory structure
Limited tooling - Need LMDB utilities

Use LMDB when:

Need maximum read speed
Data fits in a single file
Don’t need filesystem structure

Use barecat instead when:

Need sharding across multiple files
Want directory/path structure
Need to browse with standard tools

TFRecord

TFRecord is TensorFlow’s recommended data format.

Pros:

Optimized for TensorFlow
Supports sharding
Compression support

Cons:

TensorFlow-specific - Hard to use elsewhere
No random access - Sequential reads only
Binary protocol buffers - Hard to inspect

Use TFRecord when:

Exclusively using TensorFlow
Following TensorFlow tutorials

Use barecat instead when:

Using PyTorch or other frameworks
Need random access
Want to inspect files directly

Raw Files on Filesystem

Storing files directly on the filesystem.

Pros:

Maximum compatibility
Easy to inspect
No archive overhead

Cons:

Inode limits - Filesystems limit file count
Metadata overhead - Each file has filesystem metadata
Network FS performance - Many small files = slow
Backup complexity - Many files to track

Use raw files when:

Small number of files (< 100k)
Need direct tool access
Local SSD storage

Use barecat instead when:

Millions of files
Network filesystem (NFS, Lustre, GPFS)
Need to move/copy dataset as unit
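
To see how close a filesystem is to its inode limit, the standard library exposes the relevant counters on Unix-like systems. This is a small sketch with a hypothetical helper name, `inode_usage`. Note that `os.statvfs` is not available on Windows, and some filesystems report 0 for the total because they allocate inodes dynamically.

```python
import os

def inode_usage(path="."):
    """Return (total_inodes, free_inodes) for the filesystem containing `path`."""
    st = os.statvfs(path)
    return st.f_files, st.f_ffree

total, free = inode_usage(".")
print(f"inodes: {total} total, {free} free")
```

Every file stored directly on the filesystem uses one inode, whereas an archive uses only as many as it has files of its own.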

WebDataset

WebDataset stores files as tar shards with naming conventions.

Pros:

Simple shard format
Streaming-friendly
Works with PyTorch

Cons:

No random access - Sequential shard reading
Naming conventions - Must follow patterns
No index - Can’t look up specific files

Use WebDataset when:

Streaming large datasets
Don’t need random access
Already using tar shards

Use barecat instead when:

Need random access
Want to look up specific files
Need filesystem-like operations
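
The naming convention mentioned above groups tar members that share a basename into one sample, with the extension acting as the field name. The sketch below imitates that grouping using only the standard `tarfile` module in streaming mode. It does not use the WebDataset library itself, and `make_shard` and `iter_samples` are hypothetical helper names.

```python
import io
import posixpath
import tarfile

def make_shard(samples):
    """Write (key, {ext: bytes}) samples into an in-memory tar shard."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for key, fields in samples:
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{key}.{ext}")
                info.size = len(payload)
                tf.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf

def iter_samples(buf):
    """Stream a shard once, grouping consecutive members that share a key."""
    current_key, current = None, {}
    with tarfile.open(fileobj=buf, mode="r|") as tf:  # "r|" = forward-only stream
        for member in tf:
            if not member.isfile():
                continue
            # Key = directory plus basename up to its first dot.
            dirname, base = posixpath.split(member.name)
            stem, _, ext = base.partition(".")
            key = posixpath.join(dirname, stem)
            payload = tf.extractfile(member).read()
            if key != current_key and current:
                yield current_key, current
                current = {}
            current_key = key
            current[ext] = payload
    if current:
        yield current_key, current

samples = [
    ("train/000000", {"jpg": b"img0", "cls": b"3"}),
    ("train/000001", {"jpg": b"img1", "cls": b"7"}),
]
result = list(iter_samples(make_shard(samples)))
print(result)
```

The reader only ever moves forward through the shard. That suits streaming, but fetching one specific sample still means reading everything stored before it.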

Summary Recommendations

Choose barecat when:

Dataset has millions of small files
You need random access (ML training with shuffling)
You want filesystem-like operations (listdir, walk)
You’re using network storage
You want to browse archives easily

Choose tar when:

Creating archives for distribution
Sequential access is fine
Maximum compatibility needed

Choose zip when:

Distributing to end users
Need Windows compatibility
Archive is not huge (< 100k files)

Choose HDF5 when:

Storing large numerical arrays
Using scientific Python stack
Need internal compression

Choose raw files when:

Small dataset (< 100k files)
Direct tool access needed
Local SSD storage