Comparison with Other Formats

This document compares barecat with other storage formats for large file collections.

Quick Comparison

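| Feature           | Barecat  | tar         | zip   | HDF5    |
|-------------------|----------|-------------|-------|---------|
| Random access     | O(1)     | O(n)        | O(n)* | O(1)    |
| Append files      | Yes      | Slow**      | No    | Yes     |
| Delete files      | Yes      | No          | No    | Partial |
| Millions of files | Yes      | Slow        | Slow  | Slow    |
| Browse            | Yes      | Stream only | Yes   | Yes     |
| Simple format     | Yes      | Yes         | Yes   | No      |
| Compression       | External | Yes         | Yes   | Yes     |
| Encryption        | No       | No          | Yes   | No      |

* zip must read its central directory, which is O(n) in the number of entries, before the first lookup.

** Appending to a tar requires scanning to the end of the archive first; a compressed tar must be rewritten in full.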

tar

tar (tape archive) is the classic Unix archiver.

Pros:

  • Universal support

  • Simple streaming format

  • Supports compression (gzip, bzip2, xz)

Cons:

  • No random access - Must scan sequentially to find a file

  • No index - File lookup is O(n)

  • No modification - Can only append, not delete or modify

Use tar when:

  • Distributing software releases

  • Creating backups for sequential restore

  • Maximum compatibility is needed

Use barecat instead when:

  • You need random access to individual files

  • You have millions of files

  • You’ll be reading files non-sequentially (ML training)
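The sequential-scan cost listed under Cons can be made concrete with Python's standard tarfile module. The sketch below is not presented as how barecat works internally; it only illustrates the general idea of a lookup index: scan the archive once, record each member's data offset and size, then serve later reads with a single seek. The helper names (build_tar, build_offset_index, read_with_index) are made up for this illustration.

```python
import io
import tarfile


def build_tar(files):
    """Return the bytes of an uncompressed tar holding a name -> bytes mapping."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def build_offset_index(fileobj):
    """Walk every header once (O(n)) and record where each member's data starts."""
    index = {}
    fileobj.seek(0)
    with tarfile.open(fileobj=fileobj, mode="r:") as tf:
        for member in tf:
            if member.isfile():
                index[member.name] = (member.offset_data, member.size)
    return index


def read_with_index(fileobj, index, name):
    """Fetch one member with a dict lookup and one seek, without touching headers."""
    offset, size = index[name]
    fileobj.seek(offset)
    return fileobj.read(size)


# Hypothetical demo data: 1000 small files in one in-memory archive.
files = {f"img/{i:04d}.txt": f"payload {i}".encode() for i in range(1000)}
archive = io.BytesIO(build_tar(files))

index = build_offset_index(archive)          # one-time sequential scan
data = read_with_index(archive, index, "img/0742.txt")
print(data)
```

The initial scan is still O(n), so an index like this only pays off when it is stored and reused across many lookups. It also works only for uncompressed archives, because byte offsets into a gzip, bzip2, or xz stream cannot be reached with a plain seek.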

zip

zip is widely used for compressed archives.

Pros:

  • Per-file compression

  • Central directory for file listing

  • Windows-native support

  • Optional encryption

Cons:

  • Central directory must be read - O(n) to load, memory overhead

  • No in-place modification - Deleting or replacing an entry requires rewriting the archive

  • Slow for huge archives - Millions of entries = slow startup

Use zip when:

  • Distributing files to end users

  • Need Windows compatibility

  • Need per-file compression

Use barecat instead when:

  • Archive has millions of files

  • Need fast startup (no central directory to load)

  • Need to add/delete files
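The startup cost noted under Cons is easy to observe with Python's standard zipfile module. In the sketch below, which reflects CPython's zipfile behavior rather than anything specific to barecat, opening the archive parses every central-directory record into a ZipInfo object before a single member is read. The archive contents and the build_zip helper are invented for the example.

```python
import io
import zipfile


def build_zip(n):
    """Return the bytes of a zip archive holding n small text entries."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode="w", compression=zipfile.ZIP_DEFLATED) as zf:
        for i in range(n):
            zf.writestr(f"docs/{i:05d}.txt", f"entry {i}")
    return buf.getvalue()


archive = build_zip(2000)

with zipfile.ZipFile(io.BytesIO(archive)) as zf:
    # The constructor has already read the whole central directory:
    # one ZipInfo object per entry is held in memory from here on.
    loaded_entries = len(zf.infolist())
    # Only now can a single member be looked up by name.
    one_file = zf.read("docs/01234.txt")

print(loaded_entries, one_file)
```

With a few thousand entries this is negligible. With millions, both the parse time and the memory held by the per-entry objects grow linearly, and every process that opens the archive pays the cost again.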

HDF5

HDF5 is a hierarchical data format popular in scientific computing.

Pros:

  • Complex data types (arrays, tables)

  • Chunking and compression

  • Parallel I/O support

Cons:

  • Complex format - The large feature set makes files and libraries harder to work with

  • Single-file limitation - All data in one file

  • Slow with many small items - Designed for large arrays

  • Hard to browse - Needs HDF5-specific tools such as h5ls or HDFView

Use HDF5 when:

  • Storing numerical arrays

  • Need compression of large blocks

  • Using the scientific Python stack

Use barecat instead when:

  • Storing many small files (images, text)

  • Need simple filesystem-like browsing

  • Want to avoid HDF5 complexity

LMDB

LMDB (Lightning Memory-Mapped Database) is a key-value store.

Pros:

  • Very fast reads

  • Memory-mapped access

  • ACID transactions

Cons:

  • Single-file - Can grow very large

  • Key-value only - No directory structure

  • Limited tooling - Need LMDB utilities

Use LMDB when:

  • Need maximum read speed

  • Data fits in a single file

  • Don’t need filesystem structure

Use barecat instead when:

  • Need sharding across multiple files

  • Want directory/path structure

  • Need to browse with standard tools
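To show what "no directory structure" means in practice, the sketch below emulates a listdir over a flat, ordered key space of the kind an ordered key-value store exposes. It makes no LMDB calls: a sorted Python list stands in for the ordered keys, and the listdir helper and the sample paths are hypothetical. The point is that directory listing has to be reconstructed from key prefixes by the application.

```python
import bisect


def listdir(sorted_keys, directory):
    """List the immediate children of `directory` from sorted '/'-separated keys.

    Subdirectories are returned with a trailing slash.
    """
    prefix = directory.rstrip("/") + "/" if directory else ""
    children = []
    i = bisect.bisect_left(sorted_keys, prefix)
    while i < len(sorted_keys) and sorted_keys[i].startswith(prefix):
        rest = sorted_keys[i][len(prefix):]
        child, sep, _ = rest.partition("/")
        children.append(child + sep)
        if sep:
            # Skip the whole subtree in one jump. "0" is the character right
            # after "/" in ASCII, so prefix + child + "0" sorts just past
            # every key that starts with prefix + child + "/".
            i = bisect.bisect_left(sorted_keys, prefix + child + "0")
        else:
            i += 1
    return children


# Hypothetical dataset layout stored as flat keys.
keys = sorted([
    "train/a/1.jpg",
    "train/a/2.jpg",
    "train/b/1.jpg",
    "train/readme.txt",
    "val/1.jpg",
])

print(listdir(keys, "train"))
print(listdir(keys, ""))
```

This works, but the path semantics live entirely in application code, which is the trade-off the Cons list describes.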

TFRecord

TFRecord is TensorFlow’s recommended data format.

Pros:

  • Optimized for TensorFlow

  • Supports sharding

  • Compression support

Cons:

  • TensorFlow-specific - Hard to use elsewhere

  • No random access - Sequential reads only

  • Binary protocol buffers - Hard to inspect

Use TFRecord when:

  • Exclusively using TensorFlow

  • Following TensorFlow tutorials

Use barecat instead when:

  • Using PyTorch or other frameworks

  • Need random access

  • Want to inspect files directly

Raw Files on Filesystem

Storing files directly on the filesystem.

Pros:

  • Maximum compatibility

  • Easy to inspect

  • No archive overhead

Cons:

  • Inode limits - Many filesystems cap the number of files they can hold

  • Metadata overhead - Each file has filesystem metadata

  • Network filesystem performance - Many small files mean many slow metadata operations

  • Backup complexity - Many files to track

Use raw files when:

  • Small number of files (< 100k)

  • Need direct tool access

  • Local SSD storage

Use barecat instead when:

  • Millions of files

  • Network filesystem (NFS, Lustre, GPFS)

  • Need to move or copy the dataset as a unit
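Whether the inode limit mentioned under Cons is a real concern on a given machine can be checked directly. The sketch below uses os.statvfs from the Python standard library, which is available on Unix-like systems only; the inode_usage helper name is made up for the example.

```python
import os


def inode_usage(path="."):
    """Return (total, free) inode counts for the filesystem containing `path`."""
    stats = os.statvfs(path)
    return stats.f_files, stats.f_ffree


total, free = inode_usage(".")
if total:
    used = total - free
    print(f"{used} of {total} inodes in use ({100 * used / total:.1f}%)")
else:
    # Some filesystems allocate inodes dynamically and report zero here.
    print("this filesystem does not report a fixed inode count")
```

If a dataset of several million files would consume a large share of the free inodes, that is a sign that packing the files into an archive is worth considering.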

WebDataset

WebDataset stores files as tar shards with naming conventions.

Pros:

  • Simple shard format

  • Streaming-friendly

  • Works with PyTorch

Cons:

  • No random access - Sequential shard reading

  • Naming conventions - Must follow patterns

  • No index - Can’t look up specific files

Use WebDataset when:

  • Streaming large datasets

  • Don’t need random access

  • Already using tar shards

Use barecat instead when:

  • Need random access

  • Want to look up specific files

  • Need filesystem-like operations
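The naming convention and the sequential access pattern listed under Cons can be sketched with the standard tarfile module alone. The code below does not use the webdataset library and is not its implementation. It assumes the commonly described convention that members sharing everything before the first dot of their basename form one sample and are stored next to each other in the shard. The helpers split_key, iter_samples, and make_shard are hypothetical.

```python
import io
import posixpath
import tarfile


def split_key(path):
    """Split 'dir/0001.seg.png' into key 'dir/0001' and extension 'seg.png'."""
    directory, base = posixpath.split(path)
    stem, _, ext = base.partition(".")
    return (f"{directory}/{stem}" if directory else stem), ext


def iter_samples(fileobj):
    """Yield one dict per sample by grouping consecutive members with the same key."""
    current_key, sample = None, {}
    # Mode "r|" reads the tar as a forward-only stream: no seeking, no index.
    with tarfile.open(fileobj=fileobj, mode="r|") as tf:
        for member in tf:
            if not member.isfile():
                continue
            key, ext = split_key(member.name)
            if key != current_key and sample:
                yield sample
                sample = {}
            current_key = key
            sample["__key__"] = key
            sample[ext] = tf.extractfile(member).read()
    if sample:
        yield sample


def make_shard(entries):
    """Build an in-memory tar shard from (name, bytes) pairs, for demonstration."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in entries:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


shard = make_shard([
    ("000.jpg", b"img0"), ("000.cls", b"3"),
    ("001.jpg", b"img1"), ("001.cls", b"7"),
])
samples = list(iter_samples(shard))
print(samples)
```

Reaching sample "001" here means reading past sample "000" first. That is acceptable for streaming through a whole shard, but it is the reason lookups of specific files are not practical with this layout.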

Summary Recommendations

Choose barecat when:

  • Dataset has millions of small files

  • You need random access (ML training with shuffling)

  • You want filesystem-like operations (listdir, walk)

  • You’re using network storage

  • You want to browse archives easily

Choose tar when:

  • Creating archives for distribution

  • Sequential access is fine

  • Maximum compatibility needed

Choose zip when:

  • Distributing to end users

  • Need Windows compatibility

  • Archive is not huge (< 100k files)

Choose HDF5 when:

  • Storing large numerical arrays

  • Using the scientific Python stack

  • Need internal compression

Choose raw files when:

  • Small dataset (< 100k files)

  • Direct tool access needed

  • Local SSD storage