How to Convert Archives#
This guide covers converting between barecat and traditional archive formats (tar, zip).
Converting tar/zip to Barecat#
Basic Conversion#
# From tar.gz
barecat convert dataset.tar.gz dataset.barecat
# From zip
barecat convert dataset.zip dataset.barecat
# From uncompressed tar
barecat convert dataset.tar dataset.barecat
With Shard Size Limit#
barecat convert dataset.tar.gz dataset.barecat -s 50G
Streaming from stdin#
For very large archives or when piping from another command:
# Pipe from curl
curl -s https://example.com/data.tar.gz | \
barecat convert --stdin tar.gz dataset.barecat
# Pipe from decompression
zstd -d -c data.tar.zst | barecat convert --stdin tar dataset.barecat
# Supported formats: tar, tar.gz, tar.bz2, tar.xz
Converting Barecat to tar/zip#
Basic Conversion#
# To tar.gz
barecat convert dataset.barecat dataset.tar.gz
# To plain tar
barecat convert dataset.barecat dataset.tar
# To zip
barecat convert dataset.barecat dataset.zip
With Root Directory#
Many tar archives wrap all files in a root directory. To replicate this:
barecat convert --root-dir myproject dataset.barecat dataset.tar.gz
This produces a tar where all paths are prefixed with myproject/.
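To see what a root-directory prefix looks like inside an archive, the following minimal sketch uses only Python's standard `tarfile` module. It does not call barecat. `build_tar` and `single_root_dir` are hypothetical helper names introduced here for illustration: the first writes an in-memory tar with an optional prefix, the second reports whether every member of a tar shares one top-level directory.

```python
import io
import tarfile
from pathlib import PurePosixPath


def build_tar(files, root_dir=""):
    """Build an in-memory tar from {name: bytes}, optionally prefixing every path."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for name, data in files.items():
            arcname = f"{root_dir}/{name}" if root_dir else name
            info = tarfile.TarInfo(arcname)
            info.size = len(data)
            tf.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def single_root_dir(tar_bytes):
    """Return the top-level directory shared by all members, or None."""
    roots = set()
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as tf:
        for member in tf:
            parts = PurePosixPath(member.name).parts
            if not parts:
                continue  # skip a bare "." entry
            # A file directly at the top level means there is no wrapping directory.
            if len(parts) == 1 and not member.isdir():
                return None
            roots.add(parts[0])
    return roots.pop() if len(roots) == 1 else None
```

Running `single_root_dir` on an existing tar before converting it shows whether its paths carry such a prefix.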
Streaming to stdout#
# Stream to file
barecat convert --stdout dataset.barecat tar.gz > dataset.tar.gz
# Pipe to another command
barecat convert --stdout dataset.barecat tar | ssh remote "tar -xf -"
# Pipe to compression tool
barecat convert --stdout dataset.barecat tar | zstd -o dataset.tar.zst
Zero-Copy Wrapping#
For uncompressed tar or zip files, barecat can create an index without copying data. The original file becomes the shard (via symlink).
barecat convert --wrap dataset.tar dataset.barecat
This creates:
dataset.barecat - The SQLite index database
dataset.barecat-shard-00000 - Symlink to dataset.tar
Requirements for --wrap:
tar must be uncompressed (not .tar.gz, .tar.bz2, etc.)
zip must be uncompressed (created with zip -0)
To create an uncompressed zip:
zip -0 -r dataset.zip directory/
Checking if wrap is possible:
# This will error with a helpful message if not possible
barecat convert --wrap compressed.tar.gz out.barecat
# Error: Cannot wrap compressed file (gzip)...
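If you would rather check an archive before invoking barecat, the requirements above can be approximated with the standard library. This is a sketch, not barecat's own check: `compression_of` and `zip_is_stored_only` are hypothetical helpers. The first looks at the leading magic bytes of a file, the second tests whether every zip entry uses the "stored" method.

```python
import zipfile

# Leading magic bytes of common compressed formats.
_MAGIC = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
    b"\x28\xb5\x2f\xfd": "zstd",
}


def compression_of(path):
    """Return the compression detected from the file header, or None."""
    with open(path, "rb") as f:
        head = f.read(6)
    for magic, name in _MAGIC.items():
        if head.startswith(magic):
            return name
    return None


def zip_is_stored_only(path):
    """True if every entry in the zip is stored without compression."""
    with zipfile.ZipFile(path) as zf:
        return all(
            info.compress_type == zipfile.ZIP_STORED for info in zf.infolist()
        )
```

A tar for which `compression_of` returns `None`, or a zip for which `zip_is_stored_only` returns `True`, meets the requirements as listed here. barecat may apply additional checks of its own.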
Python API#
from barecat import archive2barecat, barecat2archive
from barecat.cli.impl import wrap_archive
# tar.gz to barecat, with a shard size limit of 50 * 1024**3 bytes
archive2barecat('data.tar.gz', 'data.barecat', shard_size_limit=50*1024**3)
# barecat to tar.gz, with all paths prefixed by myproject/
barecat2archive('data.barecat', 'data.tar.gz', root_dir='myproject')
# Zero-copy wrap
wrap_archive('data.tar', 'data.barecat')
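The `shard_size_limit` argument above is given in bytes, whereas the CLI examples use strings such as `50G`. The sketch below is a hypothetical helper, not part of barecat, that turns such a string into a byte count. It assumes binary multiples (1G = 1024**3 bytes), which matches the `50*1024**3` value above, but this guide does not state which convention the CLI uses.

```python
import re

# Assumption: binary multiples, as in the 50*1024**3 example.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a string such as '50G' into a number of bytes."""
    match = re.fullmatch(r"\s*(\d+)\s*([KMGT]?)\s*", text.upper())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    number, unit = match.groups()
    return int(number) * _UNITS[unit]
```

The result of `parse_size('50G')` could then be passed as `shard_size_limit`.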
Handling Large Archives#
For very large archives (TBs), consider:
Use streaming - Avoids loading the entire archive into memory
cat huge.tar.gz | barecat convert --stdin tar.gz output.barecat
Set shard size - Smaller shards are easier to handle
barecat convert huge.tar.gz output.barecat -s 10G
Use --wrap for uncompressed archives - Zero-copy, instant
barecat convert --wrap huge.tar output.barecat
Troubleshooting#
“Cannot wrap compressed file”#
The --wrap option only works with uncompressed archives. Decompress first:
gunzip dataset.tar.gz
barecat convert --wrap dataset.tar dataset.barecat
“ZIP has compressed entries”#
The zip file uses compression. Either:
Convert without --wrap (copies data):
barecat convert dataset.zip dataset.barecat
Create an uncompressed zip:
unzip dataset.zip -d temp/
zip -0 -r dataset_uncompressed.zip temp/
barecat convert --wrap dataset_uncompressed.zip dataset.barecat
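As an alternative to extracting to a temporary directory, a zip can be rewritten entry by entry with Python's `zipfile` module. This is a sketch under assumptions: `rewrite_zip_stored` is a hypothetical helper, it reads each entry fully into memory, and its output has not been verified against `barecat convert --wrap`.

```python
import zipfile


def rewrite_zip_stored(src, dst):
    """Copy every entry of zip `src` into a new zip `dst` without compression."""
    with zipfile.ZipFile(src) as zin, \
            zipfile.ZipFile(dst, "w", zipfile.ZIP_STORED) as zout:
        for info in zin.infolist():
            # Keep the name, timestamp and permissions; force the 'stored' method.
            new_info = zipfile.ZipInfo(info.filename, date_time=info.date_time)
            new_info.compress_type = zipfile.ZIP_STORED
            new_info.external_attr = info.external_attr
            zout.writestr(new_info, zin.read(info))
```

Unlike the `unzip` plus `zip -0` route, this keeps the original entry paths unchanged and does not add a `temp/` prefix.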
See Also#
How to Merge Archives - Merge multiple archives
Command Line Interface - CLI reference for barecat convert