Efficiently retrieving a file from an archive

Is there a smarter tar or cpio out there, or a smarter way to archive, to efficiently retrieve a file stored in the archive?

I am using tar to archive a group of very large (multi-GB) bz2 files. If I use tar -tf file.tar to list the files within the archive, this takes a very long time to complete (~10-15 minutes). Likewise, cpio -t < file.cpio takes just as long to complete, plus or minus a few seconds. Accordingly, retrieving a file from an archive (via tar -xf file.tar myFileOfInterest, for example) is just as slow.

Is there an archival method out there that keeps a readily available "catalog" with the archive, so that an individual file within the archive can be retrieved quickly? For example, some kind of catalog that stores a pointer to a particular byte in the archive, as well as the size of the file to be retrieved (as well as any other filesystem-specific particulars). Is there a tool (or argument to tar or cpio) that allows efficient retrieval of a file within the archive?
Answer:

There is a newer utility called dar, a sort of modernization of tar aimed at disk archiving instead of tape archiving, which might be faster. It's not included in most *nix distributions by default, but I've had good luck compiling it (or maybe I just found precompiled binaries) on both Linux and Mac OS X. I don't know for a fact that it's any faster than tar at retrieving files, but the archive format is different and is aimed at modern storage devices rather than linear tape, so it wouldn't surprise me if it were. It normally compresses files by default (unlike tar), but it has a command-line switch to exclude certain file extensions like .gz or .bz2. If I get a chance tonight I'll run some tests and see if dar is faster at retrieving files from an archive.

Blazecock Pileon at Ask.Metafilter.Com


Other answers

I'd also consider compressing each chromosome (or unit of chromosomes) separately.

rhizome

The Internet Archive has a use case similar to yours. They developed a format that they call ARC. See: http://www.archive.org/web/researcher/ArcFileFormat.php They explicitly separate the index of files in the archive from the archive itself. (Note that there are two archive formats with the extension .arc.)

bdc34

The root of the problem is that the tar format does not maintain an index of the files in the archive; extracting "FileOfInterest" means seeking through the archive one file header at a time looking for the right file name. zip or 7zip archives have indexes, and will work much better.

(FWIW, if you ditch the bz2 on the individual files and store them in a 7zip archive, there's a good chance you'll make some massive space savings; 7zip really shines when redundancy in data spans multiple files in the archive, which I suspect may be the case for you. And you'd get the indexing win.)

http://ask.metafilter.com/131322/Efficiently-retrieving-a-file-from-an-archive#1876420

How does one directory per table not provide this functionality? Why the need to wrap all the files into one big file? I'll assume you've got your reasons, but it's not clear what they are.
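To make the missing-index point concrete, here is a rough sketch of the kind of "catalog" the question asks about, built beside a plain tar with Python's standard tarfile module. It assumes the tar itself is uncompressed (only the members are bz2) and that the members are ordinary, non-sparse files; the function names build_index and read_member and the demo file names are made up for illustration. The one slow sequential scan still happens, but only once, and the resulting name -> (offset, size) map could be saved next to the archive.

```python
import io
import os
import tarfile
import tempfile


def build_index(tar_path):
    """Scan an uncompressed tar once; map member name -> (data offset, size)."""
    index = {}
    with tarfile.open(tar_path, "r:") as tar:  # "r:" means: no compression layer
        for member in tar:
            if member.isfile():
                # offset_data is where the member's bytes start in the tar file
                index[member.name] = (member.offset_data, member.size)
    return index


def read_member(tar_path, index, name):
    """Fetch one member with a single seek instead of walking every header."""
    offset, size = index[name]
    with open(tar_path, "rb") as f:
        f.seek(offset)
        return f.read(size)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        tar_path = os.path.join(tmp, "demo.tar")
        # Build a tiny stand-in archive; real members would be multi-GB bz2 files.
        with tarfile.open(tar_path, "w") as tar:
            for name, payload in [("chr1.bz2", b"A" * 1000), ("chr2.bz2", b"B" * 2000)]:
                info = tarfile.TarInfo(name)
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))

        index = build_index(tar_path)
        data = read_member(tar_path, index, "chr2.bz2")
        print(len(data))
```

For multi-GB members you would copy the byte range out in chunks rather than read it all into memory, and the index has to be rebuilt whenever the tar changes. This is a workaround layered on tar, not something tar or cpio provide themselves.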

buxtonbluecat

When I dealt with this sort of thing, I built a filter that split the input and bzipped the pieces independently. It would produce files like largearchive.tar.aaa.bz2. Block headers let tar pick up in the middle. I'd split the files at CD size (700 MB) and also let the filter tee into a "tar tvf" to keep a catalog. Restoring was a matter of piping the relevant files to bzip2, then tar.

Pronoiac

We already have a naming and organizational scheme for these files (which is tied closely to how our lab data browser operates) as well as packaging/unpackaging tools which are used to work with these bundles. Adding version control and changing filesystems or naming schemes wouldn't solve the root of this specific problem and would likely introduce several new and larger headaches. I'll be honest and say that these three approaches are probably non-starters. A database is great for random access and we use this for visualizing data, but for storage and performance reasons, lossy compression is used for some of the data put into the database. To get to the true data values we need to handle packaging of files that other institutions have available and we need to be able to use reasonably standard and/or open-source UNIX tools and procedures to do this, which motivates my question. (Additionally, filesystem access lets us reduce load on our already overburdened database.) It sounds like an index-capable archival tool like 7z or zip may help solve this issue. Thanks to all for your advice!

Blazecock Pileon

rdiff-backup? Really it sounds like you should just be using a directory structure and some tools to enforce it, rather than relying on tar to archive things. If you are always pulling individual items out of the archives, they probably shouldn't be tarred up in the first place. Version control (svn, git, etc.) or just a plain directory structure with some supporting scripts to index, checksum, and so on would probably make more sense. Or a database.

TravellingDen

Two people said this already but I'll concur. Try zip -0 and unzip. zip -0 (that's a zero) will turn off the compression to avoid wasting time on that, given that your contents are already really well compressed. You can unzip a single file to stdout with unzip -p file.zip filetoextract. It keeps a catalog at the end of the file so should be efficient to get files out of large archives, hopefully only a few seeks. That's not to say that there aren't implementation bugs.
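The same zip -0 / unzip -p idea can be sketched with Python's standard zipfile module, in case a scripted version is useful. ZIP_STORED is the "no compression" setting, and opening one member goes through the catalog at the end of the file rather than scanning the whole archive. The function names pack_stored and stream_member and the member names below are invented for the example.

```python
import io
import os
import shutil
import tempfile
import zipfile


def pack_stored(zip_path, files):
    """Write files (name -> bytes) without compression, like zip -0."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_STORED) as zf:
        for name, payload in files.items():
            zf.writestr(name, payload)


def stream_member(zip_path, name, out):
    """Copy a single member to a writable binary stream, like unzip -p."""
    with zipfile.ZipFile(zip_path) as zf:   # reads the catalog at the end of the file
        with zf.open(name) as member:       # jumps to that member's data
            shutil.copyfileobj(member, out)  # chunked copy, so large members are fine


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        zip_path = os.path.join(tmp, "demo.zip")
        # Tiny stand-ins for the already-compressed bz2 files.
        pack_stored(zip_path, {"chr1.bz2": b"A" * 1000, "chr2.bz2": b"B" * 2000})

        buf = io.BytesIO()
        stream_member(zip_path, "chr2.bz2", buf)
        print(len(buf.getvalue()))
```

Passing sys.stdout.buffer as out would let the result be piped into another command, which is what unzip -p does on the command line.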

sergent

But in general it looks pretty much built for file-oriented operation rather than stream-oriented. If that's the case, that's not going to be as helpful as other tools (we like to pipe stuff between commands as much as possible). But it might be useful for other work, so thanks for pointing it out!

Blazecock Pileon

"I'd also consider compressing each chromosome (or unit of chromosomes) separately."

To reiterate, the archive is made up of bz2 (bzip2) files.

Blazecock Pileon
