quick and dirty guide...
...to tar and gzip/bzip2
Another issue I see on the message boards quite frequently is noobs trying to get their heads around the tar archiving utility and the gzip/bzip2 compression utilities. It pays to keep in mind that in the Windows world both of these jobs are packaged together in the .zip format. In the Linux world, as with most things, more control comes at the price of more complexity, as the two issues of archiving and compression are separated.
I guess the first thing we should do is discuss the difference. Compression is a means to shrink the size of a file in bytes. The technical aspects of how compression works are a bit beyond the scope of this guide, so suffice it to say that the computer uses an algorithm to find redundant (repeated) data and store it in a more compact form. Archiving, on the other hand, is the act of combining several files together into one, for ease of backup and distribution, all the while keeping the individual file attributes and permissions intact.
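To make the difference concrete, here is a minimal sketch that archives a few files and then compresses the archive, printing the size at each step. The directory and file names are made up for illustration, and the exact byte counts will vary from system to system.

```shell
# Work in a throwaway directory so nothing real is touched.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Make three small, highly repetitive text files (hypothetical names).
mkdir demo
for n in 1 2 3; do
    yes "the same line over and over" | head -n 2000 > "demo/file$n.txt"
done

# Archiving: many files become one, but the result is no smaller than the inputs.
tar cf demo.tar demo
raw_size=$(cat demo/file*.txt | wc -c)
tar_size=$(wc -c < demo.tar)

# Compression: the same archive in far fewer bytes.
# -c writes to stdout, so demo.tar itself is left alone.
gzip -c demo.tar > demo.tar.gz
gz_size=$(wc -c < demo.tar.gz)

echo "files: $raw_size bytes, tar: $tar_size bytes, tar.gz: $gz_size bytes"
```

The tar file comes out slightly larger than the files it holds, because tar adds a header for each entry and pads the archive. Only the gzip step actually shrinks anything.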
what's with the file extensions?
You have probably seen a lot of different file extensions when trying to download new apps, and perhaps been a bit confused as to which to choose. They probably had some g's, b's, and t's in them, right? There are only three basic filetypes when using GNU/Linux compression and archiving tools. Let's have a look:
filename.tar
This is a standard tar archive. Tar is short for tape archive, a throwback to the old days of backing up hard drives to tape.
filename.gz
This is a file compressed with gzip (GNU zip), which uses the DEFLATE compression algorithm. At this time, it is the most common compression method used.
filename.bz2
This is a file compressed with the newer bzip2 compression algorithm (if anyone knows what the 'b' stands for please let me know). Bzip2 is a more efficient algorithm which generally results in smaller file sizes. Many of the major FTP archive sites are switching to bzip2 to save disk space and bandwidth.
Of course, the most common case is that tar is used together with a compression tool, which results in file extensions such as .tar.gz, .tar.bz2, and .tgz. It is important to keep in mind that these file extensions are for the benefit of humans, not computers. There is nothing to stop you but convention from naming your tar archives with say, the .joe filename extension. It will still untar just fine.
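A quick way to convince yourself that the extension is only a label is to try it. This is a minimal sketch; the notes directory and the .joe name are just illustrative.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

mkdir notes
echo "hello" > notes/readme.txt

# Name the archive with an arbitrary extension; tar does not care.
tar cf notes.joe notes

# Listing it works exactly as it would for notes.tar.
listing=$(tar tf notes.joe)
echo "$listing"
```

Tar looks at the contents of the file, not its name, so the listing shows the directory and the file inside it.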
using tar
The format of using tar on the command line is:
tar function[options] files...
You do not need to use the usual '-' after tar, because the first argument to the tar command is what is called the function, rather than an option. You can use the dash if you like though. The most common functions used with tar are:
- c: to create a new archive
- x: to extract files from an archive (untar them)
- t: to list the contents of a tar archive
- r: to add more files to an existing archive
There are, of course, more functions available for more esoteric tasks, which you can discover with man tar if you are really curious. The most important option is the 'f' flag, which must be specified right before the filename of the archive you intend to act on. There are other options as well, but I do not want to give them away now lest I ruin the surprise later. Now I want you to find a directory somewhere in your home directory, preferably one with 4-5 files, so we can have some working examples. As an example I will use a directory where I keep some of the Python scripts I am working on:
[bulliver@badcomputer scripts]$ ls -l python
-rwxr-xr-x 1 bulliver bulliver   62 Dec 26 16:08 exp1.py
-rwxr-xr-x 1 bulliver bulliver  110 Feb 12 02:42 exp2.py
-rwxr-xr-x 1 bulliver bulliver 5549 Jan  9 02:10 find_mu.py
-rwxr-xr-x 1 bulliver bulliver 3433 Jan  9 18:20 find_mu2.py
-rw-r--r-- 1 bulliver bulliver 3415 Jan 29 03:58 find_mu2.txt
-rwxr-xr-x 1 bulliver bulliver 9285 Jan 29 03:58 hockey_pool.py
Now we want to combine these six files into a single archive, so we'll use the 'c' function. It is also wise to keep all the files you intend to archive contained in a directory. If you have ever unpacked a tar file you downloaded from the internet and had the files explode all over the current directory you will understand why it is good form to package all of your archives in a top-level directory.
[bulliver@badcomputer scripts]$ tar cvf python.tar python
python/
python/hockey_pool.py
python/find_mu.py
python/exp1.py
python/exp2.py
python/find_mu2.py
python/find_mu2.txt
[bulliver@badcomputer scripts]$ ls -l
drwxr-xr-x 2 bulliver bulliver   288 Feb  4 18:59 bash
drwxr-xr-x 2 bulliver bulliver   392 Jan 29 03:57 c
drwxr-xr-x 2 bulliver bulliver   224 Feb 12 00:32 python
-rw-r--r-- 1 bulliver bulliver 20480 Feb 16 00:38 python.tar
drwxr-xr-x 2 bulliver bulliver    48 Feb  4 18:14 test
There are a couple of things to notice here. First of all, the 'v' (verbose) option gives us a list of the files that tar is archiving. Keep in mind that tar works recursively: if there were any directories in 'python' they would have been added to the archive, along with any deeper levels of files and directories. Another thing to notice is that tar left our original python directory intact. Now let's have a look inside our archive:
[bulliver@badcomputer scripts]$ tar tvf python.tar
drwxr-xr-x bulliver/bulliver    0 2003-02-12 00:32 python/
-rwxr-xr-x bulliver/bulliver 9285 2003-01-29 03:58 python/hockey_pool.py
-rwxr-xr-x bulliver/bulliver 5549 2003-01-09 02:10 python/find_mu.py
-rwxr-xr-x bulliver/bulliver   62 2002-12-26 16:08 python/exp1.py
-rwxr-xr-x bulliver/bulliver  110 2003-02-12 02:42 python/exp2.py
-rwxr-xr-x bulliver/bulliver 3433 2003-01-09 18:20 python/find_mu2.py
-rw-r--r-- bulliver/bulliver 3415 2003-01-29 03:58 python/find_mu2.txt
As you can see, all of the files we archived have retained their permissions and timestamps. To extract the archive we would use:
[bulliver@badcomputer scripts]$ tar xvf python.tar
python/
python/hockey_pool.py
python/find_mu.py
python/exp1.py
python/exp2.py
python/find_mu2.py
python/find_mu2.txt
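The 'r' function from the list above does not get a walkthrough of its own, so here is a minimal sketch of appending to an existing archive. The file names and contents are placeholders, not the real scripts from the listing. Note that 'r' only works on a plain, uncompressed tar archive.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Start with an archive holding a single placeholder script.
mkdir python
echo 'print("one")' > python/exp1.py
tar cf python.tar python

# A new file shows up later; add it with the 'r' function.
echo 'print("two")' > python/exp2.py
tar rf python.tar python/exp2.py

# List the archive to confirm both files are in there.
contents=$(tar tf python.tar)
echo "$contents"
```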
using gzip and bzip2
As mentioned before, gzip and bzip2 do basically the same thing using two different methods, so which should we use? Far be it from me to tell you, so I'll just say that a good rule of thumb is that if the archive is small, say less than 3-5 MB, you should use gzip, and if it is larger, use bzip2. Why? Well, bzip2 is more efficient, but takes longer to compress/decompress, so only use it if it will actually make a noticeable difference; i.e. the size difference between a 1 MB tar archive compressed with gzip and with bzip2 is negligible, whilst the difference for a 50 MB archive can be quite pronounced. That being said, just use whichever you prefer :).
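If you would rather measure than take a rule of thumb on faith, something like the following sketch compares the two on the same input. The sample file is generated stand-in data rather than a real tar archive, so treat the numbers as a rough illustration only; try it on your own files to see what the difference looks like for you.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Stand-in data: a few hundred thousand lines of numbers.
seq 1 300000 > sample.txt
orig_size=$(wc -c < sample.txt)

# -c writes the compressed stream to stdout, leaving sample.txt in place.
gzip -c sample.txt > sample.txt.gz
gz_size=$(wc -c < sample.txt.gz)
echo "original: $orig_size bytes"
echo "gzip:     $gz_size bytes"

# bzip2 is not installed everywhere, so only run it if it is present.
if command -v bzip2 >/dev/null 2>&1; then
    bzip2 -c sample.txt > sample.txt.bz2
    bz_size=$(wc -c < sample.txt.bz2)
    echo "bzip2:    $bz_size bytes"
fi
```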
Compressing files is pretty easy. Here are a few examples:
[bulliver@badcomputer cruft]$ ls
bg_1.txt bg_2.txt bg_3.txt pangram.txt screenie.txt
[bulliver@badcomputer cruft]$ gzip pangram.txt
[bulliver@badcomputer cruft]$ bzip2 screenie.txt
[bulliver@badcomputer cruft]$ ls
bg_1.txt bg_2.txt bg_3.txt pangram.txt.gz screenie.txt.bz2
Notice that the original files are not preserved. Uncompressing the files is just as simple:
[bulliver@badcomputer cruft]$ gunzip pangram.txt.gz
[bulliver@badcomputer cruft]$ bunzip2 screenie.txt.bz2
[bulliver@badcomputer cruft]$ ls
bg_1.txt bg_2.txt bg_3.txt pangram.txt screenie.txt
...and our files are back as new. You can compress pretty much any type of file, but you won't get good results with files that are already compressed in some form such as .gif, .jpg, .mp3, .mpeg, and others. You can only combine so many redundant bytes.
You can even read plain text files while they are compressed using zcat and bzcat. Let's assume our files are compressed once more:
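The point about already-compressed files is easy to check for yourself. In this minimal sketch, random bytes stand in for something like a .jpg or .mp3 (an assumption for the sake of the demo, since both have little redundancy left), and a very repetitive text file stands in for ordinary compressible data.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Repetitive text: lots of redundancy to squeeze out.
yes "all work and no play" | head -n 5000 > repetitive.txt

# Random bytes: a stand-in for data that is already compressed.
head -c 100000 /dev/urandom > random.bin

# -c writes to stdout so the originals stay around for comparison.
gzip -c repetitive.txt > repetitive.txt.gz
gzip -c random.bin > random.bin.gz

# Print the size of each file, before and after.
for f in repetitive.txt repetitive.txt.gz random.bin random.bin.gz; do
    printf '%s: %s bytes\n' "$f" "$(wc -c < "$f")"
done
```

The repetitive file shrinks to a tiny fraction of its size, while the random one barely changes and may even grow by a few bytes.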
[bulliver@badcomputer cruft]$ zcat pangram.txt.gz
Darren's python panagram program found this sentence which contains exactly nine 'a's, one 'b',
five 'c's, four 'd's, thirty-six 'e's, eleven 'f's, three 'g's, ten 'h's, fifteen 'i's, one 'j',
one 'k', three 'l's, three 'm's, twenty-nine 'n's, sixteen 'o's, four 'p's, one 'q', fifteen 'r's,
thirty-one 's's, twenty-one 't's, six 'u's, four 'v's, four 'w's, six 'x's, seven 'y's, and one 'z'.
[bulliver@badcomputer cruft]$ bzcat screenie.txt.bz2
This example isn't nearly as interesting!
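Since zcat and bzcat simply write the uncompressed text to standard output, the result can be piped into other tools while the file on disk stays compressed. Here is a minimal sketch using gzip -dc, which does the same job as zcat on most Linux systems; the file is recreated here just so the example stands on its own.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "This example isn't nearly as interesting!" > screenie.txt
gzip screenie.txt            # leaves only screenie.txt.gz behind

# Decompress to stdout without touching the file on disk.
text=$(gzip -dc screenie.txt.gz)
echo "$text"

# Because the output is just a stream, it can be piped into other tools.
matches=$(gzip -dc screenie.txt.gz | grep -c "interesting")
echo "lines mentioning 'interesting': $matches"
```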
Getting back to our tar archive, using gzip we would compress it thusly:
[bulliver@badcomputer scripts]$ gzip python.tar
and using bzip2:
[bulliver@badcomputer scripts]$ bzip2 python.tar
That's it! Of course there are numerous options available to both commands, but I will let you discover them by reading the man pages.
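One option worth knowing early, since gzip and bzip2 replace the original file by default, is -c. It writes the compressed data to standard output, so you can redirect it to a new file and keep the original. A minimal sketch with a made-up file name (bzip2 accepts -c in the same way):

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

echo "some text worth keeping" > notes.txt

# -c sends the compressed data to stdout; the redirect picks the output name.
gzip -c notes.txt > notes.txt.gz

# Both the original and the compressed copy now exist.
ls
```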
putting it all together
So how do you deal with the foo.tar.gz or foo.tar.bz2 file you just downloaded from the internet? You could extract it manually:
[bulliver@badcomputer bulliver]$ gunzip foo.tar.gz
[bulliver@badcomputer bulliver]$ tar xvf foo.tar
That seems a bit tacky though. If you're a guru you might string it together using zcat and a pipe, and take advantage of unix filestreams:
[bulliver@badcomputer bulliver]$ zcat foo.tar.gz | tar xvf -
That seems a bit esoteric though. If you're like the rest of us you just take advantage of GNU tar's decompression support, which is built right in. This means you can extract and uncompress the archive with one command, instead of dealing with two separate steps. When you pass the 'z' option to tar it will uncompress using gzip, and with the 'j' option it will uncompress using bzip2. So:
[bulliver@badcomputer scripts]$ tar xzf foo.tar.gz
[bulliver@badcomputer scripts]$ tar xjf foo.tar.bz2
Conversely, you can of course create your own compressed archives with one command as well. Just replace the extract function with 'c' for create:
[bulliver@badcomputer scripts]$ tar czf foo.tar.gz foo/
[bulliver@badcomputer scripts]$ tar cjf foo.tar.bz2 foo/
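Putting the one-step commands together, here is a minimal round trip: create a compressed archive, list it, and unpack it somewhere else. The foo directory and its contents are placeholders, and the -C option (which tells tar to change into a directory before extracting) is an addition not covered earlier in this guide.

```shell
# Work in a throwaway directory.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A hypothetical project directory to archive.
mkdir foo
echo "hello from foo" > foo/hello.txt

# Create a gzip-compressed archive in one step.
tar czf foo.tar.gz foo/

# List the contents without extracting ('t' instead of 'x').
tar tzf foo.tar.gz

# Extract into a separate directory so the original foo/ is not overwritten.
mkdir restore
tar xzf foo.tar.gz -C restore

restored=$(cat restore/foo/hello.txt)
echo "$restored"
```

Swapping 'z' for 'j' and .gz for .bz2 gives the bzip2 equivalent, provided bzip2 is installed.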
I hope this guide has helped you to understand archiving and compression. If you have any questions/comments please feel free to contact me.
This page last modified: April 14, 2008 11:28 am
credits
This page, and all pages on this site were created and are maintained by Darren Kirby using valid XHTML 1.0 and CSS, and are ©copyright 2002 - 2008. The Penguin image was created by Tukka, and is used by permission. Inspiration for the look of this site was provided by Eric A. Meyer's CSS gallery. This website runs on Gentoo Linux. It is served by Apache. PHP and MySQL hold together the backend.