Data Compression

Just as virtual memory can stretch the capacity of your primary storage system, data compression can boost the capacity of your secondary storage system. Compression eliminates the waste in a storage system. It seeks to put every bit of information in the smallest possible space. In effect, the compression system squeezes the air out of the data stream. Data compression can reduce fat files into their slimmest possible representation, which can later, through a decompression process, be reconstituted into their original form.

Most compression systems work by reducing recurrent patterns in the data stream to short tokens. For example, the two-byte pattern "at" could be coded as a single byte, such as "@", cutting the storage required for each occurrence of that pattern in half. Most compression systems don't permanently assign tokens to bit patterns but instead make the assignments on the fly. They work on individual blocks of data, one at a time, starting afresh with each block. Consequently, the patterns represented by the tokens of one block may be entirely different from those used in the next block. The key for decoding the tokens back into patterns is included as part of the data stream.
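The token idea can be sketched in a few lines of Python. This is a toy illustration, not the method of any real product: the function names, the one-pattern-per-block limit, and the output layout (a flag byte, then the token, then the two-byte pattern, then the encoded body) are all assumptions made for the example. It does show the two points made above: the token is chosen afresh for each block, and the key needed to decode it travels with the data.

```python
from collections import Counter

def compress_block(block: bytes) -> bytes:
    """Replace the most frequent two-byte pattern in one block with a token.

    Assumed layout: flag 0 + raw block, or flag 1 + token + pattern + body.
    """
    pairs = Counter(block[i:i + 2] for i in range(len(block) - 1))
    unused = [b for b in range(256) if b not in block]
    if not pairs or not unused:
        return b"\x00" + block            # nothing to substitute; store as is
    pattern, _ = pairs.most_common(1)[0]  # chosen on the fly for this block
    token = bytes([unused[0]])            # a byte value the block never uses
    body = block.replace(pattern, token)
    return b"\x01" + token + pattern + body

def decompress_block(data: bytes) -> bytes:
    """Rebuild the original block using the key stored in the stream."""
    if data[0] == 0:
        return data[1:]
    token, pattern, body = data[1:2], data[2:4], data[4:]
    return body.replace(token, pattern)

first = compress_block(b"the cat sat at that mat")
second = compress_block(b"0101010101")
print(first[2:4], second[2:4])  # the pattern each block chose for its token
```

Real systems use many tokens per block and only substitute when doing so saves space; this sketch skips that check for brevity.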

Disk compression systems put data compression technology to work by increasing the apparent capacity of your disk drives. Generally, they work by creating a virtual drive with expanded capacity, which you can use as though it were a normal (but larger) disk drive. The compression system automatically takes care of compressing and decompressing your data as you work with it. The information is stored in compressed form on your physical disk drive, which is hidden from you.

The compression ratio, as the term is used here, is the percentage by which compression reduces the storage requirements of the uncompressed data. For example, a compression ratio of 90 percent means the compressed data can be stored in 10 percent of the space required by its original form. Most data-compression systems achieve about a 50-percent compression ratio (often written as 2:1) on the mix of data that most people use.
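As a rough illustration of the arithmetic, the following Python sketch computes the percentage reduction for a sample buffer. The function name and the sample text are made up for the example, and zlib merely stands in for whatever compressor a given system uses. Highly repetitive sample text like this compresses far better than a typical mix of data.

```python
import zlib

def space_savings_percent(original: bytes, compressed: bytes) -> float:
    """Percentage by which compression reduced the storage requirement."""
    return 100.0 * (len(original) - len(compressed)) / len(original)

text = b"the cat sat on the mat. " * 200   # hypothetical sample data
packed = zlib.compress(text)

print(f"{len(text)} bytes -> {len(packed)} bytes, "
      f"{space_savings_percent(text, packed):.1f} percent saved")
```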

Because the compression ratio varies with the kind of data you store, the ultimate capacity of a disk that uses compression is impossible to predict. The available capacity reported by your operating system on a compressed drive is only an estimate based on the assumed compression ratio of the system. You can change this assumption to increase the reported remaining capacity of your disk drive, but the actual remaining capacity (which depends on the data you store, not the assumption) will not change.
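The effect of the assumed ratio on the reported figure can be shown with simple arithmetic. The sketch below assumes the estimate is just the physical free space divided by the fraction of space the data is expected to occupy after compression. Real disk-compression products may compute it differently, and the function name is hypothetical.

```python
def reported_free_bytes(physical_free: int, assumed_savings: float) -> int:
    """Estimate how much uncompressed data should fit in the space left.

    assumed_savings is a fraction: 0.5 means a 50-percent compression ratio.
    """
    if not 0.0 <= assumed_savings < 1.0:
        raise ValueError("assumed_savings must be in the range [0, 1)")
    return round(physical_free / (1.0 - assumed_savings))

physical_free = 100_000_000  # bytes actually left on the physical drive

# Changing the assumption changes only the report, not the physical space.
print(reported_free_bytes(physical_free, 0.50))
print(reported_free_bytes(physical_free, 0.75))
```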

Most compression systems assume that you want to get back every byte and every bit you store. You don't want numbers disappearing from your spreadsheets or commands from your programs. You assume that decompressing the compressed data will yield everything you started with—without losing a bit. The processes that deliver that result are called lossless compression systems.
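A lossless round trip is easy to check. This minimal Python sketch uses the standard zlib module as one example of a lossless compressor; the sample spreadsheet-style data is invented for the illustration.

```python
import zlib

original = b"Q1,1200.50\nQ2,1350.75\nQ3,1410.00\n" * 50  # hypothetical data
packed = zlib.compress(original)
restored = zlib.decompress(packed)

# Lossless means the restored bytes match the original exactly.
print(restored == original, len(original), len(packed))
```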

Sometimes, however, your data may contain more detail than you need. For example, you might scan a photo with a true-color 24-bit scanner and display it on an ordinary VGA system with a color range of only 256 hues. All the precise color information in your scan is wasted on your display, and the substantial disk space you use for storing it could be put to better use.

Digitized analog images and audio recordings often contain subtle nuances beyond the perception of most people. Some data-reduction schemes, called lossy compression systems, ignore these fine nuances. The reconstituted data does not exactly replicate the original, but for viewing or listening the restored data is often good enough. Because lossy compression systems can discard detail, their resultant compression ratios are higher than those of lossless schemes, and they can often work faster. They are therefore often used in time- and space-sensitive applications, such as digital image and sound storage.
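One very simple way to discard fine nuances is to keep only the most significant bits of each sample. The Python sketch below is an assumption-laden toy, not how any real image or audio codec works: it keeps the top four bits of each 8-bit sample and packs two samples into one byte, halving the storage. The function names are hypothetical, and the sketch expects an even number of samples.

```python
def lossy_pack(samples: bytes) -> bytes:
    """Keep the top 4 bits of each 8-bit sample, two samples per byte."""
    if len(samples) % 2:
        raise ValueError("this sketch expects an even number of samples")
    return bytes((samples[i] & 0xF0) | (samples[i + 1] >> 4)
                 for i in range(0, len(samples), 2))

def lossy_unpack(packed: bytes) -> bytes:
    """Expand each byte back into two samples; the low 4 bits are gone."""
    out = bytearray()
    for b in packed:
        out.append(b & 0xF0)
        out.append((b & 0x0F) << 4)
    return bytes(out)

samples = bytes([200, 201, 17, 18])   # hypothetical 8-bit samples
packed = lossy_pack(samples)
restored = lossy_unpack(packed)

# The restored values are close to the originals but not identical.
print(list(samples), list(restored))
```

Each restored sample differs from its original by less than 16, which is the detail the four discarded bits carried.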

Compression has proved to be such a valuable technology that it is used in other ways besides increasing disk storage. For example, advanced modem protocols often include data compression to increase throughput levels. In addition, file-archiving software, such as the popular program PKZip, also takes advantage of compression to more effectively use your disk's space.

File-compression and file-archiving software differs from ordinary disk compression in several ways. It works on a file-by-file basis rather than across a complete disk. It is not automatic; you manually select the files you want to archive and compress. The archiving software does not work on the fly but instead executes upon your command. It compresses files individually but can package several files together into a single archive file. Archive files are stored as ordinary Windows files, but their contents can be read or executed only after they have been decompressed.
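The packaging step can be sketched with Python's standard zipfile module, which reads and writes ZIP archives. The file names and contents here are invented, and the archive is built in memory only to keep the example self-contained; a real archive would normally be written to disk.

```python
import io
import zipfile

# Hypothetical files to package together.
files = {
    "report.txt": b"quarterly totals\n" * 100,
    "notes.txt": b"call the vendor\n" * 100,
}

# Compress each file individually and store them all in one archive.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name, data in files.items():
        archive.writestr(name, data)

# The contents are usable again only after they are decompressed.
with zipfile.ZipFile(io.BytesIO(buffer.getvalue())) as archive:
    names = archive.namelist()
    restored = {name: archive.read(name) for name in names}

print(names, len(buffer.getvalue()))
```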

Because these archiving systems do not compress on the fly, they can spend extra time to optimize their compression (for example, trying several compression algorithms to find the most successful one). Because they are not time sensitive, they can also apply more complex compression methods than on-the-fly software can afford, so they can often achieve higher compression ratios than standard disk-compression software. Files they have already compressed gain nothing from disk compression, however: your disk compression software can't squeeze their contents any tighter.
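The strategy of trying several algorithms and keeping the best result might look like the following Python sketch. It is not how PKZip or any particular archiver chooses its method. It simply runs three compressors from the Python standard library (zlib, bz2, and lzma) over the same data and keeps the smallest output. The function name and sample data are hypothetical.

```python
import bz2
import lzma
import zlib

def best_compression(data: bytes) -> tuple[str, bytes]:
    """Try several compressors and return the name and output of the smallest."""
    candidates = {
        "zlib": zlib.compress(data, 9),
        "bz2": bz2.compress(data, 9),
        "lzma": lzma.compress(data),
    }
    name = min(candidates, key=lambda n: len(candidates[n]))
    return name, candidates[name]

data = b"the cat sat on the mat. " * 400   # hypothetical sample data
name, packed = best_compression(data)
print(f"{name} won: {len(data)} bytes -> {len(packed)} bytes")
```

Running every candidate costs extra time, which is acceptable for an archiver that executes on command but not for software that must compress on the fly.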

Perhaps the most popular application for file-compression software is preparing files for transmission by modem. It allows you to package together a group of related files, shrink them substantially, and conveniently ship them off with a single send command. Of course, the resultant files will not be further compressible, so the modem's own data compression gains nothing. Your modem will appear to operate at a slower speed than it does with uncompressed files because it passes along the compressed data byte for byte.

Most lossless compression systems use essentially the same compression methods, often even the same algorithms. As a result, applying more than one of them to the same data is counterproductive. After you have compressed data, you cannot squeeze it again (at least using the same algorithm). Layering multiple levels of compression won't yield more space and may in fact waste space, and it costs processing time at every level.
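The futility of layering is easy to observe. The Python sketch below compresses some sample data with zlib and then compresses the result again with the same algorithm. The sample data (pseudo-random words generated from a fixed seed) is an assumption chosen so that the text is compressible without being trivially repetitive. On data like this, the second pass typically produces output slightly larger than its input.

```python
import random
import zlib

# Hypothetical sample data: pseudo-random words from a small vocabulary.
rng = random.Random(42)
words = ["disk", "drive", "token", "block", "modem", "file", "byte", "data"]
data = " ".join(rng.choice(words) for _ in range(5000)).encode("ascii")

once = zlib.compress(data)
twice = zlib.compress(once)   # a second layer using the same algorithm

print(f"original: {len(data)}  once: {len(once)}  twice: {len(twice)}")

# Both layers must be undone, in reverse order, to get the data back.
restored = zlib.decompress(zlib.decompress(twice))
```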
