
Errors

No memory system is perfect. Certainly your computer's memory won't mislay its gloves or forget your birthday, but it can suffer little slips, an errant bit here and there. Although one bit in a few million might not seem like a big deal, it's enough to send your system into a tailspin, or worse, alter the answer to some important calculation, jogging the decimal point a few places to the left.

Causes

The memory errors that your computer is likely to suffer fall into two broad classes: soft errors and hard errors. Either can leave you staring at an unflinching screen, sometimes but not always emblazoned with a cryptic message that does nothing to help you regain the hours' work irrevocably lost. The difference between them is transience. Soft errors are little more than disabling glitches that disappear as fast as they come. Hard errors linger until you take a trip to the repair shop.

Soft Errors

For your computer, a soft memory error is an unexpected and unwanted change: something in memory turns up different from what it's supposed to be. One bit in a memory chip may suddenly, randomly change state, or a glitch of noise inside your system may get stored as if it were valid data. In either case, one bit takes on an errant value, possibly changing an instruction in a program or a data value.

With a soft error, the change appears in your data rather than hardware. Replace or restore the erroneous data or program code, and your system will operate exactly as it always has. In general, your system needs nothing more than a reboot—a cold boot being best to gain the assurance of your computer's self-test of its circuits (including memory). The only damage is the time you waste retracing your steps to get back to the place in your processing at which the error occurred. Soft errors are the best justification for the sage advice, "Save often."

Most soft errors result from problems either within memory chips themselves or in the overall circuitry of your computer. The underlying mechanism behind these two types of soft errors is entirely different.

Chip-Level Errors

The errors inside memory chips are almost always a result of radioactive decay. The problem is not nuclear waste (although nuclear waste is a problem) but something even more devious. The culprit is the epoxy of the plastic chip package, which, like most materials, may contain a few radioactive atoms. Occasionally one of these minutely radioactive atoms will spontaneously decay and shoot an alpha particle into the chip. (There are a number of radioactive atoms in just about everything—they don't amount to very much, but they are there. And by definition, a radioactive atom will spontaneously decay sometime.) An alpha particle is a helium nucleus, two protons and two neutrons, having a small positive charge and a lot of kinetic energy. If such a charged particle hits a memory cell in the chip, its charge and energy can cause the cell to change state, blasting the memory bit it contains to a new and different value. This miniature atomic blast is not enough to damage the silicon structure of the chip itself, however.

Whether a given memory cell will suffer this kind of soft error is unpredictable, just as it is impossible to predict whether any given radioactive atom will decay. When you deal with enough atoms, however, this unpredictability becomes a probability, and engineers can predict how often one of the memory cells in a chip will suffer such an error. They just can't predict which one.

In the early days of computers, radioactive decay inside memory chips was the most likely cause of soft errors. Thanks to improved designs and technology, each generation of memory chip has become more reliable, whether you measure per bit or per chip. For example, any given bit in a 16KB chip might suffer a decay-caused soft error every billion or so hours; the likelihood that any given bit in a modern 16MB chip will suffer such an error is on the order of once in two trillion hours. In other words, each bit of a modern memory chip is roughly 2000 times more reliable than one in a first-generation chip, enough of an improvement that even with a thousand-fold increase in capacity, the chip as a whole errs less often than its early predecessors. Although conditions of use influence the occurrence of soft errors, the error rate of modern memory is such that a typical computer with 8MB of RAM would suffer a decay-caused soft error once in 10 to 30 years. The probability is so small that many computer-makers now ignore it.

System-Level Errors

Sometimes the data traveling through your computer gets hit by a noise glitch—on the scale of memory cells, a little glitch can be like you getting struck by lightning. Just as you might have trouble remembering things after such a jolt, so does the memory cell. If a pulse of noise is strong enough and occurs at an especially inopportune instant, your computer can misinterpret it as a data bit. Such a system-level error has the same effect on your computer as a soft error in memory. In fact, some system-level errors may be reported as memory errors (for example, when the glitch appears in the circuitry between your computer's memory chips and the memory controller).

The most likely place for system-level soft errors to occur is on your computer's buses. A glitch on a data line can cause your computer to try to use or execute a bad bit of data or program code, thus causing an error. Or your computer could load the bad value into memory, saving it to relish (and crash from) at some later time. A glitch on the address bus will make your computer similarly find the wrong bit or byte, and the unexpected value may have exactly the same effects as a data bus error.

The probability of a system-level error occurring depends on the design of your computer. A careless designer can leave your system not only susceptible to system-level errors but even prone to generating the glitches that cause them. A design pushed to run faster than its circuits can reliably handle is particularly prone to such problems. You can do nothing to prevent system-level soft errors other than to choose your computer wisely.

Hard Errors

When some part of a memory chip actually fails, the result is a hard error. For instance, a jolt of static electricity can wipe out one or more memory cells. The initial symptom is the same as with a soft error—the memory error may garble your results or crash your system entirely. The operative difference is that a hard error doesn't go away when you reboot your system. In fact, your machine may not pass its memory test when you try to start it up again. Alternatively, you may encounter repeated, random errors when a memory cell hovers between life and death.

Hard errors require attention. The chip or module in which the error originates needs to be replaced.

Note, however, that operating memory beyond its rated speed often produces the same symptoms as a hard error. You can sometimes clear up such problems by adding wait states to your system's memory cycles, a setting many computers let you control as part of their advanced setup procedure. This will, of course, slow down the operation of your computer so that it can accommodate the lagging memory. The better cure is to replace the too-slow memory with memory that can handle the speed.

Detection and Prevention

Most computers check every bit of their memory for hard errors every time you switch your system on or perform a cold boot, although some computers give you the option of bypassing this initial memory check to save time. Soft errors are another matter entirely. They rarely show up at boot time. Rather, they are likely to occur at the worst possible moment—which means just about any time you're running your computer.

Computer-makers use two strategies to combat memory errors: parity and detection/correction. Either one helps guard the integrity of your system's memory. Which is best—or whether you need any error compensation at all—is a personal choice. Manufacturers consider the memory systems in most modern computers so reliable that they opt to save a few bucks and omit any kind of error detection. Where accuracy really counts, as in servers, network managers usually insist their computers have the best error prevention possible by using memory that supports error-correction code.

Parity

When memory chips were of dubious reliability, computer manufacturers followed the lead of the first computers and added an extra bit of storage to every byte of memory. This extra bit, called a parity check bit, allows the computer to verify the integrity of the data stored in memory. Using a simple algorithm, the parity check bit lets a computer confirm that a given byte of memory still holds the number of binary ones it should. If the count changes, your computer knows an error occurred.

When a microprocessor writes a byte to memory, the value stored in the parity check bit is set either to a logical one or zero in such a way that the total of all nine bits storing the byte is always odd. Every time your computer reads a given portion of memory, the memory controller totals up the nine bits storing each byte, verifying that the overall total (including the parity check bit) remains odd. Should the system detect a total that's even, it immediately knows that something has happened to cause one bit of the byte to change, making the stored data invalid.
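
The arithmetic is simple enough to sketch in a few lines of code. The following Python fragment is purely illustrative (no memory controller runs Python, and the function names are invented here); it implements odd parity as just described, showing how a single flipped bit trips the check while a double flip slips past it.

    def odd_parity_bit(byte):
        # Choose the check bit so that all nine bits together hold an odd number of ones.
        ones = bin(byte & 0xFF).count("1")
        return 0 if ones % 2 == 1 else 1

    def parity_ok(byte, check_bit):
        # A stored byte is presumed good if the nine-bit total is still odd.
        return (bin(byte & 0xFF).count("1") + check_bit) % 2 == 1

    value = 0b10110010                 # four ones, so the check bit must be 1
    check = odd_parity_bit(value)

    assert parity_ok(value, check)                   # intact byte passes
    assert not parity_ok(value ^ 0b00000100, check)  # one flipped bit is caught
    assert parity_ok(value ^ 0b00000110, check)      # two flipped bits go unnoticed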

The philosophy behind parity memory is that having bad data is worse than losing information through a system crash. One bad bit results in a halt to all computing. You won't make a mistake—and you won't like it when your computer's memory does. Parity errors happened rarely, but when they did, the dead stop they forced led most people to question the value of parity checking. Consequently, few people minded when manufacturers started omitting it from their designs.

Fake Parity

Fake parity memory is a means of cutting the cost of memory modules for computers with built-in parity checking. Instead of actually performing a parity check of the memory on the module, the fake parity system always sends out a signal indicating that memory parity is good. Because no extra bits of storage are needed, the vendors of fake parity modules don't have to pay for the extra memory chips.

Fake parity has two downsides. First, the cost savings often are not passed down to you, at least not explicitly; fake parity modules are often sold as ordinary parity modules with no indication of the underlying shortchange on technology. Second, a fake parity module offers no protection at all: erroneous data can pass through your computer undetected, and a single fake parity module defeats the operation and purpose of the entire parity memory system. Fortunately, with the decline in the use of actual parity memory, the incentive to sell fake parity memory has disappeared.

Detection/Correction

Parity checking can do no more than detect a single-bit error in a byte; it cannot tell which bit went bad. More elaborate error-detection schemes can detect larger errors. Better still, when properly implemented, these schemes can fix single-bit errors without crashing your system. Called error-correction code (ECC), this scheme requires several extra check bits per data word: five for an 8-bit byte, falling to only eight spread across a full 64-bit bus (see Table 15.4). The additional bits allow your system not only to determine that a memory error occurred but also to locate the single bit that changed so that the error can be reversed. Some people call this technology Error Detection and Correction (EDAC).
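
The text doesn't spell out the particular code, but the classic basis for ECC is the Hamming code: several overlapping parity checks whose pattern of failures adds up to the position of the bad bit. Here is a rough Python sketch, using four check bits on a single byte rather than the wider codes of real 64-bit modules, with invented function names.

    def hamming_encode(byte):
        # Place the 8 data bits at positions 3,5,6,7,9,10,11,12 of a 12-bit codeword;
        # positions 1,2,4,8 hold check bits (index 0 of the list is unused).
        code = [0] * 13
        for i, pos in enumerate((3, 5, 6, 7, 9, 10, 11, 12)):
            code[pos] = (byte >> i) & 1
        for p in (1, 2, 4, 8):
            # Each check bit makes the ones in "its" group of positions total an even number.
            code[p] = sum(code[i] for i in range(1, 13) if i & p) % 2
        return code

    def hamming_correct(code):
        # Re-run the four parity checks; the failed ones sum to the bad bit's position.
        syndrome = sum(p for p in (1, 2, 4, 8)
                       if sum(code[i] for i in range(1, 13) if i & p) % 2)
        if syndrome:
            code[syndrome] ^= 1          # reverse the single-bit error in place
        return syndrome                  # 0 means the word arrived clean

    word = hamming_encode(0b11010010)
    word[6] ^= 1                                # simulate a soft error at position 6
    assert hamming_correct(word) == 6           # the syndrome names the culprit...
    assert word == hamming_encode(0b11010010)   # ...and the data is restored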

Most server computers use ECC memory. In fact, many desktop machines also allow you to use ECC. Support for ECC is built into many chipsets. If your BIOS allows you to turn this facility on, you can use ECC, but you'll also need ECC memory modules (which means tossing out the ones that you have).

Server-makers have another reason for shifting from parity to ECC memory. As the width of the data bus increases, error-correction memory becomes less expensive to implement. In fact, with today's Pentium computers—which have 64-bit data buses—the cost of the extra memory for parity checking and for full error correction is the same. Table 15.4 summarizes the penalty required by parity and ECC technology for various bus widths.

Table 15.4. Comparison of Parity and ECC Memory

                 Extra Bits Required        Cost Increase
    Bus Width    Parity       ECC           Parity       ECC
    8            1            5             12.5%        62.5%
    16           2            6             12.5%        37.5%
    32           4            8             12.5%        25%
    64           8            8             12.5%        12.5%
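
The numbers in Table 15.4 follow from the Hamming rule: with r check bits the system must be able to name any of the data-plus-check bit positions (or signal "no error"), and one more bit is added to detect, though not fix, double-bit errors. The short, illustrative Python calculation below reproduces the table, except that the 32-bit row rounds the theoretical minimum of 7 check bits up to a full 8, as practical designs commonly do.

    def secded_bits(data_bits):
        # Smallest r with 2**r >= data_bits + r + 1, plus one for double-error detection.
        r = 1
        while 2 ** r < data_bits + r + 1:
            r += 1
        return r + 1

    for width in (8, 16, 32, 64):
        parity = width // 8              # one parity bit per byte
        ecc = secded_bits(width)
        print(f"{width:>2}-bit bus: parity {parity} bits ({parity / width:.1%}), "
              f"ECC {ecc} bits ({ecc / width:.1%})")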

Parity and ECC memory are technologies for people who want or need to take every possible precaution. They are the kind of people who check the weather report, then put on their galoshes and carry an umbrella even when the prediction is 100 percent in favor of sun. They are the folks who get the last laugh when they are warm and dry after a once-in-a-century freak thunderstorm rumbles through. These people know that the best solution to computer and memory problems is prevention.
