Error detection and correction

Jeff Layton

Data protection and checking take place in various places throughout a system. Some of it is in hardware and some of it is in software. The goal is to ensure that data is not corrupted (changed), either coming from or going to the hardware or in the software stack. One key technology is ECC (error-correcting code) memory.


The standard ECC memory used in systems today can detect and correct what are called single-bit errors, and although it can detect double-bit errors, it cannot correct them. A simple flip of one bit in a byte can make a drastic difference in the value of the byte.

ECC memory can detect the problem and correct it without the user being aware.


Notice, however, that only one bit in the byte has been changed and then corrected. If two bits change, perhaps both the second and seventh from the left, the byte takes on yet another value, and standard ECC memory can detect the error but cannot correct it. After all, you are using ECC memory, so ensuring the data is correct is important; if an uncorrectable memory error occurs, you would probably want the system to stop.
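To make the effect of bit flips concrete, here is a minimal Python sketch. The starting byte `0b10011100` is an arbitrary value chosen for illustration, not one taken from the text above, and `flip_bits` is a hypothetical helper name.

```python
def flip_bits(value: int, positions_from_left: list[int], width: int = 8) -> int:
    """Return `value` with the listed bits inverted (position 1 is the leftmost bit)."""
    for pos in positions_from_left:
        value ^= 1 << (width - pos)
    return value


original = 0b10011100                     # arbitrary example byte (assumption)
one_flip = flip_bits(original, [2])       # second bit from the left flipped
two_flips = flip_bits(original, [2, 7])   # second and seventh bits flipped

for label, byte in (("original", original),
                    ("one flip", one_flip),
                    ("two flips", two_flips)):
    print(f"{label:>9}: {byte:08b} = {byte}")
```

A single flipped bit moves the value from 156 to 220 here, which is the kind of change single-bit correction silently repairs; the two-bit case is the one that can only be detected.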

The source of bit-flipping usually originates in some sort of electrical or magnetic interference inside the system. This interference can cause a bit to flip at seemingly random times, depending on the circumstances.

According to the Wikipedia article and a paper on single-event upsets in RAM, most single-bit flips are the result of background radiation, primarily neutrons from cosmic rays. The lower number is just about one error per gigabit of memory per hour.

The upper number indicates roughly one error every 1, years per gigabit of memory. A study of real memory errors took place at Google. During their investigations they found that one third of the machines and more than 8 percent of the DIMMs saw correctable errors per year.

The study went on to report some other interesting results that bear repeating here. A DIMM that has a correctable error is 13— times more likely to see another in the same month. An uncorrectable error is preceded by a correctable error 70—80 percent of the time.

A correctable error increases the probability of an uncorrectable error by factors of 9— Uncorrectable errors following a correctable error are still small at 0.


The incidence of correctable errors increases with age, but the incidence of uncorrectable errors decreases with age. The increasing incidence of correctable errors sets in after about 10 to 18 months.

The most likely reason for uncorrectable errors decreasing is that DIMMs with a large number of correctable errors are replaced, decreasing the likelihood of uncorrectable errors. Moreover, the rate of correctable errors can be an important factor in watching for memory failure.

Consequently, I think monitoring and capturing the correctable error information is very important.

Linux and Memory Errors

When I worked for Linux Networx years ago, they were helping with a project that was called bluesmoke.

The idea was to have a kernel module that could catch and report hardware-related errors within the system. This goes beyond just memory errors to include hardware errors in the cache, DMA, fabric switching, thermal throttling, hypertransport bus, and so on.

For many years, people wrote EDAC (Error Detection and Correction) kernel modules for various chipsets so they could capture hardware-related error information and report it.

This was initially done outside the kernel at the beginning of the project, but it later became part of the kernel, starting with a 2.x release. Rather than focus on getting EDAC working, I want to focus on what information it can provide and why it is important.
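As a rough illustration of the kind of information EDAC exposes, the sketch below reads the correctable and uncorrectable error counters for each memory controller. It assumes the EDAC drivers are loaded and that the counters appear under `/sys/devices/system/edac/mc/mc*/` as `ce_count` and `ue_count`, which is the layout described in the kernel's EDAC documentation; the function name `read_edac_counts` is hypothetical, and the layout on any given system should be checked before relying on it.

```python
from pathlib import Path

# Assumed standard EDAC sysfs location; verify on your own system.
EDAC_ROOT = Path("/sys/devices/system/edac/mc")


def read_edac_counts(root: Path = EDAC_ROOT) -> dict[str, dict[str, int]]:
    """Return {controller: {"ce_count": n, "ue_count": m}} for each mc* directory."""
    counts: dict[str, dict[str, int]] = {}
    for mc_dir in sorted(root.glob("mc[0-9]*")):
        entry: dict[str, int] = {}
        for name in ("ce_count", "ue_count"):
            counter = mc_dir / name
            if counter.is_file():
                entry[name] = int(counter.read_text().strip())
        counts[mc_dir.name] = entry
    return counts


if __name__ == "__main__":
    # Prints nothing if no memory controllers are registered with EDAC.
    for controller, entry in read_edac_counts().items():
        print(controller, entry)
```

Sampling these counters periodically and storing the values is one simple way to watch the correctable error rate over time, which the study results above suggest is worth tracking.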

I'll be using a Dell PowerEdge R as an example system. It was running CentOS 6.

Error detection and correction has great practical importance in maintaining data (information) integrity across noisy channels and less-than-reliable storage media.


A single-error-correction, double-error-detection scheme is most often used in real systems. The modified code uses a different parity check bit scheme that balances the number of inputs to the logic for each check bit and thus the number of inputs to each.
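The idea of single-error correction with double-error detection can be sketched with a textbook extended Hamming code over 4 data bits. This is only an illustration under that assumption: it is not the modified, input-balanced code mentioned above, real ECC memory protects much wider words in hardware, and the names `encode` and `decode` are hypothetical.

```python
from typing import Optional


def encode(nibble: int) -> list[int]:
    """Encode 4 data bits into 8 bits: index 0 is overall parity, 1..7 are Hamming positions."""
    d1, d2, d3, d4 = [(nibble >> shift) & 1 for shift in (3, 2, 1, 0)]
    code = [0] * 8
    code[3], code[5], code[6], code[7] = d1, d2, d3, d4
    code[1] = code[3] ^ code[5] ^ code[7]   # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]   # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]   # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2             # makes the parity of all 8 bits even
    return code


def decode(received: list[int]) -> tuple[str, Optional[int]]:
    """Return (status, nibble); nibble is None when the error cannot be corrected."""
    code = list(received)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos                 # XOR of the positions holding a 1
    parity_odd = sum(code) % 2 == 1
    if syndrome == 0 and not parity_odd:
        status = "ok"
    elif parity_odd:
        code[syndrome] ^= 1                 # single error; syndrome 0 means the parity bit itself
        status = "corrected"
    else:
        return "uncorrectable", None        # nonzero syndrome with even parity: two bits flipped
    nibble = (code[3] << 3) | (code[5] << 2) | (code[6] << 1) | code[7]
    return status, nibble


word = encode(0b1011)
single = list(word); single[5] ^= 1
double = list(single); double[2] ^= 1
print(decode(word), decode(single), decode(double))
```

The overall parity bit is what separates the two cases: an odd number of flipped bits is treated as a single correctable error, while a nonzero syndrome with even overall parity signals a double error that can only be reported.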

Note: Data can be corrupted during transmission. Some applications require that errors be detected and corrected.