Design and implementation of fault tolerant memory

Design and implementation of fault-tolerant memory based on CPLD

Abstract: The causes of memory errors are analyzed, and an effective way to improve memory reliability is proposed. In connection with the reliability growth plan for an aerospace computer, a fault-tolerant scheme using an error correction and detection chip is presented, and the simulation results of its CPLD implementation are given. Finally, the reliability of the fault-tolerant memory is analyzed.

Key words: fault tolerance; error correction and detection; transient error; optimal odd-weight code

As the performance (speed, integration, etc.) of circuits and chips continues to improve, reliability is often the first requirement, especially in military, aerospace and similar applications. Demands on system reliability keep rising, which sets strict targets for the design and manufacture of circuit systems.

Memory is one of the most commonly used devices in a circuit system and is built from large-scale integrated memory chips. Field statistics show that the main memory errors in space applications are 1-bit errors [1] or correlated multi-bit errors caused by transient faults (also known as single event upsets, SEU), while random independent multi-bit errors are rare.

Errors in semiconductor memory are generally divided into hard errors and soft errors, and soft errors dominate. A hard error shows up as repeated errors when accessing data at one or more locations, which can happen because one or more storage cells have failed. Soft errors are mainly caused by alpha particles: the material of the memory chip contains trace radioactive elements that intermittently release particles, which strike the storage capacitor with considerable energy, change its charge, and corrupt the stored data. Another cause of soft errors is noise interference. In addition, in the space environment, charged particles with sufficient energy can flip bits in the storage cells of the memory, producing SEU errors [2]. In this paper, memory fault tolerance is designed and implemented with CPLD technology and an error correction and detection chip, which greatly improves the reliability of the system. The design of the fault-tolerant memory is described below.

1 Error detection and error correction principle

Common error correction codes that can detect 2-bit errors and correct 1-bit errors (SEC-DED [3, 4] for short) include the extended Hamming code and the optimal odd-weight code. Both have a minimum code distance of 4, and they have much in common. For example, their redundancy is the same: for k data bits, the number of check bits r must satisfy $2^{r-1} \ge k + r$. When k = 16, r = 6; when the data length is doubled, only one more check bit is needed, so the coding efficiency is high. In terms of origin, they are respectively an extended code and a truncated code of the Hamming code, and some literature calls the optimal odd-weight code a modified Hamming code. Reference [4] introduces the encoding and decoding theory of SEC-DED and SEC-AUED codes. In terms of performance, the optimal odd-weight code is superior to the extended Hamming code, including in error correction and detection: its probability of miscorrecting a 3-bit error is lower, while its probability of detecting a 4-bit error is higher. Most importantly, it is easy to implement in hardware, so it is the most widely used. The optimal odd-weight code is used in this paper.
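The check-bit condition can be evaluated directly. The following is a minimal sketch, assuming the SEC-DED bound $2^{r-1} \ge k + r$ stated above; the function name `min_check_bits` is introduced here for illustration only.

```python
def min_check_bits(k: int) -> int:
    """Smallest r satisfying 2**(r-1) >= k + r for k data bits."""
    r = 1
    while 2 ** (r - 1) < k + r:
        r += 1
    return r


# Each doubling of the data width costs one extra check bit.
for k in (8, 16, 32, 64):
    print(f"k = {k:2d} data bits -> r = {min_check_bits(k)} check bits")
```

For k = 8 this gives r = 5, which matches the 8 data bits and 5 check bits of the (13, 8, 4) code used later in the text.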

First, construct the check matrix of the optimal odd-weight code, i.e. the H matrix. The H matrix of the optimal odd-weight code should meet the following requirements:

(1) Each column contains an odd number of 1s, and no two columns are identical.

(2) The total number of 1s is as small as possible, so that the expressions generating the check bits and the syndrome need few modulo-2 adders (XOR gates). This saves hardware, reduces cost and improves reliability.

(3) The number of 1s in each row is equal to, or as close as possible to, a common average value. This keeps the generation logic and its number of levels uniform, so decoding is fast and the circuit is symmetrical.
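The three requirements can be checked mechanically for a candidate matrix. The sketch below builds one possible H matrix for 8 data bits and 5 check bits; the particular choice of columns is an assumption made for illustration and is not claimed to be the matrix used in the paper.

```python
from itertools import combinations

R = 5  # number of check bits (rows of H)

# Hypothetical column choice: the 8 data columns are weight-3 patterns.
# Two of the ten possible weight-3 columns are left out so that the row
# weights stay as balanced as possible.
LEFT_OUT = {(0, 1, 2), (0, 3, 4)}
data_cols = [c for c in combinations(range(R), 3) if c not in LEFT_OUT]
check_cols = [(i,) for i in range(R)]  # identity part, one column per check bit
cols = data_cols + check_cols          # each column lists the rows that hold a 1

# Requirement (1): odd column weight, no repeated column.
all_odd = all(len(c) % 2 == 1 for c in cols)
all_distinct = len(set(cols)) == len(cols)

# Requirement (2): total number of 1s (3 is the smallest odd weight
# still available once the weight-1 columns are used by the check bits).
total_ones = sum(len(c) for c in cols)

# Requirement (3): number of 1s in each row should be nearly equal.
row_weights = [sum(r in c for c in cols) for r in range(R)]

print("columns:", len(cols), "odd:", all_odd, "distinct:", all_distinct)
print("total ones:", total_ones, "row weights:", row_weights)
```

With this choice the row weights differ by at most one, so each check bit is produced by an XOR tree of nearly the same size.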

In this application, the (13, 8, 4) optimal odd-weight code is used. The data code is (D7 D6 D5 D4 D3 D2 D1 D0) and the check code is (C4 C3 C2 C1 C0). The P matrix and coding rules are:

When decoding, the check bits obtained by re-encoding the data are added modulo 2 to the original check bits to obtain the syndrome S, which distinguishes the error type:

(1) If S = 0, it is considered that there is no error.

(2) If S ≠ 0 and S contains an odd number of 1s, it is considered that a 1-bit error has occurred; if S ≠ 0 and S contains an even number of 1s, it is considered that a 2-bit error has occurred.

Therefore, the error pattern S = [S0 S1 S2 S3 S4] corresponds one-to-one to the error that produced it, which realizes the function of correcting 1-bit errors and detecting 2-bit errors.
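The decoding rule above can be sketched in a few lines. The parity columns `P_COLS` below are an assumed set of weight-3 columns chosen for illustration, not the P matrix of the paper; `encode` and `decode` are hypothetical helper names.

```python
# Assumed parity columns: P_COLS[i] is the 5-bit pattern of check bits
# (bit j stands for Cj) that data bit Di takes part in. All have weight 3.
P_COLS = [0b01011, 0b10011, 0b01101, 0b10101,
          0b01110, 0b10110, 0b11010, 0b11100]


def encode(data: int) -> int:
    """Return the 5 check bits for an 8-bit data word."""
    check = 0
    for i, col in enumerate(P_COLS):
        if (data >> i) & 1:
            check ^= col
    return check


def decode(data: int, check: int):
    """Return (status, data) where data is corrected when possible."""
    s = encode(data) ^ check            # syndrome S, modulo-2 sum
    if s == 0:
        return "no error", data
    if bin(s).count("1") % 2 == 0:
        return "double error", data     # detected but not correctable
    if s in P_COLS:                     # S equals a data column: flip that bit
        return "single error", data ^ (1 << P_COLS.index(s))
    if s & (s - 1) == 0:                # weight 1: a check bit is wrong
        return "single error", data     # the data itself is already correct
    return "uncorrectable", data        # odd weight but matches no column


stored_check = encode(0xAA)
print(decode(0xAA, stored_check))
print(decode(0xAA ^ 0x04, stored_check))   # one data bit flipped
print(decode(0xAA ^ 0x05, stored_check))   # two data bits flipped
```

Because every column has odd weight, the sum of any two distinct columns has even, non-zero weight, which is why a 2-bit error can never be mistaken for a 1-bit error.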

2 Memory fault-tolerant chip design and implementation

2.1 Memory design and implementation scheme

(1) Backup row (or column) scheme

This scheme adds several backup rows (or columns) when the memory chip is designed and manufactured. During chip test, any defective row (or column) found is replaced by a backup row (or column) through laser (or electrical) processing. The method has the advantages of simple design, little increase in die area, and no loss of circuit speed. However, it requires extra process steps for testing and repairing the defective rows (or columns). A more serious weakness is that this scheme applies only to RAM, not ROM.

(2) Error correction coding scheme

This scheme uses error correction coding inside the memory chip to detect and correct errors automatically. It requires no additional testing, repair, or other process steps, and besides improving yield it significantly improves reliability. Its most outstanding advantage is that it is especially suitable for ROM; it can also be used for RAM when the speed requirement is not high. Its main drawbacks are that it takes up additional chip area and that encoding and decoding reduce the overall working speed of the chip. Introducing error correction coding and other system-level memory fault-tolerance techniques into the memory chip is an effective way to improve the yield and reliability of memory chips. For example, the ECC memory used in servers adopts this technology.

The fault-tolerant memory in this paper adopts the error correction coding scheme; its implementation block diagram is shown in Figure 1.

2.2 Design of error correction and detection circuit

During the write cycle, the data on the bus is written directly to the memory through the error correction circuit, and the 5-bit check code generated by the error correction circuit is written to the redundant memory. The read cycle is divided into two steps. In the first step, the data and the check bits are read from the memory and the redundant memory respectively and latched in the error correction circuit. In the second step, error detection is performed. If there is no error, the data is sent directly to the data bus. A 2-bit error generates an interrupt for processing, and a 1-bit error is corrected before the data is sent to the data bus. Because only correct data is required, an error in a check bit needs no processing: the data is already correct and is output directly.

2.3 Circuit input and output design

RD, WR and CLK are the signals input from the CPU to the error correction and detection circuit, and the on-chip control signals are generated from them by the control circuit. During a write, DB[7..0] is input from the data bus, latched, and written to the memory through the tri-state control (santai module). At the same time, the check code is generated from the data by the check code generation module (paritygen) and written into the redundant memory through the tri-state control. During a read, the memory data is read into the error correction and detection circuit and latched, and a 5-bit check code is generated from it. This code and the 5-bit check code read from the redundant memory are passed to the errorsample module to generate the error pattern, which is used to detect errors. When the data contains an error, the errorcorrect module corrects it and the correct data is output to the data bus. Errordetec is the error status module, and SEF and DEF are the error status signals: (SEF, DEF) = (0, 0) indicates no error, (1, 0) a 1-bit error, and (1, 1) a 2-bit error. The functional modules of each part of the circuit are shown in Figure 2.
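The write and read cycles described in sections 2.2 and 2.3 can be modelled behaviourally. This is only a software sketch under stated assumptions: the parity columns `P_COLS` are a hypothetical choice, the class and exception names are invented for illustration, the exception stands in for the interrupt, and the flag encoding (SEF, DEF) = (0, 0), (1, 0), (1, 1) for no error, 1-bit error and 2-bit error is assumed.

```python
# Assumed weight-3 parity columns for D0..D7 (not the paper's P matrix).
P_COLS = [0b01011, 0b10011, 0b01101, 0b10101,
          0b01110, 0b10110, 0b11010, 0b11100]


def parity_gen(data: int) -> int:
    """Model of the check code generation step: 8 data bits -> 5 check bits."""
    check = 0
    for i, col in enumerate(P_COLS):
        if (data >> i) & 1:
            check ^= col
    return check


class DoubleBitError(Exception):
    """Stands in for the interrupt raised on an uncorrectable error."""


class FaultTolerantMemory:
    def __init__(self, size: int):
        self.data_mem = [0] * size    # main memory, 8 data bits per word
        self.check_mem = [0] * size   # redundant memory, 5 check bits per word
        self.flags = (0, 0)           # (SEF, DEF) after the last read

    def write(self, addr: int, data: int) -> None:
        data &= 0xFF
        self.data_mem[addr] = data
        self.check_mem[addr] = parity_gen(data)

    def read(self, addr: int) -> int:
        # Step 1: fetch (latch) data and check bits from the two memories.
        data, check = self.data_mem[addr], self.check_mem[addr]
        # Step 2: form the error pattern and act on it.
        s = parity_gen(data) ^ check
        if s == 0:
            self.flags = (0, 0)
        elif s & (s - 1) == 0:        # one check bit wrong, data is correct
            self.flags = (1, 0)
        elif s in P_COLS:             # one data bit wrong, flip it back
            self.flags = (1, 0)
            data ^= 1 << P_COLS.index(s)
        else:                         # even weight or unknown pattern
            self.flags = (1, 1)
            raise DoubleBitError(f"uncorrectable error at address {addr}")
        return data


mem = FaultTolerantMemory(4)
mem.write(0, 0xAA)
mem.data_mem[0] ^= 0x10               # inject a 1-bit upset in the data
print(hex(mem.read(0)), mem.flags)
```

The model corrects data on the way to the bus but, like the read cycle described above, does not write the corrected word back to memory.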

3 Simulation and its waveform

In this paper, the CPLD device EPM7128 from Altera is used as the design platform [5]. Figure 3 shows the simulation of the error correction and detection circuit implemented in the CPLD. In the figure, the data AA is written from the data lines at 118 ~ 205 ns. A 1-bit error is produced when reading data at 359 ~ 443 ns, and a 2-bit error is produced at 601 ~ 692 ns; the 2-bit error is detected but cannot be corrected. At 781 ~ 863 ns, the case of a 1-bit error in the check bits is simulated.

4 Analysis and conclusion

The error correction circuit designed in this paper on the basic principle of the optimal odd-weight code can correct 1-bit errors and detect 2-bit errors. The memory is no longer interrupted by 1-bit errors, so the MTBF increases and the reliability improves. However, the additional hardware required for correcting 1-bit errors and detecting 2-bit errors causes the MTBF to decrease.

In terms of effectiveness, suppose that within a time $T$ the number of 1-bit errors is $N_1$ and the number of 2-bit and multi-bit errors is $N_2$. When no error correction code is used, the mean time between failures is $T_1 = T/(N_1 + N_2)$. After the optimal odd-weight code is adopted, 1-bit errors are correctable, and only 2-bit and multi-bit errors remain uncorrectable and are counted as failures. Assuming that the use of the error correction code increases the amount of equipment by a certain percentage, the mean time between failures with the optimal odd-weight code is:

It is estimated from the data that, depending on the proportion of 1-bit errors among all errors, the gain is G = 4.6 ~ 9.3. Implementing memory fault tolerance with a CPLD greatly shortens the design and development cycle, reduces cost, and improves the reliability of the system.
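The expression for the improved MTBF is not reproduced in the text. As a rough sketch only, assume the failure rate grows in proportion to the added equipment, so that the MTBF with coding is T / (N2 · (1 + a)), where `a` is a hypothetical name for the fractional equipment increase. Dividing by the uncoded MTBF T / (N1 + N2) gives a gain of 1 / ((1 − p) · (1 + a)), with p the proportion of 1-bit errors among all errors. The input values below are illustrative assumptions, not figures taken from the paper.

```python
def mtbf_gain(p_single: float, overhead: float) -> float:
    """Gain in MTBF under the proportional-overhead assumption.

    p_single: assumed share of 1-bit errors, N1 / (N1 + N2)
    overhead: assumed fractional increase in equipment (e.g. 0.08 for 8 %)
    """
    return 1.0 / ((1.0 - p_single) * (1.0 + overhead))


# Illustrative inputs: 80 % and 90 % single-bit errors, 8 % extra equipment.
for p in (0.8, 0.9):
    print(f"p = {p:.1f}: gain = {mtbf_gain(p, 0.08):.1f}")
```

With these assumed inputs the result falls close to the range of gains quoted above, which suggests the model is at least consistent with the reported figures; it should not be read as the paper's own formula.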

Copyright © 2011 JIN SHI