Cache error propagation model
Cache memory is a small, fast memory system that holds frequently used data. As processor speeds increase, aggressive design practices raise the probability of faults and latent errors, because the processor allows only a short time for each read and write. A fault may corrupt the cache memory system or lead to an erroneous internal CPU state. In this paper, we investigate error propagation in the cache memory system due to transient faults in the cache memory itself, in the processor's registers, or in both. The information gained from this investigation should lead to more effective error recovery mechanisms against failures caused by transient faults in the machine's cache memory and register set. We establish that although the computer system recovers from the effect of a single erroneous cache location or processor register about 50% of the time, in the other 50% of cases error recovery is effected only through specific recovery mechanisms. Our results are obtained using both a discrete-time Markov model and error injection on a real system.
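To make the modelling approach concrete, the following is a minimal sketch of how a discrete-time Markov chain with absorbing states can describe the fate of a single error. The four states and every transition probability are hypothetical, chosen only for illustration; they are not taken from the paper's model or its measurements.

```python
# Illustrative absorbing discrete-time Markov chain for error propagation.
# ASSUMPTION: the state set and all probabilities below are invented for
# this sketch; they are not the paper's model or its measured values.

STATES = ["cache_error", "register_error", "recovered", "failed"]

# P[i][j] = probability of moving from state i to state j in one step.
P = [
    [0.60, 0.20, 0.12, 0.08],  # cache error: stays, spreads to a register, is overwritten, causes failure
    [0.10, 0.50, 0.20, 0.20],  # register error: written back to cache, stays, is overwritten, causes failure
    [0.00, 0.00, 1.00, 0.00],  # recovered (absorbing)
    [0.00, 0.00, 0.00, 1.00],  # failed (absorbing)
]


def step(dist, matrix):
    """Advance a state distribution by one time step (row vector times matrix)."""
    n = len(matrix)
    return [sum(dist[i] * matrix[i][j] for i in range(n)) for j in range(n)]


def long_run_distribution(start, matrix, steps=200):
    """Approximate the absorption distribution by iterating the chain."""
    dist = [0.0] * len(matrix)
    dist[start] = 1.0
    for _ in range(steps):
        dist = step(dist, matrix)
    return dist


# Start from a single erroneous cache location.
dist = long_run_distribution(0, P)
for name, prob in zip(STATES, dist):
    print(f"{name}: {prob:.3f}")
```

After enough steps, nearly all probability mass sits in the two absorbing states, so the entries for `recovered` and `failed` approximate the chance that a single cache error is masked versus the chance that it requires an explicit recovery mechanism under these assumed probabilities.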