Reliability and performance of component based software systems with restarts, retries, reboots and repairs
High reliability and performance are vital for software systems that handle diverse mission-critical applications. Such systems are usually component-based and may possess multiple levels of fault recovery. Many parameters, including the software architecture, the behavior of individual components, the underlying hardware, and the fault recovery measures, affect the behavior of such systems, and an approach is needed to study these effects. In this paper we present an integrated approach for modeling and analyzing component-based systems with multiple levels of failure and fault recovery at both the software and the hardware level. The approach can be used to analyze attributes such as overall reliability, performance, and machine availability for systems in which failures may occur in the software components, the operating system, or the hardware, and in which corresponding restarts, retries, reboots, or repairs are used for mitigation. Our approach combines Markov chain and queueing network modeling to estimate system reliability, machine availabilities, and performance. It is helpful both for designing and building better systems and for improving existing ones. ©2006 IEEE.
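
The abstract does not give the model itself, so the following is only a minimal sketch of the general kind of Markov-chain calculation it alludes to, not the authors' formulation. It assumes that retries are independent attempts, that control transfers between components follow a discrete-time Markov chain that eventually reaches a successful exit, and that each machine alternates between an up and a down state. All function names, parameters, and numbers are hypothetical.

```python
# Illustrative sketch only: the names, structure, and numbers are assumptions,
# not taken from the paper.

def reliability_with_retries(r, retries):
    """Probability that a component succeeds within 1 + retries attempts,
    assuming each attempt succeeds independently with probability r."""
    return 1.0 - (1.0 - r) ** (retries + 1)


def steady_state_availability(mttf, mttr):
    """Long-run fraction of time a machine is up in a two-state
    (up/down) Markov model with mean time to failure and to repair."""
    return mttf / (mttf + mttr)


def system_reliability(transfer, exit_prob, rel, start=0):
    """Probability of reaching successful termination from component `start`.

    transfer[i][j]: probability that control passes from component i to j
    exit_prob[i]:   probability that the application ends after component i
    rel[i]:         probability that component i executes without failure

    Solves x[i] = rel[i] * (exit_prob[i] + sum_j transfer[i][j] * x[j]).
    """
    n = len(rel)
    # Augmented matrix for (I - diag(rel) * transfer) x = diag(rel) * exit_prob
    a = [
        [(1.0 if i == j else 0.0) - rel[i] * transfer[i][j] for j in range(n)]
        + [rel[i] * exit_prob[i]]
        for i in range(n)
    ]
    # Gauss-Jordan elimination with partial pivoting
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(a[row][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for row in range(n):
            if row != col:
                factor = a[row][col] / a[col][col]
                a[row] = [x - factor * y for x, y in zip(a[row], a[col])]
    return a[start][n] / a[start][start]


if __name__ == "__main__":
    # Hypothetical two-component pipeline; the second component gets one retry.
    rel = [0.9, reliability_with_retries(0.8, retries=1)]
    transfer = [[0.0, 1.0], [0.0, 0.0]]
    exit_prob = [0.0, 1.0]
    print("system reliability:", system_reliability(transfer, exit_prob, rel))
    print("machine availability:", steady_state_availability(mttf=990.0, mttr=10.0))
```

In this sketch a retry only raises the effective reliability of a single component, while the linear system propagates that effect through the architecture. A queueing network model for performance, which the abstract also mentions, is not covered here.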