Rethinking Software Fault Tolerance
Traditional software fault tolerance relies on redundancy based on design diversity. While proven effective, the independent development of multiple versions of a program or component incurs high costs. This article shows that failures caused by so-called Mandelbugs (i.e., software faults whose activation and/or error propagation depends on the system environment) can often be treated by generating or forcing a new or modified execution environment. In the case of aging-related bugs, a subtype of Mandelbugs, failures can be postponed or prevented via a proactive technique known as software rejuvenation. Indeed, techniques based on environmental diversity, such as retry, reboot, or failover to an identical replica, are used successfully in practice. We discuss two such real-world examples: the IBM Session Initiation Protocol (SIP) Application Server cluster and Avaya gateway servers.
Duke Scholars
Published In
DOI
EISSN
ISSN
Publication Date
Volume
Issue
Start / End Page
Related Subject Headings
- Operations Research
- 4612 Software engineering
- 4010 Engineering practice and education
- 0906 Electrical and Electronic Engineering
- 0803 Computer Software