Daniel J. Sorin

Professor of Electrical and Computer Engineering
Electrical and Computer Engineering
Box 90291, Durham, NC 27708-0291
403 Wilkinson Building, Durham, NC 27708

Selected Publications


Rigorous Evaluation of Computer Processors with Statistical Model Checking

Conference Proceedings of the 56th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2023 · October 28, 2023 Experiments with computer processors must account for the inherent variability in executions. Prior work has shown that real systems exhibit variability, and random effects must be injected into simulators to account for it. Thus, we can run multiple execu ... Full text Cite

HeteroGen: Automatic Synthesis of Heterogeneous Cache Coherence Protocols

Journal Article IEEE Micro · July 1, 2023 We address the two challenges architects face when designing heterogeneous processors with cache-coherent shared memory. First, we introduce HeteroGen, an automated tool for composing clusters of cores, each with its own coherence protocol. Second, we show ... Full text Cite

HeteroGen: Automatic Synthesis of Heterogeneous Cache Coherence Protocols

Conference Proceedings - International Symposium on High-Performance Computer Architecture · January 1, 2022 We solve the two challenges architects face when designing heterogeneous processors with cache coherent shared memory. First, we develop an automated tool, called HeteroGen, for composing clusters of cores, each with its own coherence protocol. Second, we ... Full text Cite

Spatiotemporal Strategies for Long-Term FPGA Resource Management

Conference Proceedings - 2022 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2022 · January 1, 2022 The deployment of increasingly large and capable FPGAs has motivated mechanisms for sharing them, but system support for FPGAs is not yet mature. Traditional scheduling algorithms do not account for the unique characteristics of FPGAs, leading to infeasibl ... Full text Cite

Reconfigurable Hardware in Postsilicon Microarchitecture

Journal Article Computer · March 1, 2021 Computer architects want to design processors that are general purpose yet have the performance of special-purpose hardware tailored to each application. Recently, this goal has led to a proliferation of hardware accelerators for important tasks, including ... Full text Cite

Learning Sparse Matrix Row Permutations for Efficient SpMM on GPU Architectures

Conference Proceedings - 2021 IEEE International Symposium on Performance Analysis of Systems and Software, ISPASS 2021 · March 1, 2021 Achieving peak performance on sparse operations is challenging. The distribution of the non-zero elements and underlying hardware platform affect the execution efficiency. Given the diversity in workloads and architectures, no unique solution always wins. ... Full text Cite

Bayesian Optimization for Efficient Accelerator Synthesis

Journal Article ACM Transactions on Architecture and Code Optimization · January 1, 2021 Accelerator design is expensive due to the effort required to understand an algorithm and optimize the design. Architects have embraced two technologies to reduce costs. High-level synthesis automatically generates hardware from code. Reconfigurable fabric ... Full text Cite

Roadmap subsampling for changing environments

Conference IEEE International Conference on Intelligent Robots and Systems · October 24, 2020 Precomputed roadmaps can enable effective multi-query motion planning: a roadmap can be built for a robot as if no obstacles were present, and then after edges invalidated by obstacles observed at query time are deleted, path search through the remaining r ... Full text Cite

Foosball Coding: Correcting Shift Errors and Bit Flip Errors in 3D Racetrack Memory

Conference Proceedings - 50th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2020 · June 1, 2020 Racetrack memory is a promising new non-volatile memory technology, especially because of the density of its 3D implementation. However, for 3D racetrack to reach its potential, certain reliability issues must be overcome. Prior work used per-track encodin ... Full text Cite

HieraGen: Automated Generation of Concurrent, Hierarchical Cache Coherence Protocols

Conference Proceedings - International Symposium on Computer Architecture · May 1, 2020 We present HieraGen, a new tool for automatically generating hierarchical cache coherence protocols. HieraGen's inputs are the simple, atomic, stable state protocols for each level of the hierarchy. HieraGen's output is a highly concurrent hierarchical pro ... Full text Cite

Computer Architecture for Orbital Edge Computing

Journal Article Computer · April 1, 2020 Full text Cite

Prospector: Synthesizing Efficient Accelerators via Statistical Learning

Conference Proceedings of the 2020 Design, Automation and Test in Europe Conference and Exhibition, DATE 2020 · March 1, 2020 Accelerator design is expensive due to the effort required to understand an algorithm and optimize the design. Architects have embraced two technologies to reduce costs. High-level synthesis automatically generates hardware from code. Reconfigurable fabric ... Full text Cite

A Primer on Memory Consistency and Cache Coherence, Second Edition

Journal Article Synthesis Lectures on Computer Architecture · January 1, 2020 Many modern computer systems, including homogeneous and heterogeneous architectures, support shared memory in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. For a shared memory machine, ... Full text Cite

A programmable architecture for robot motion planning acceleration

Conference Proceedings of the International Conference on Application-Specific Systems, Architectures and Processors · July 1, 2019 We have designed a programmable architecture to accelerate collision detection and graph search, two of the principal components of robotic motion planning. The programmability enables the architecture to be applied to a wide range of different robots and ... Full text Cite

GreenFlag: Protecting 3D-Racetrack Memory from Shift Errors

Conference Proceedings - 49th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2019 · June 1, 2019 Racetrack memory is an exciting emerging memory technology with the potential to offer far greater capacity and performance than other non-volatile memories. Racetrack memory has an unusual error model, though, which precludes the use of the typical error ... Full text Cite

Extending flash lifetime in embedded processors by expanding analog choice

Journal Article IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems · November 1, 2018 We extend the lifetime of Flash memory in embedded processors by exploiting the fact that data from sensors is inherently analog. Prior work in the computer architecture community has assumed that all data is digital and has overlooked the opportunities av ... Full text Cite

ProtoGen: Automatically generating directory cache coherence protocols from atomic specifications

Conference Proceedings - International Symposium on Computer Architecture · July 19, 2018 Designing directory cache coherence protocols is complicated because coherence transactions are not atomic in modern multicore processors. A coherence transaction comprises multiple messages, and these messages can interleave with other conflicting coheren ... Full text Cite

Low-Power Content Addressable Memory

Journal Article Computer · March 1, 2018 This installment of Computer's series highlighting the work published in IEEE Computer Society journals comes from IEEE Computer Architecture Letters. ... Full text Cite

Jenga: Efficient fault tolerance for stacked DRAM

Conference Proceedings - 35th IEEE International Conference on Computer Design, ICCD 2017 · November 22, 2017 In this paper, we introduce Jenga, a new scheme for protecting 3D DRAM, specifically high bandwidth memory (HBM), from failures in bits, rows, banks, channels, dies, and TSVs. By providing redundancy at the granularity of a cache block rather than across b ... Full text Cite

Architecting hierarchical coherence protocols for push-button parametric verification

Conference Proceedings of the Annual International Symposium on Microarchitecture, MICRO · October 14, 2017 Recent work in formal verification theory and verification-aware design has sought to bridge the divide between the class of protocols architects want to design and the class of protocols that are verifiable with state of the art tools. Particularly, the r ... Full text Cite

Verifiable hierarchical protocols with network invariants on parametric systems

Conference Proceedings of the 16th Conference on Formal Methods in Computer-Aided Design, FMCAD 2016 · March 24, 2017 We present Neo, a framework for designing pre-verified protocol components that can be instantiated and connected in an arbitrarily large hierarchy (tree), with a guarantee that the whole system satisfies a given safety property. We employ the idea of netw ... Full text Cite

Persistent Memory

Journal Article Computer · March 1, 2017 This installment of Computer's series highlighting the work published in IEEE Computer Society journals comes from Computer Architecture Letters. ... Full text Cite

The microarchitecture of a real-time robot motion planning accelerator

Conference Proceedings of the Annual International Symposium on Microarchitecture, MICRO · December 14, 2016 We have developed a hardware accelerator for motion planning, a critical operation in robotics. In this paper, we present the microarchitecture of our accelerator and describe a prototype implementation on an FPGA. We experimentally show that the accelerat ... Full text Cite

Methuselah flash: Rewriting codes for extra long storage lifetime

Conference Proceedings - 46th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2016 · September 29, 2016 Motivated by embedded systems and datacenters that require long-life components, we extend the lifetime of Flash memory using rewriting codes that allow for multiple writes to a page before it needs to be erased. Although researchers have previously explor ... Full text Cite

Robot motion planning on a chip

Conference Robotics: Science and Systems · January 1, 2016 We describe a process that constructs robot-specific circuitry for motion planning, capable of generating motion plans approximately three orders of magnitude faster than existing methods. Our method is based on building collision detection circuits for a ... Cite

Writing without disturb on phase change memories by integrating coding and layout design

Conference ACM International Conference Proceeding Series · October 5, 2015 We integrate coding techniques and layout design to eliminate write-disturb in phase change memories (PCMs), while enhancing lifetime and host-visible capacity. We first propose a checkerboard configuration for cell layout to eliminate write-disturb w ... Full text Cite

PVCoherence: Designing Flat Coherence Protocols for Scalable Verification

Journal Article IEEE Micro · May 1, 2015 The goal of this work is to design cache coherence protocols with many cores such that they can be verified with existing verification methodologies. In particular, the authors focus on flat (nonhierarchical) coherence protocols using a mostly automated me ... Full text Cite

Multi-program benchmark definition

Conference ISPASS 2015 - IEEE International Symposium on Performance Analysis of Systems and Software · April 27, 2015 Although definition of single-program benchmarks is relatively straightforward (a benchmark is a program plus a specific input), definition of multi-program benchmarks is more complex. Each program may have a different runtime and they may have different int ... Full text Open Access Cite

Argus-G: Comprehensive, low-cost error detection for GPGPU cores

Journal Article IEEE Computer Architecture Letters · January 1, 2015 We have developed and evaluated Argus-G, an error detection scheme for general purpose GPU (GPGPU) cores. Argus-G is a natural extension of the Argus error detection scheme for CPU cores, and we demonstrate how to modify Argus such that it is compatible wi ... Full text Cite

Recycled Error Bits: Energy-Efficient Architectural Support for Floating Point Accuracy

Conference International Conference for High Performance Computing, Networking, Storage and Analysis, SC · January 16, 2014 In this work, we provide energy-efficient architectural support for floating point accuracy. For each floating point addition performed, we "recycle" that operation's rounding error. We make this error architecturally visible such that it can be used, when ... Full text Cite

Nostradamus: Low-cost hardware-only error detection for processor cores

Journal Article Proceedings - Design, Automation and Test in Europe, DATE · 2014 Cite

Architecting dynamic power management to be formally verifiable

Journal Article Proceedings - Design Automation Conference · January 1, 2014 Many computer systems employ dynamic power management (DPM) to maximize power efficiency. DPM offers great opportunities, but deploying it carries significant risks if the DPM scheme is not completely verified. We propose architecting the DPM scheme such t ... Full text Cite

Nostradamus: Low-cost hardware-only error detection for processor cores

Conference Proceedings - Design, Automation and Test in Europe, DATE · January 1, 2014 We propose a new, low-cost, hardware-only scheme to detect errors in superscalar, out-of-order processor cores. For each instruction decoded, Nostradamus compares what the instruction is expected to do against what the instruction actually does. We impleme ... Full text Cite

PVCoherence: Designing flat coherence protocols for scalable verification

Conference Proceedings - International Symposium on High-Performance Computer Architecture · 2014 The goal of this work is to design cache coherence protocols with many cores that can be verified with state-of-the-art automated verification methodologies. In particular, we focus on flat (non-hierarchical) coherence protocols, and we use a mostly-automa ... Full text Cite

Scalably verifiable dynamic power management

Conference Proceedings - International Symposium on High-Performance Computer Architecture · January 1, 2014 Dynamic power management (DPM) is critical to maximizing the performance of systems ranging from multicore processors to datacenters. However, one formidable challenge with DPM schemes is verifying that the DPM schemes are correct as the number of computat ... Full text Cite

The recycling of fly ash to obtain building materials

Conference International Multidisciplinary Scientific GeoConference Surveying Geology and Mining Ecology Management, SGEM · December 1, 2013 At present, fly ash is deposited in huge dumps throughout the world. Storing it causes many problems, especially environmental ones: it contaminates land and water (with heavy metals), occupies large areas of land, and is blown about by the wind (will contamin ... Full text Cite

Exploring memory consistency for massively-threaded throughput-oriented processors

Journal Article Proceedings - International Symposium on Computer Architecture · August 12, 2013 We re-visit the issue of hardware consistency models in the new context of massively-threaded throughput-oriented processors (MTTOPs). A prominent example of an MTTOP is a GPGPU, but other examples include Intel's MIC architecture and some recent academic ... Full text Cite

Coset coding to extend the lifetime of memory

Journal Article Proceedings - International Symposium on High-Performance Computer Architecture · July 23, 2013 Some recent memory technologies, including phase change memory (PCM), have lifetime reliabilities that are affected by write operations. We propose the use of coset coding to extend the lifetimes of these memories. The key idea of coset coding is that it p ... Full text Cite

Applying reduced precision arithmetic to detect errors in floating point multiplication

Journal Article Proceedings of IEEE Pacific Rim International Symposium on Dependable Computing, PRDC · January 1, 2013 Prior work developed an efficient technique, called reduced precision checking, for detecting errors in floating point addition. In this work, we extend reduced precision checking (RPC) to multiplication. Our results show that RPC can successfully detect e ... Full text Cite

Evaluating cache coherent shared virtual memory for heterogeneous multicore chips

Journal Article ISPASS 2013 - IEEE International Symposium on Performance Analysis of Systems and Software · January 1, 2013 Although current homogeneous chips tightly couple the cores with cache-coherent shared virtual memory (CCSVM), this is not the communication paradigm used by any current heterogeneous chip. In this paper, we present a CCSVM design for a CPU/GPU chip, as we ... Full text Cite

Building materials realised with fly ash

Conference 12th International Multidisciplinary Scientific GeoConference and EXPO - Modern Management of Mine Producing, Geology and Environmental Protection, SGEM 2012 · December 1, 2012 This paper presents experimental research on obtaining building materials with a large quantity of fly ash from the Timisoara Power Plant, Romania. Classical binders (lime and cement) are used to activate the components of fly as ... Cite

Writing cosets of a convolutional code to increase the lifetime of Flash memory

Journal Article 2012 50th Annual Allerton Conference on Communication, Control, and Computing, Allerton 2012 · December 1, 2012 The goal of this paper is to extend the lifetime of Flash memory by reducing the frequency with which a given page of memory is erased. This is accomplished by increasing the number of writes that are possible before erasure is necessary. Redundancy is int ... Full text Cite

Why on-chip cache coherence is here to stay

Journal Article Communications of the ACM · July 1, 2012 The article discusses how on-chip hardware coherence can scale gracefully as the number of cores increases. Cache coherence has come to dominate the market for technical, as well as for legacy, reasons. Technically, hardware cache coherence provides perfor ... Full text Cite

Building a roadmap for enhanced oil recovery prefeasibility study

Conference Society of Petroleum Engineers - SPE Russian Oil and Gas Exploration and Production Technical Conference and Exhibition 2012 · January 1, 2012 This paper describes an easy-to-use and fast-track roadmap for an Enhanced Oil Recovery (EOR) Prefeasibility Study, including (1) screening of suitable EOR methods, (2) estimating additional recovery with mechanistic 3D models, (3) evaluating preliminary econom ... Full text Cite

Architectures for online error detection and recovery in multicore processors

Journal Article Proceedings - Design, Automation and Test in Europe, DATE · May 31, 2011 The huge investment in the design and production of multicore processors may be put at risk because the emerging highly miniaturized but unreliable fabrication technologies will impose significant barriers to the life-long reliable operation of future chip ... Cite

Address translation aware memory consistency

Journal Article IEEE Micro · January 1, 2011 Computer systems with virtual memory are susceptible to design bugs and runtime faults in their address translation systems. Detecting bugs and faults requires a clear specification of correct behavior. A new framework for address translation aware memory ... Full text Cite

An FPGA-based experimental evaluation of microprocessor core error detection with Argus-2

Journal Article Performance Evaluation Review · January 1, 2011 Recently, several researchers have proposed schemes for low-cost, low-power error detection in the processor core. In this work, we demonstrate that one particular scheme, an enhanced implementation of the Argus framework called Argus-2, is a viable option ... Full text Cite

A primer on memory consistency and cache coherence

Journal Article Synthesis Lectures on Computer Architecture · January 1, 2011 Many modern computer systems and most multicore chips (chip multiprocessors) support shared memory in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. For a shared memory machine, the mem ... Full text Cite

Fractal Coherence: Scalably verifiable cache coherence

Conference Proceedings of the Annual International Symposium on Microarchitecture, MICRO · December 1, 2010 We propose an architectural design methodology for designing formally verifiable cache coherence protocols, called Fractal Coherence. Properly designed to be fractal in behavior, the proposed family of cache coherence protocols can be formally verified cor ... Full text Cite

Fractal consistency: Architecting the memory system to facilitate verification

Journal Article IEEE Computer Architecture Letters · July 1, 2010 One of the most challenging problems in developing a multicore processor is verifying that the design is correct, and one of the most difficult aspects of pre-silicon verification is verifying that the memory system obeys the architecture’s specified ... Full text Open Access Cite

Specifying and dynamically verifying address translation-aware memory consistency

Conference International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS · May 19, 2010 Computer systems with virtual memory are susceptible to design bugs and runtime faults in their address translation (AT) systems. Detecting bugs and faults requires a clear specification of correct behavior. To address this need, we develop a framework for ... Full text Cite

Unified Instruction/Translation/Data (UNITD) coherence: One protocol to rule them all

Conference Proceedings - International Symposium on High-Performance Computer Architecture · January 1, 2010 We propose UNITD, a unified hardware coherence framework that integrates translation coherence into the existing cache coherence protocol. In UNITD coherence protocols, the TLBs participate in the cache coherence protocol just like the instruction and data ... Full text Open Access Cite

Specifying and dynamically verifying address translation-aware memory consistency

Conference ACM SIGPLAN Notices · January 1, 2010 Computer systems with virtual memory are susceptible to design bugs and runtime faults in their address translation (AT) systems. Detecting bugs and faults requires a clear specification of correct behavior. To address this need, we develop a framework for ... Full text Cite

Reduced precision checking for a floating point adder

Journal Article Proceedings - IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems · December 1, 2009 We present an error detection technique for a floating point adder which uses a checker adder of reduced precision to determine if the result is correct within some error bound. Our analysis establishes a relationship between the width of the checker adder ... Full text Cite

Analyzing formal verification and testing efforts of different fault tolerance mechanisms

Journal Article Proceedings - IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems · December 1, 2009 Pre-fabrication design verification and post-fabrication chip testing are two important stages in the product realization process. These two stages consume a large part of resources in the form of time, money, and engineering effort during the process [1]. ... Full text Cite

Dynamic power gating with quality guarantees

Journal Article Proceedings of the International Symposium on Low Power Electronics and Design · November 24, 2009 Power gating is usually driven by a predictive control, and frequent mispredictions can counter-productively lead to a large increase in energy consumption. This energy vulnerability could be exploited by malicious applications such as a power virus, or it ... Full text Cite

Multicore power management: Ensuring robustness via early-stage formal verification

Journal Article 2009 7th IEEE-ACM International Conference on Formal Methods and Models for Co-Design, MEMOCODE '09 · November 19, 2009 Dynamic power management (DPM) is important for multicore architectures. One important challenge for multicore DPM schemes is verifying that they are both safe (cannot lead to power or thermal catastrophes) and efficient (achieve as much performance as pos ... Full text Cite

Coherence Basics

Chapter · January 1, 2009 In this chapter, we introduce enough about cache coherence to understand how consistency models interact with caches. We start in Section 2.1 by presenting the system model that we consider throughout this primer. To simplify the exposition in this chapter ... Full text Cite

Relaxed Memory Consistency

Chapter · January 1, 2009 The previous two chapters explored the memory consistency models sequential consistency (SC) and total store order (TSO). These chapters presented SC as intuitive and TSO as widely implemented (e.g., in x86). Both models are sometimes called strong because ... Full text Cite

Introduction to Consistency and Coherence

Chapter · January 1, 2009 Many modern computer systems and most multicore chips (chip multiprocessors) support shared memory in hardware. In a shared memory system, each of the processor cores may read and write to a single shared address space. These designs seek various goodness ... Full text Cite

Advanced Topics in Coherence

Chapter · January 1, 2009 In Chapters 7 and 8, we have presented snooping and directory coherence protocols in the context of the simplest system models that were sufficient for explaining the fundamental issues of these protocols. In this chapter, we extend our presentation of coh ... Full text Cite

Snooping Coherence Protocols

Chapter · January 1, 2009 In this chapter, we present snooping coherence protocols. Snooping protocols were the first widely-deployed class of protocols and they continue to be used in a variety of systems. Snooping protocols offer many attractive features, including low-latency co ... Full text Cite

Total Store Order and the x86 Memory Model

Chapter · January 1, 2009 A widely implemented memory consistency model is total store order (TSO). TSO is used in SPARC implementations and, more importantly, appears to match the memory consistency model of the widely used x86 architecture. This chapter presents this important co ... Full text Cite

Directory Coherence Protocols

Chapter · January 1, 2009 In this chapter, we present directory coherence protocols. Directory protocols were originally developed to address the lack of scalability of snooping protocols. Traditional snooping systems broadcast all requests on a totally ordered interconnection netw ... Full text Cite

Coherence Protocols

Chapter · January 1, 2009 In this chapter, we return to the topic of cache coherence that we introduced in Chapter 2. We defined coherence in Chapter 2, in order to understand coherence’s role in supporting consistency, but we did not delve into how specific coherence protocols wor ... Full text Cite

Memory Consistency Motivation and Sequential Consistency

Chapter · January 1, 2009 This chapter delves into memory consistency models (a.k.a. memory models) that define the behavior of shared memory systems for programmers and implementors. These models define correctness so that programmers know what to expect and implementors know what ... Full text Cite

Dynamic verification of memory consistency in cache-coherent multithreaded computer architectures

Journal Article IEEE Transactions on Dependable and Secure Computing · January 1, 2009 Multithreaded servers with cache-coherent shared memory are the dominant type of machines used to run critical network services and database management systems. To achieve the high availability required for these tasks, it is necessary to incorporate mecha ... Full text Cite

Core cannibalization architecture: Improving lifetime chip performance for multicore processors in the presence of hard faults

Journal Article Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT · December 1, 2008 To improve the lifetime performance of a multicore chip with simple cores, we propose the Core Cannibalization Architecture (CCA). A chip with CCA provisions a fraction of the cores as cannibalizable cores (CCs). In the absence of hard faults, the CCs func ... Full text Cite

Reducing the impact of intra-core process variability with criticality-based resource allocation and prefetching

Journal Article Conference on Computing Frontiers - Proceedings of the 2008 Conference on Computing Frontiers, CF'08 · December 1, 2008 We develop architectural techniques for mitigating the impact of process variability. Our techniques hide the performance effects of slow components (including registers, functional units, and L1I and L1D cache frames) without slowing the clock frequency or ... Full text Cite

Detouring: Translating software to circumvent hard faults in simple cores

Journal Article Proceedings of the International Conference on Dependable Systems and Networks · October 13, 2008 CMOS technology trends are leading to an increasing incidence of hard (permanent) faults in processors. These faults may be introduced at fabrication or occur in the field. Whereas high-performance processor cores have enough redundancy to tolerate many of ... Full text Cite

The impact of dynamically heterogeneous multicore processors on thread scheduling

Journal Article IEEE Micro · May 1, 2008 Although most current multicore processors are homogeneous, microarchitects are now proposing heterogeneous core implementations, including systems in which heterogeneity is introduced at runtime. This article shows that operating system schedulers must co ... Full text Cite

Argus: Low-cost, comprehensive error detection in simple cores

Journal Article IEEE Micro · January 1, 2008 Argus, a novel approach for detecting errors in simple processor cores, dynamically verifies the correctness of the four tasks performed by a von Neumann core: control flow, data flow, computation, and memory access. Argus detects transient and permanent e ... Full text Cite

Low-cost run-time diagnosis of hard delay faults in the functional units of a microprocessor

Journal Article 2007 IEEE International Conference on Computer Design, ICCD 2007 · December 1, 2007 This paper addresses the run-time diagnosis of delay faults in functional units of microprocessors. Despite the popularity of the stuck-at fault model, it is no longer the only relevant fault model. The delay fault model - which assumes that the faulty cir ... Full text Cite

Reducing the impact of process variability with prefetching and criticality-based resource allocation

Journal Article Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT · December 1, 2007 Full text Cite

Verification-aware microprocessor design

Journal Article Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT · December 1, 2007 The process of verifying a new microprocessor is a major problem for the computer industry. Currently, architects design processors to be fast, power-efficient, and reliable. However, architects do not quantify the impact of these design decisions on the e ... Full text Cite

Error detection using dynamic dataflow verification

Journal Article Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT · December 1, 2007 A significant fraction of the circuitry in a modern processor is dedicated to converting the linear instruction stream into a representation that allows the execution of instructions in data dependence order, rather than program order, to extract instructi ... Full text Cite

Argus: Low-cost, comprehensive error detection in simple cores

Journal Article Proceedings of the Annual International Symposium on Microarchitecture, MICRO · December 1, 2007 We have developed Argus, a novel approach for providing low-cost, comprehensive error detection for simple cores. The key to Argus is that the operation of a von Neumann core consists of four fundamental tasks - control flow, dataflow, computation, and mem ... Full text Cite

Lazy error detection for microprocessor functional units

Conference Proceedings - IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems · December 1, 2007 We propose and evaluate the use of lazy error detection for a superscalar, out-of-order microprocessor's functional units. The key insight is that error detection is off the critical path, because an instruction's results are speculative for at least a cyc ... Full text Cite

Unified microprocessor core storage

Journal Article 2007 Computing Frontiers, Conference Proceedings · October 22, 2007 The organization and management of microprocessor storage structures (e.g., L1 caches, TLBs, etc.) is critical to the performance and energy consumption of the microprocessor. We propose and develop the first microprocessor that can dynamically allocate st ... Full text Cite

Error detection via online checking of cache coherence with token coherence signatures

Journal Article Proceedings - International Symposium on High-Performance Computer Architecture · August 10, 2007 To provide high dependability in a multithreaded system despite hardware faults, the system must detect and correct errors in its shared memory system. Recent research has explored dynamic checking of cache coherence as a comprehensive approach to memory s ... Full text Cite

Online Diagnosis of Hard Faults in Microprocessors

Journal Article ACM Transactions on Architecture and Code Optimization · January 1, 2007 We develop a microprocessor design that tolerates hard faults, including fabrication defects and in-field faults, by leveraging existing microprocessor redundancy. To do this, we must: detect and correct errors, diagnose hard faults at the field deconfigur ... Full text Cite

Dynamic verification of memory consistency in cache-coherent multithreaded computer architectures

Journal Article Proceedings of the International Conference on Dependable Systems and Networks · December 22, 2006 Multithreaded servers with cache-coherent shared memory are the dominant type of machines used to run critical network services and database management systems. To achieve the high availability required for these tasks, it is necessary to incorporate mecha ... Full text Cite

Choosing an error protection scheme for a microprocessor's L1 data cache

Journal Article IEEE International Conference on Computer Design, ICCD 2006 · December 1, 2006 We deconstruct and compare the two dominant existing approaches for L1 data cache (L1D) error protection, with respect to performance, L2 cache bandwidth, power, and area. The two approaches are: (1) parity on the L1D with write-through to an ECC-protected ... Full text Cite

Applying architectural vulnerability analysis to hard faults in the microprocessor

Journal Article Performance Evaluation Review · June 1, 2006 In this paper, we present a new metric, Hard-Fault Architectural Vulnerability Factor (H-AVF), to allow designers to more effectively compare alternate hard-fault tolerance schemes. In order to provide intuition on the use of H-AVF as a metric, we evaluate ... Full text Cite

Spin detection hardware for improved management of multithreaded systems

Journal Article IEEE Transactions on Parallel and Distributed Systems · June 1, 2006 Spinning is a synchronization mechanism commonly used in applications and operating systems. Excessive spinning, however, often indicates performance or correctness (e.g., livelock) problems. Detecting if applications and operating systems are spinning is ... Full text Cite

NANA: A nano-scale active network architecture

Journal Article ACM Journal on Emerging Technologies in Computing Systems · January 1, 2006 This article explores the architectural challenges introduced by emerging bottom-up fabrication of nanoelectronic circuits. The specific nanotechnology we explore proposes patterned DNA nanostructures as a scaffold for the placement and interconnection of ... Full text Cite

Self-checking and self-diagnosing 32-bit microprocessor multiplier

Journal Article Proceedings - International Test Conference · January 1, 2006 In this paper, we propose a low-cost fault tolerance technique for microprocessor multipliers, both non-pipelined (NP) and pipelined (P). Our fault tolerant multiplier designs are capable of detecting and correcting errors, diagnosing hard faults, and reco ... Full text Cite

Circuit-level modeling for concurrent testing of operational defects due to gate oxide breakdown

Journal Article Proceedings - Design, Automation and Test in Europe, DATE '05 · December 1, 2005 As device sizes shrink and current densities increase, the probability of device failures due to gate oxide breakdown (OBD) also increases. To provide designs that are tolerant to such failures, we must investigate and understand the manifestations of this ... Full text Cite

A mechanism for online diagnosis of hard faults in microprocessors

Journal Article Proceedings of the Annual International Symposium on Microarchitecture, MICRO · December 1, 2005 We develop a microprocessor design that tolerates hard faults, including fabrication defects and in-field faults, by leveraging existing microprocessor redundancy. To do this, we must: detect and correct errors, diagnose hard faults at the field deconfigur ... Full text Cite

Dynamic verification of sequential consistency

Journal Article Proceedings - International Symposium on Computer Architecture · November 10, 2005 In this paper, we develop the first feasibly implementable scheme for end-to-end dynamic verification of multithreaded memory systems. For multithreaded (including multiprocessor) memory systems, end-to-end correctness is defined by its memory consistency ... Full text Cite

Autonomic microprocessor execution via self-repairing arrays

Journal Article IEEE Transactions on Dependable and Secure Computing · January 1, 2005 To achieve high reliability despite hard faults that occur during operation and to achieve high yield despite defects introduced at fabrication, a microprocessor must be able to tolerate hard faults. In this paper, we present a framework for autonomic self ... Full text Cite

Self-assembled architectures and the temporal aspects of computing

Journal Article Computer · January 1, 2005 Despite the convenience of clean abstractions, technological trends are blurring the lines between design layers and creating new interactions between previously unrelated architecture layers. For example, virtual machines such as VMWare and Transmeta impl ... Full text Cite

Pulse: A dynamic deadlock detection mechanism using speculative execution

Conference USENIX 2005 Annual Technical Conference · January 1, 2005 Deadlock can occur wherever multiple processes interact. Most existing static and dynamic deadlock detection tools focus on simple types of deadlock, such as those caused by incorrect ordering of lock acquisitions. In this paper, we propose Pulse, a novel ... Cite

Semi-empirical SPICE models for carbon nanotube FET logic

Journal Article 2004 4th IEEE Conference on Nanotechnology · December 1, 2004 To evaluate the potential of carbon nanotube field effect transistors (CNFETs) to replace silicon CMOS technology, we develop a SPICE model of CNFET nanoelectronics. Our model is parameterizable, and it enables composition of models of various aspects of n ... Cite

Using speculation to simplify multiprocessor design

Journal Article Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2004 (Abstracts and CD-ROM) · December 1, 2004 Modern multiprocessors are complex systems that often require years to design and verify. A significant factor is that engineers must allocate a disproportionate share of their effort to ensure that rare corner-case events behave correctly. This paper prop ... Cite

Design tools for a DNA-guided self-assembling carbon nanotube technology

Journal Article Nanotechnology · September 1, 2004 The shift in technology away from silicon complementary metal-oxide semiconductors (CMOS) to novel nanoscale technologies requires new design tools. In this paper, we explore one particular nanotechnology: carbon nanotube transistors that are self-assemble ... Full text Cite

Communication breakdown: Analyzing CPU usage in commercial web workloads

Conference 2004 IEEE International Symposium on Performance Analysis of Systems and Software · June 14, 2004 There is increasing concern among developers that future web servers running commercial workloads may be limited by network processing overhead in the CPU as 10Gb ethernet becomes prevalent. We analyze CPU usage of real hardware running popular commercial ... Full text Cite

Tolerating hard faults in microprocessor array structures

Journal Article Proceedings of the International Conference on Dependable Systems and Networks · January 1, 2004 In this paper, we present a hardware technique, called Self-Repairing Array Structures (SRAS), for masking hard faults in microprocessor array structures, such as the reorder buffer and branch history table. SRAS masks errors that could otherwise lead to s ... Full text Cite

Quantifying instruction criticality for shared memory multiprocessors

Conference Annual ACM Symposium on Parallel Algorithms and Architectures · December 1, 2003 A model was created for determining criticality in MP systems. An algorithm was devised for computing criticality and criticality of real MP workloads was evaluated. A directed acyclic graph (DAG) model for executing: critical path and slack; mapping DAGs ... Cite

Dynamic Verification of End-to-End Multiprocessor Invariants

Journal Article Proceedings of the International Conference on Dependable Systems and Networks · December 1, 2003 As implementations of shared memory multiprocessors become more complicated, hardware faults will increasingly cause errors that are difficult or impossible to detect with low-level, localized mechanisms. In this paper, we argue for dynamic verification (i ... Full text Cite

Analytic evaluation of shared-memory architectures

Journal Article IEEE Transactions on Parallel and Distributed Systems · February 1, 2003 This paper develops and validates an efficient analytical model for evaluating the performance of shared memory architectures with ILP processors. First, we instrument the SimOS simulator to measure the parameters for such a model and we find a surprisingl ... Full text Cite

Simulating a $2M commercial server on a $2K PC

Journal Article Computer (USA) · 2003 As dependence on database management systems and Web servers increases, so does the need for them to run reliably and efficiently, goals that rigorous simulations can help achieve. Execution-driven simulation models system hardware. These simulations captur ... Full text Link to item Cite

Quantifying instruction criticality for shared memory multiprocessors

Conference Annual ACM Symposium on Parallel Algorithms and Architectures · January 1, 2003 Recent research on processor microarchitecture suggests using instruction criticality as a metric to guide hardware control policies. Fields et al. [3, 4] have proposed a directed acyclic graph (DAG) model for characterizing program microexecutions on unip ... Full text Cite

Using destination-set prediction to improve the latency/bandwidth tradeoff in shared-memory multiprocessors

Journal Article Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA · January 1, 2003 Destination-set prediction can improve the latency/bandwidth tradeoff in shared-memory multiprocessors. The destination set is the collection of processors that receive a particular coherence request. Snooping protocols send requests to the maximal destina ... Full text Cite

Specifying and verifying a broadcast and a multicast snooping cache coherence protocol

Journal Article IEEE Transactions on Parallel and Distributed Systems · June 1, 2002 In this paper, we develop a specification methodology that documents and specifies a cache coherence protocol in eight tables: the states, events, actions, and transitions of the cache and memory controllers. We then use this methodology to specify a detai ... Full text Cite

SafetyNet: Improving the availability of shared memory multiprocessors with global checkpoint/recovery

Journal Article Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA · January 1, 2002 We develop an availability solution, called SafetyNet, that uses a unified, lightweight checkpoint/recovery mechanism to support multiple long-latency fault detection schemes. At an abstract level, SafetyNet logically maintains multiple, globally consisten ... Cite

Bandwidth adaptive snooping

Conference Proceedings - International Symposium on High-Performance Computer Architecture · January 1, 2002 This paper advocates that cache coherence protocols use a bandwidth adaptive approach to adjust to varied system configurations (e.g., number of processors) and workload behaviors. We propose Bandwidth Adaptive Snooping Hybrid (BASH), a hybrid protocol tha ... Full text Cite

Correctly implementing value prediction in microprocessors that support multithreading or multiprocessing

Journal Article Proceedings of the Annual International Symposium on Microarchitecture · December 1, 2001 This paper explores the interaction of value prediction with thread-level parallelism techniques, including multithreading and multiprocessing, where correctness is defined by a memory consistency model. Value prediction subtly interacts with the memory co ... Full text Cite

Timestamp snooping: An approach for extending SMPs

Journal Article International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS · December 1, 2000 Symmetric multiprocessor (SMP) servers provide superior performance for the commercial workloads that dominate the Internet. Our simulation results show that over one-third of cache misses by these applications result in cache-to-cache transfers, where the ... Cite

AMVA techniques for high service time variability

Journal Article Performance Evaluation Review · January 1, 2000 Motivated by experience gained during the validation of a recent Approximate Mean Value Analysis (AMVA) model of modern shared memory architectures, this paper re-examines the "standard" AMVA approximation for non-exponential FCFS queues. We find that this ... Full text Cite

Timestamp snooping: An approach for extending SMPs

Journal Article SIGPLAN Notices (ACM Special Interest Group on Programming Languages) · January 1, 2000 Symmetric multiprocessor (SMP) servers provide superior performance for the commercial workloads that dominate the Internet. Our simulation results show that over one-third of cache misses by these applications result in cache-to-cache transfers, where the ... Full text Cite

Timestamp snooping: An approach for extending SMPs

Journal Article Operating Systems Review (ACM) · January 1, 2000 Symmetric multiprocessor (SMP) servers provide superior performance for the commercial workloads that dominate the Internet. Our simulation results show that over one-third of cache misses by these applications result in cache-to-cache transfers, where the ... Full text Cite

Using Lamport clocks to reason about relaxed memory models

Journal Article IEEE High-Performance Computer Architecture Symposium Proceedings · January 1, 1999 Cache coherence protocols of current shared-memory multiprocessors are difficult to verify. Our previous work proposed an extension of Lamport's logical clocks for showing that multiprocessors can implement sequential consistency (SC) with an SGI Origin 20 ... Full text Cite

System-level specification framework for I/O architectures

Journal Article Annual ACM Symposium on Parallel Algorithms and Architectures · January 1, 1999 A computer system is useless unless it can interact with the outside world through input/output (I/O) devices. I/O systems are complex, including aspects such as memory-mapped operations, interrupts, and bus bridges. Often, I/O behavior is described for is ... Cite

Multicast snooping: A new coherence method using a multicast address network

Journal Article Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA · January 1, 1999 This paper proposes a new coherence method called 'multicast snooping' that dynamically adapts between broadcast snooping and a directory protocol. Multicast snooping is unique because processors predict which caches should snoop each coherence transaction ... Cite

Analytic evaluation of shared-memory systems with ILP processors

Journal Article Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA · January 1, 1998 This paper develops and validates an analytical model for evaluating various types of architectural alternatives for shared-memory systems with processors that aggressively exploit instruction-level parallelism. Compared to simulation, the analytical model ... Cite

Lamport clocks: verifying a directory cache-coherence protocol

Journal Article Annual ACM Symposium on Parallel Algorithms and Architectures · January 1, 1998 Modern shared-memory multiprocessors use complex memory system implementations that include a variety of non-trivial and interacting optimizations. More time is spent in verifying the correctness of such implementations than in designing the system. In par ... Cite