Alvin R. Lebeck

Journal Article ACM Transactions on Architecture and Code Optimization · September 17, 2025 Markov-Chain Monte-Carlo (MCMC) algorithms offer a general framework for performing interpretable inference but have high overheads due to the computational complexity of the sampling process and the large number of samples required to produce an accurate ... Full text Cite

Beethoven: A Heterogeneous Multi-Core Accelerator System Composer

Conference 2025 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS) · May 11, 2025 Full text Cite

Dense Server Design for Immersion Cooling

Journal Article ACM Transactions on Graphics · December 19, 2024 The growing demands for computational power in cloud computing have led to a significant increase in the deployment of high-performance servers. The growing power consumption of servers and the heat they produce is on track to outpace the capacity of conve ... Full text Cite

RDMA Congestion Control: It Is Only for the Compliant

Journal Article IEEE Micro · January 1, 2023 Remote direct memory access (RDMA) networks enable low latency and low central processing unit utilization, and their widespread adoption in datacenters enables improved application performance. However, there are performance isolation concerns for RDMA de ... Full text Cite

Understanding RDMA Microarchitecture Resources for Performance Isolation

Conference Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation NSDI 2023 · January 1, 2023 Recent years have witnessed the wide adoption of RDMA in the cloud to accelerate first-party workloads and achieve cost savings by freeing up CPU cycles. Now cloud providers are working towards supporting RDMA in general-purpose guest VMs to benefit third- ... Cite

Fast Convergence to Fairness for Reduced Long Flow Tail Latency in Datacenter Networks

Conference Proceedings 2022 IEEE 36th International Parallel and Distributed Processing Symposium IPDPS 2022 · January 1, 2022 Many data-intensive applications, such as distributed deep learning and data analytics, require moving vast amounts of data between compute servers in a distributed system. To meet the demands of these applications, datacenters are adopting Remote Direct M ... Full text Cite

Statistical robustness of Markov chain Monte Carlo accelerators

Conference International Conference on Architectural Support for Programming Languages and Operating Systems ASPLOS · April 17, 2021 Statistical machine learning often uses probabilistic models and algorithms, such as Markov Chain Monte Carlo (MCMC), to solve a wide range of problems. Probabilistic computations, often considered too slow on conventional processors, can be accelerated wi ... Full text Cite

Accelerating Markov Random Field Inference with Uncertainty Quantification.

Journal Article CoRR · 2021 Cite

Lightweight Inter-transaction Caching with Precise Clocks and Dynamic Self-invalidation.

Journal Article CoRR · 2020 Cite

Beyond Application End-Point Results: Quantifying Statistical Robustness of MCMC Accelerators.

Journal Article CoRR · 2020 Cite

Currentcy: A unifying abstraction for expressing energy management policies

Conference Proceedings of the General Track 2003 USENIX Annual Technical Conference · January 1, 2020 The global nature of energy creates challenges and opportunities for developing operating system policies to effectively manage energy consumption in battery-powered mobile/wireless devices. The proposed currentcy model creates the framework for the operat ... Cite

Multi-version Indexing in Flash-based Key-Value Stores

Other · December 2, 2019 Open Access Cite

A Case for Quantifying Statistical Robustness of Specialized Probabilistic AI Accelerators

Other 2019 IBM IEEE CAS/EDS – AI Compute Symposium · October 2, 2019 Open Access Cite

Message from the Program Chairs

Conference International Conference on Architectural Support for Programming Languages and Operating Systems ASPLOS · April 4, 2019 Cite

Managing tail latency in datacenter-scale file systems under production constraints

Conference Proceedings of the 14th Eurosys Conference 2019 · March 25, 2019 Distributed file systems often exhibit high tail latencies, especially in large-scale datacenters and in the presence of competing (and possibly higher priority) workloads. This paper introduces techniques for managing tail latencies in these systems, whil ... Full text Cite

Adaptive simultaneous multi-tenancy for GPUs

Conference Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics · January 1, 2019 Graphics Processing Units (GPUs) are energy-efficient massively parallel accelerators that are increasingly deployed in multi-tenant environments such as data-centers for general-purpose computing as well as graphics applications. Using GPUs in multi-tenan ... Full text Cite

Architecting a stochastic computing unit with molecular optical devices

Conference Proceedings International Symposium on Computer Architecture · July 19, 2018 The increasing difficulty in leveraging CMOS scaling for improved performance requires exploring alternative technologies. A promising technique is to exploit the physical properties of devices to specialize certain computations. A recently proposed approa ... Full text Cite

SandTrap: Tracking information flows on demand with parallel permissions

Conference Mobisys 2018 Proceedings of the 16th ACM International Conference on Mobile Systems Applications and Services · June 10, 2018 The most promising way to improve the performance of dynamic information-flow tracking (DIFT) for machine code is to only track instructions when they process tainted data. Unfortunately, prior approaches to on-demand DIFT are a poor match for modern mobil ... Full text Cite

An Optically Modulated Self-Assembled Resonance Energy Transfer Pass Gate.

Journal Article Nano Letters · June 2017 We demonstrate an optically controlled molecular-scale pass gate that uses the photoinduced dark states of fluorescent molecules to modulate the flow of excitons. The device consists of four fluorophores spatially arranged on a self-assembled DNA nanostruc ... Full text Cite

Enabling Lightweight Transactions with Precision Time

Conference ACM SIGOPS Operating Systems Review · April 4, 2017 Full text Cite

Exploiting Dark Fluorophore States to Implement Resonance Energy Transfer Pre-Charge Logic

Journal Article IEEE Micro · January 1, 2017 As lithographic feature sizes approach fundamental scaling limits, a variety of computational domains remain incompatible with integrated circuits merely due to their operating principles. Resonance energy transfer (RET) logic offers a molecular-scale solu ... Full text Cite

Accelerating Markov Random Field Inference Using Molecular Optical Gibbs Sampling Units

Conference Proceedings 2016 43rd International Symposium on Computer Architecture ISCA 2016 · August 24, 2016 The increasing use of probabilistic algorithms from statistics and machine learning for data analytics presents new challenges and opportunities for the design of computing systems. One important class of probabilistic machine learning algorithms is Markov ... Full text Cite

Exploiting Accelerators for Efficient High Dimensional Similarity Search

Conference Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming · March 2016 Cite

Combined Compute and Storage: Configurable Memristor Arrays to Accelerate Search.

Other CoRR · 2016 Cite

Nanoscale Resonance Energy Transfer-Based Devices for Probabilistic Computing

Journal Article IEEE Micro · September 1, 2015 Despite the theoretical advances in probabilistic computing, a fundamental mismatch persists between the deterministic hardware that traditional computers use and the stochastic nature of probabilistic algorithms. In this article, the authors propose Reson ... Full text Open Access Cite

More is less, less is more: Molecular-scale photonic NoC power topologies

Journal Article International Conference on Architectural Support for Programming Languages and Operating Systems ASPLOS · March 14, 2015 Molecular-scale Network-on-Chip (mNoC) crossbars use quantum dot LEDs as an on-chip light source, and chromophores to provide optical signal filtering for receivers. An mNoC reduces power consumption or enables scaling to larger crossbars for a reduced ene ... Full text Open Access Cite

mNoC: Large Nanophotonic Network-on-Chip Crossbars with Molecular Scale Devices

Journal Article ACM Journal on Emerging Technologies in Computing Systems · 2015 Full text Open Access Cite

Rhythm: Harnessing data parallel hardware for server workloads

Conference International Conference on Architectural Support for Programming Languages and Operating Systems ASPLOS · January 1, 2014 Trends in increasing web traffic demand an increase in server throughput while preserving energy efficiency and total cost of ownership. Present work in optimizing data center efficiency primarily focuses on the data center as a whole, using off-the-shelf ... Full text Open Access Cite

Modeling and simulation of a nanoscale optical computing system

Journal Article Journal of Parallel and Distributed Computing · January 1, 2014 Optical nanoscale computing is one promising alternative to the CMOS process. In this paper we explore the application of Resonance Energy Transfer (RET) logic to common digital circuits. We propose an Optical Logic Element (OLE) as a basic unit from which ... Full text Cite

Exploiting emerging technologies for nanoscale photonic Networks-on-Chip

Journal Article Sixth International Workshop on Network on Chip Architectures (NoCArc-13) · 2013 Cite

Address translation aware memory consistency

Journal Article IEEE Micro · January 1, 2011 Computer systems with virtual memory are susceptible to design bugs and runtime faults in their address translation systems. Detecting bugs and faults requires a clear specification of correct behavior. A new framework for address translation aware memory ... Full text Cite

Fractal Coherence: Scalably verifiable cache coherence

Conference Proceedings of the Annual International Symposium on Microarchitecture Micro · December 1, 2010 We propose an architectural design methodology for designing formally verifiable cache coherence protocols, called Fractal Coherence. Properly designed to be fractal in behavior, the proposed family of cache coherence protocols can be formally verified cor ... Full text Cite

Fractal consistency: Architecting the memory system to facilitate verification

Journal Article IEEE Computer Architecture Letters · July 1, 2010 One of the most challenging problems in developing a multicore processor is verifying that the design is correct, and one of the most difficult aspects of pre-silicon verification is verifying that the memory system obeys the architecture's specified ... Full text Open Access Cite

Encoded multichromophore response for simultaneous label-free detection.

Journal Article Small (Weinheim an der Bergstrasse, Germany) · April 2010 The self-assembly of molecularly precise nanostructures is widely expected to form the basis of future high-speed integrated circuits, but the technologies suitable for such circuits are not well understood. In this work, DNA self-assembly is used to creat ... Full text Cite

Molecular logic gates: Encoded Multichromophore Response for Simultaneous Label-Free Detection Small 7/2010.

Journal Article Small · March 26, 2010 Full text Link to item Cite

Specifying and dynamically verifying address translation-aware memory consistency

Conference International Conference on Architectural Support for Programming Languages and Operating Systems ASPLOS · March 13, 2010 Computer systems with virtual memory are susceptible to design bugs and runtime faults in their address translation (AT) systems. Detecting bugs and faults requires a clear specification of correct behavior. To address this need, we develop a framework for ... Full text Cite

Routing in self-organizing nano-scale irregular networks

Journal Article ACM Journal on Emerging Technologies in Computing Systems · March 1, 2010 The integration of novel nanotechnologies onto silicon platforms is likely to increase fabrication defects compared with traditional CMOS technologies. Furthermore, the number of nodes connected with these networks makes acquiring a global defect map impra ... Full text Cite

Architectural implications of nanoscale-integrated sensing and computing

Journal Article IEEE Micro · January 1, 2010 The authors explore nanoscale sensor processor (nSP) architectures. Their design includes a simple accumulator-based instruction-set architecture, sensors, limited memory, and instruction-fused sensing. Using nSP technology based on optical resonance energ ... Full text Open Access Cite

Unified Instruction/Translation/Data (UNITD) coherence: One protocol to rule them all

Conference Proceedings International Symposium on High Performance Computer Architecture · January 1, 2010 We propose UNITD, a unified hardware coherence framework that integrates translation coherence into the existing cache coherence protocol. In UNITD coherence protocols, the TLBs participate in the cache coherence protocol just like the instruction and data ... Full text Open Access Cite

Specifying and dynamically verifying address translation-aware memory consistency

Conference ACM SIGPLAN Notices · January 1, 2010 Computer systems with virtual memory are susceptible to design bugs and runtime faults in their address translation (AT) systems. Detecting bugs and faults requires a clear specification of correct behavior. To address this need, we develop a framework for ... Full text Cite

Nano-scale on-chip irregular network analysis

Conference Proceedings International Conference on Computer Communications and Networks ICCCN · November 12, 2009 Shrinking CMOS feature sizes and the integration of novel nanotechnologies onto silicon platforms are both likely to increase fabrication defects. As a result, on-chip networks become more and more irregular due to defects and it becomes more challenging t ... Full text Cite

Architectural implications of nanoscale integrated sensing and computing

Journal Article International Conference on Architectural Support for Programming Languages and Operating Systems ASPLOS · March 7, 2009 This paper explores the architectural implications of integrating computation and molecular probes to form nanoscale sensor processors (nSP). We show how nSPs may enable new computing domains and automate tasks that currently require expert scientific trai ... Full text Cite

Architectural implications of nanoscale integrated sensing and computing

Conference ACM SIGPLAN Notices · January 1, 2009 This paper explores the architectural implications of integrating computation and molecular probes to form nanoscale sensor processors (nSP). We show how nSPs may enable new computing domains and automate tasks that currently require expert scientific trai ... Full text Cite

Introduction to DAC 2007 special section

Journal Article ACM Journal on Emerging Technologies in Computing Systems · August 1, 2008 Full text Cite

Chapter 8 Self-Assembled Computer Architectures

Chapter · March 25, 2008 This chapter summarizes current work on DNA-based self-assembly of computing systems. Section 2 presents a technology overview, specifically a discussion of nanoelectronic devices and desirable characteristics. It also describes two forms of DNA-based self ... Full text Cite

Introduction to DNA Self-assembled Computer Design

Book · 2008 The use of DNA self-assembly in microchip fabrication may well revolutionize computing, and this trail-blazing book is the first to bridge the gap between current chip technology and the molecular-scale circuitries that lie ahead. ... Cite

Nanoscale optical computing using resonance energy transfer logic

Journal Article IEEE Micro · January 1, 2008 Drawing on the nanometer-placement capabilities of self-assembly fabrication methods, the authors propose a new nanoscale device based on a single-molecule optical phenomenon called resonance energy transfer. This device enables a complete integrated techn ... Full text Cite

A self-organizing defect tolerant SIMD architecture

Journal Article ACM Journal on Emerging Technologies in Computing Systems · July 1, 2007 The continual decrease in transistor size (through either scaled CMOS or emerging nanotechnologies) promises to usher in an era of tera to peta-scale integration but with increasing defects. Regardless of fabrication methodology (top-down or bottom-up), de ... Full text Cite

Self-assembled networks: Control vs. complexity

Conference 2006 1st International Conference on Nano Networks and Workshops Nano Net · December 1, 2006 DNA-based self-assembly of nanoelectronic devices is an emerging technology that has the potential to enable tera to peta-scale device integration. However, self-assembly currently is limited to manufacturing small computing blocks (nodes) which must then b ... Full text Cite

Panel: Nano-computing - Do we need new formal approaches?

Other Proceedings Fourth ACM and IEEE International Conference on Formal Methods and Models for Co Design Memocode 06 · December 1, 2006 With current CMOS technologies reaching beyond the 65-nanometer mark, and the highlights of computing fabrics such as molecular, DNA guided assemblies, quantum computing, carbon nanotube based transistors etc. are bringing the focus onto nanotechnology. The te ... Cite

A defect tolerant self-organizing nanoscale SIMD architecture

Journal Article International Conference on Architectural Support for Programming Languages and Operating Systems ASPLOS · October 23, 2006 The continual decrease in transistor size (through either scaled CMOS or emerging nano-technologies) promises to usher in an era of tera to peta-scale integration. However, this decrease in size is also likely to increase defect densities, contributing to ... Full text Cite

Spin detection hardware for improved management of multithreaded systems

Journal Article IEEE Transactions on Parallel and Distributed Systems · June 1, 2006 Spinning is a synchronization mechanism commonly used in applications and operating systems. Excessive spinning, however, often indicates performance or correctness (e.g., livelock) problems. Detecting if applications and operating systems are spinning is ... Full text Cite

Finite-size, Fully-Addressable DNA Tile Lattices Formed by Hierarchical Assembly Procedures

Journal Article Angewandte Chemie · 2006 Cite

NANA: A nano-scale active network architecture

Journal Article ACM Journal on Emerging Technologies in Computing Systems · January 1, 2006 This article explores the architectural challenges introduced by emerging bottom-up fabrication of nanoelectronic circuits. The specific nanotechnology we explore proposes patterned DNA nanostructures as a scaffold for the placement and interconnection of ... Full text Cite

Design automation for DNA self-assembled nanostructures

Journal Article Proceedings Design Automation Conference · January 1, 2006 DNA self-assembly is an emerging technology with potential as a future replacement of conventional lithographic fabrication. A key challenge is the specification of appropriate DNA sequences that are optimal according to specified metrics and satisfy vario ... Full text Cite

Self-Assembled Computer Architecture

Book · 2006 Cite

The design and fabrication of a fully addressable 8-tile DNA lattice

Conference 2nd Conference on Foundations of Nanoscience Self Assembled Architectures and Devices Fnano 2005 · December 1, 2005 We have designed and experimentally demonstrated the self-assembly of an addressable DNA lattice (i.e., a unique tile for each position in the lattice) using a two-step tile annealing procedure. Our method can be applied to a variety of systems including a ... Cite

Experiences in managing energy with ECOSystem

Journal Article IEEE Pervasive Computing · January 1, 2005 Full text Cite

Self-assembled architectures and the temporal aspects of computing

Journal Article Computer · January 1, 2005 Despite the convenience of clean abstractions, technological trends are blurring the lines between design layers and creating new interactions between previously unrelated architecture layers. For example, virtual machines such as VMWare and Transmeta impl ... Full text Cite

Pulse: A dynamic deadlock detection mechanism using speculative execution

Conference USENIX 2005 Annual Technical Conference · January 1, 2005 Deadlock can occur wherever multiple processes interact. Most existing static and dynamic deadlock detection tools focus on simple types of deadlock, such as those caused by incorrect ordering of lock acquisitions. In this paper, we propose Pulse, a novel ... Cite

Design tools for a DNA-guided self-assembling carbon nanotube technology

Journal Article Nanotechnology · September 1, 2004 The shift in technology away from silicon complementary metal-oxide semiconductors (CMOS) to novel nanoscale technologies requires new design tools. In this paper, we explore one particular nanotechnology: carbon nanotube transistors that are self-assemble ... Full text Cite

Communication breakdown: Analyzing CPU usage in commercial web workloads

Conference 2004 IEEE International Symposium on Performance Analysis of Systems and Software · June 14, 2004 There is increasing concern among developers that future web servers running commercial workloads may be limited by network processing overhead in the CPU as 10Gb ethernet becomes prevalent. We analyze CPU usage of real hardware running popular commercial ... Full text Cite

Exploiting global knowledge to achieve self-tuned congestion control for k-ary n-cube networks

Journal Article IEEE Transactions on Parallel and Distributed Systems · 2004 Network performance in tightly-coupled multiprocessors typically degrades rapidly beyond network saturation. Consequently, designers must keep a network below its saturation point by reducing the load on the network. Congestion control via source throttlin ... Full text Link to item Cite

The synergy between power-aware memory systems and processor voltage scaling

Journal Article Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics · January 1, 2004 Energy consumption is becoming a limiting factor in the development of computer systems for a range of application domains. Since processor performance comes with a high power cost, there is increased interest in scaling the CPU voltage and clock frequency ... Cite

Tolerating Memory Latency through Push Prefetching for Pointer-Intensive Applications

Journal Article ACM Transactions on Architecture and Code Optimization · January 1, 2004 Prefetching is often used to overlap memory latency with computation for array-based applications. However, prefetching for pointer-intensive applications remains a challenge because of the irregular memory access pattern and pointer-chasing problem. In th ... Full text Cite

Quantifying instruction criticality for shared memory multiprocessors

Conference Annual ACM Symposium on Parallel Algorithms and Architectures · December 1, 2003 A model was created for determining criticality in MP systems. An algorithm was devised for computing criticality and criticality of real MP workloads was evaluated. A directed acyclic graph (DAG) model for executing: critical path and slack; mapping DAGs ... Cite

Quantifying instruction criticality for shared memory multiprocessors

Conference Annual ACM Symposium on Parallel Algorithms and Architectures · January 1, 2003 Recent research on processor microarchitecture suggests using instruction criticality as a metric to guide hardware control policies. Fields et al. [3, 4] have proposed a directed acyclic graph (DAG) model for characterizing program microexecutions on unip ... Full text Cite

Modeling of DRAM power control policies using deterministic and stochastic petri nets

Conference Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics · January 1, 2003 Modern DRAM technologies offer power management features for optimization between performance and energy consumption. This paper employs Petri nets to model and evaluate memory controller policies for manipulating multiple power states. The model has been ... Full text Cite

BLAM: A high-performance routing algorithm for virtual cut-through networks

Conference Proceedings International Parallel and Distributed Processing Symposium IPDPS 2003 · January 1, 2003 High performance, freedom from deadlocks, and freedom from livelocks are desirable properties of interconnection networks. Unfortunately, these can be conflicting goals because networks may either devote or under-utilize resources to avoid deadlocks and li ... Full text Cite

ECOSystem: Managing energy as a first class operating system resource

Conference International Conference on Architectural Support for Programming Languages and Operating Systems ASPLOS · December 1, 2002 Energy consumption has recently been widely recognized as a major challenge of computer systems design. This paper explores how to support energy as a first-class operating system resource. Energy, because of its global system nature, presents challenges b ... Cite

A programmable memory hierarchy for prefetching linked data structures

Conference Lecture Notes in Computer Science Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics · December 1, 2002 Prefetching is often used to overlap memory latency with computation for array-based applications. However, prefetching for pointer-intensive applications remains a challenge because of the irregular memory access pattern and pointer-chasing problem. In th ... Full text Cite

ECOSystem: Managing energy as a first class operating system resource

Conference Operating Systems Review ACM · December 1, 2002 Energy consumption has recently been widely recognized as a major challenge of computer systems design. This paper explores how to support energy as a first-class operating system resource. Energy, because of its global system nature, presents challenges b ... Cite

Recursive array layouts and fast matrix multiplication

Journal Article IEEE Transactions on Parallel and Distributed Systems · November 1, 2002 The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in mem ... Full text Cite

A large, fast instruction window for tolerating cache misses

Conference Conference Proceedings Annual International Symposium on Computer Architecture ISCA · January 1, 2002 Instruction window size is an important design parameter for many modern processors. Large instruction windows offer the potential advantage of exposing large amounts of instruction level parallelism. Unfortunately, naively scaling conventional window desi ... Full text Cite

Exact analysis of the cache behavior of nested loops

Conference SIGPLAN Notices ACM Special Interest Group on Programming Languages · January 1, 2001 We develop from first principles an exact model of the behavior of loop nests executing in a memory hierarchy, by using a nontraditional classification of misses that has the key property of composability. We use Presburger formulas to express various kind ... Full text Cite

The combinatorics of cache misses during matrix multiplication

Journal Article Journal of Computer and System Sciences · January 1, 2001 In this paper we construct an analytic model of cache misses during matrix multiplication. The analysis in this paper applies to square matrices of size m where the array layout function is given in terms of a function Θ that interleaves the bit ... Full text Cite

Memory controller policies for DRAM power management

Conference Proceedings of the International Symposium on Low Power Electronics and Design Digest of Technical Papers · January 1, 2001 The increasing importance of energy efficiency has produced a multitude of hardware devices with various power management features. This paper investigates memory controller policies for manipulating DRAM power states in cache-based systems. We develop an ... Full text Cite

Locality vs. criticality

Conference Conference Proceedings Annual International Symposium on Computer Architecture ISCA · January 1, 2001 Current memory hierarchies exploit locality of references to reduce load latency and thereby improve processor performance. Locality based schemes aim at reducing the number of cache misses and tend to ignore the nature of misses. This leads to a potential ... Full text Cite

Self-tuned congestion control for multiprocessor networks

Conference IEEE High Performance Computer Architecture Symposium Proceedings · January 1, 2001 Network performance in tightly-coupled multiprocessors typically degrades rapidly beyond network saturation. Consequently, designers must keep a network below its saturation point by reducing the load on the network. Congestion control via source throttlin ... Cite

Exact analysis of the cache behavior of nested loops

Conference Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation PLDI · January 1, 2001 We develop from first principles an exact model of the behavior of loop nests executing in a memory hierarchy, by using a nontraditional classification of misses that has the key property of composability. We use Presburger formulas to express various kind ... Cite

Every joule is precious: The case for revisiting operating system design for energy efficiency

Conference Proceedings of the 9th Workshop on ACM Sigops European Workshop Beyond the PC New Challenges for the Operating System Ew 2000 · September 17, 2000 This paper advocates revisiting all aspects of Operating System design and implementation with energy-efficiency as the primary objective rather than the traditional OS metrics of maximizing performance and fairness. Energy is an increasingly important res ... Full text Cite

Exploiting parallelism in geometry processing with general purpose processors and floating-point SIMD instructions

Journal Article IEEE Transactions on Computers · September 2000 Three-dimensional (3D) graphics applications have become very important workloads running on today's computer systems. A cost-effective graphics solution is to perform geometry processing of 3D graphics on the host CPU and have specialized hardware handle ... Full text Link to item Cite

Push vs. pull: Data movement for linked data structures

Conference Proceedings of the International Conference on Supercomputing · January 1, 2000 As the performance gap between the CPU and main memory continues to grow, techniques to hide memory latency are essential to deliver a high performance computer system. Prefetching can often overlap memory latency with computation for array-based numeric a ... Cite

Power aware page allocation

Conference SIGPLAN Notices ACM Special Interest Group on Programming Languages · January 1, 2000 One of the major challenges of post-PC computing is the need to reduce energy consumption, thereby extending the lifetime of the batteries that power these mobile devices. Memory is a particularly important target for efforts to improve energy efficiency. ... Full text Cite

Load latency tolerance in dynamically scheduled processors

Journal Article Journal of Instruction Level Parallelism · October 1, 1999 This paper provides a quantitative evaluation of load latency tolerance in a dynamically scheduled processor. To determine the latency tolerance of each memory load operation, our simulations use flexible load completion policies instead of a fixed memory ... Cite

Cache conscious programming in undergraduate computer science

Conference SIGCSE 1999 Proceedings of the 13th SIGCSE Technical Symposium on Computer Science Education · March 24, 1999 The wide-spread use of microprocessor based systems that utilize cache memory to alleviate excessively long DRAM access times introduces a new dimension in the quest to obtain good program performance. To fully exploit the performance potential of these fa ... Cite

Cache conscious programming in undergraduate computer science

Conference SIGCSE Bulletin Association for Computing Machinery Special Interest Group on Computer Science Education · January 1, 1999 The wide-spread use of microprocessor based systems that utilize cache memory to alleviate excessively long DRAM access times introduces a new dimension in the quest to obtain good program performance. To fully exploit the performance potential of these fa ... Full text Cite

Recursive array layouts and fast parallel matrix multiplication

Conference Annual ACM Symposium on Parallel Algorithms and Architectures · January 1, 1999 Matrix multiplication is an important kernel in linear algebra algorithms, and the performance of both serial and parallel implementations is highly dependent on the memory system behavior. Unfortunately, due to false sharing and cache conflicts, tradition ... Full text Cite

Nonlinear array layouts for hierarchical memory systems

Conference Proceedings of the International Conference on Supercomputing · January 1, 1999 Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an ... Full text Cite

Annotated memory references: A mechanism for informed cache management

Conference Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 1999 As the importance of cache performance increases, allowing software to assist in cache management decisions becomes an attractive alternative. This paper focuses primarily on a mechanism for software to convey information to the memory hierarchy. We introd ... Full text Cite

Exploiting instruction level parallelism in geometry processing for three dimensional graphics applications

Conference Proceedings of the Annual International Symposium on Microarchitecture · December 1, 1998 Three dimensional (3D) graphics applications have become very important workloads running on today's computer systems. A cost-effective graphics solution is to perform geometry processing of 3D graphics on the host CPU and have specialized hardware handle ... Cite

Load latency tolerance in dynamically scheduled processors

Conference Proceedings of the Annual International Symposium on Microarchitecture · December 1, 1998 This paper provides quantitative measurements of load latency tolerance in a dynamically scheduled processor. To determine the latency tolerance of each memory load operation, our simulations use flexible load completion policies instead of a fixed memory ... Cite

Tuning Strassen's matrix multiplication for memory efficiency

Conference Proceedings of the International Conference on Supercomputing · January 1, 1998 Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. W ... Full text Cite

Active memory: A new abstraction for memory system simulation

Journal Article ACM Transactions on Modeling and Computer Simulation · January 1, 1997 This article describes the active memory abstraction for memory-system simulation. In this abstraction - designed specifically for on-the-fly simulation - memory references logically invoke a user-specified function depending upon the reference's type and ... Full text Cite

Cut-through delivery in trapeze: an exercise in low-latency messaging

Conference IEEE International Symposium on High Performance Distributed Computing Proceedings · January 1, 1997 New network technology continues to improve both the latency and bandwidth of communication in computer clusters. The fastest high-speed networks approach or exceed the I/O bus bandwidths of 'gigabit-ready' hosts. These advances introduce new consideration ... Cite

Active memory: A new abstraction for memory-system simulation

Conference Proceedings of the 1995 ACM Sigmetrics Joint International Conference on Measurement and Modeling of Computer Systems Sigmetrics 1995 Performance 1995 · May 1, 1995 This paper describes the active memory abstraction for memory-system simulation. In this abstraction, designed specifically for on-the-fly simulation, memory references logically invoke a user-specified function depending upon the reference's type and acces ... Full text Cite

Dynamic self-invalidation: reducing coherence overhead in shared-memory multiprocessors

Conference ACM SIGARCH (Association for Computing Machinery Special Interest Group on Computer Architecture) - Conference Proceedings · January 1, 1995 This paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache blo ... Cite

Dynamic self-invalidation: Reducing coherence overhead in shared-memory multiprocessors

Conference Conference Proceedings Annual International Symposium on Computer Architecture ISCA · January 1, 1995 This paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache blo ... Cite

Fine-grain access control for distributed shared memory

Conference International Conference on Architectural Support for Programming Languages and Operating Systems ASPLOS · November 1, 1994 This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper focu ... Full text Cite

Request Combining in Multiprocessors with Arbitrary Interconnection Networks

Journal Article IEEE Transactions on Parallel and Distributed Systems · January 1, 1994 Several techniques have been proposed to allow parallel access to a shared memory location by combining requests. They have one or more of the following attributes: requirements for a priori knowledge of the request to combine, restrictions on the routing o ... Full text Cite

Cache Profiling and the SPEC Benchmarks: A Case Study

Journal Article Computer · January 1, 1994 Full text Cite

Application-specific protocols for user-level shared memory

Conference Proceedings of the ACM IEEE Supercomputing Conference · January 1, 1994 Recent distributed shared memory (DSM) systems and proposed shared-memory machines have implemented some or all of their cache coherence protocols in software. One way to exploit the flexibility of this software is to tailor a coherence protocol to match a ... Full text Cite

The Wisconsin wind tunnel: Virtual prototyping of parallel computers

Journal Article Proceedings of the 1993 ACM Sigmetrics Conference on Measurement and Modeling of Computer Systems Sigmetrics 1993 · June 1, 1993 We have developed a new technique for evaluating cache coherent, shared-memory computers. The Wisconsin Wind Tunnel (WWT) runs a parallel shared-memory program on a parallel computer (CM-5) and uses execution-driven, distributed, discrete-event simulation ... Full text Cite

Wisconsin Architectural Research Tool Set

Journal Article Comput. Archit. News (USA) · January 9, 1993 Wisconsin Architectural Research Tool Set (WARTS) is a collection of tools for profiling and tracing programs and analyzing program traces. WARTS currently contains: QPT, a program profiler and tracing system; CPROF, a cache performance profiler; and Tycho ... Full text Link to item Cite

Mechanisms for cooperative shared memory

Conference Conference Proceedings Annual Symposium on Computer Architecture · January 1, 1993 This paper explores the complexity of implementing directory protocols by examining their mechanisms - primitive operations on directories, caches, and network interfaces. We compare the following protocols: Dir1B, Dir4B, Dir4 Full text Cite

Inexpensive Implementations Of Set-Associativity

Conference The 16th Annual International Symposium on Computer Architecture Full text Cite