Journal Article · IEEE Micro · January 1, 2023
Remote direct memory access (RDMA) networks enable low latency and low central processing unit utilization, and their widespread adoption in datacenters enables improved application performance. However, there are performance isolation concerns for RDMA de ...
Conference · Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023 · January 1, 2023
Recent years have witnessed the wide adoption of RDMA in the cloud to accelerate first-party workloads and achieve cost savings by freeing up CPU cycles. Now cloud providers are working towards supporting RDMA in general-purpose guest VMs to benefit third- ...
Conference · Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium, IPDPS 2022 · January 1, 2022
Many data-intensive applications, such as distributed deep learning and data analytics, require moving vast amounts of data between compute servers in a distributed system. To meet the demands of these applications, datacenters are adopting Remote Direct M ...
Conference · International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS · April 19, 2021
Statistical machine learning often uses probabilistic models and algorithms, such as Markov Chain Monte Carlo (MCMC), to solve a wide range of problems. Probabilistic computations, often considered too slow on conventional processors, can be accelerated wi ...
Conference · Proceedings of the General Track: 2003 USENIX Annual Technical Conference · January 1, 2020
The global nature of energy creates challenges and opportunities for developing operating system policies to effectively manage energy consumption in battery-powered mobile/wireless devices. The proposed currentcy model creates the framework for the operat ...
Conference · Proceedings of the 14th EuroSys Conference 2019 · March 25, 2019
Distributed file systems often exhibit high tail latencies, especially in large-scale datacenters and in the presence of competing (and possibly higher priority) workloads. This paper introduces techniques for managing tail latencies in these systems, whil ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2019
Graphics Processing Units (GPUs) are energy-efficient massively parallel accelerators that are increasingly deployed in multi-tenant environments such as data-centers for general-purpose computing as well as graphics applications. Using GPUs in multi-tenan ...
Conference · Proceedings - International Symposium on Computer Architecture · July 19, 2018
The increasing difficulty in leveraging CMOS scaling for improved performance requires exploring alternative technologies. A promising technique is to exploit the physical properties of devices to specialize certain computations. A recently proposed approa ...
Conference · MobiSys 2018 - Proceedings of the 16th ACM International Conference on Mobile Systems, Applications, and Services · June 10, 2018
The most promising way to improve the performance of dynamic information-flow tracking (DIFT) for machine code is to only track instructions when they process tainted data. Unfortunately, prior approaches to on-demand DIFT are a poor match for modern mobil ...
Journal Article · Nano Letters · June 2017
We demonstrate an optically controlled molecular-scale pass gate that uses the photoinduced dark states of fluorescent molecules to modulate the flow of excitons. The device consists of four fluorophores spatially arranged on a self-assembled DNA nanostruc ...
Journal Article · IEEE Micro · January 1, 2017
As lithographic feature sizes approach fundamental scaling limits, a variety of computational domains remain incompatible with integrated circuits merely due to their operating principles. Resonance energy transfer (RET) logic offers a molecular-scale solu ...
Conference · Proceedings - 2016 43rd International Symposium on Computer Architecture, ISCA 2016 · August 24, 2016
The increasing use of probabilistic algorithms from statistics and machine learning for data analytics presents new challenges and opportunities for the design of computing systems. One important class of probabilistic machine learning algorithms is Markov ...
Journal Article · IEEE Micro · September 1, 2015
Despite the theoretical advances in probabilistic computing, a fundamental mismatch persists between the deterministic hardware that traditional computers use and the stochastic nature of probabilistic algorithms. In this article, the authors propose Reson ...
Open Access
Journal Article · International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS · March 14, 2015
Molecular-scale Network-on-Chip (mNoC) crossbars use quantum dot LEDs as an on-chip light source, and chromophores to provide optical signal filtering for receivers. An mNoC reduces power consumption or enables scaling to larger crossbars for a reduced ene ...
Open Access
Conference · International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS · March 14, 2014
Trends in increasing web traffic demand an increase in server throughput while preserving energy efficiency and total cost of ownership. Present work in optimizing data center efficiency primarily focuses on the data center as a whole, using off-the-shelf ...
Open Access
Journal Article · Journal of Parallel and Distributed Computing · January 1, 2014
Optical nanoscale computing is one promising alternative to the CMOS process. In this paper we explore the application of Resonance Energy Transfer (RET) logic to common digital circuits. We propose an Optical Logic Element (OLE) as a basic unit from which ...
Journal Article · IEEE Micro · January 1, 2011
Computer systems with virtual memory are susceptible to design bugs and runtime faults in their address translation systems. Detecting bugs and faults requires a clear specification of correct behavior. A new framework for address translation aware memory ...
Conference · Proceedings of the Annual International Symposium on Microarchitecture, MICRO · December 1, 2010
We propose an architectural design methodology for designing formally verifiable cache coherence protocols, called Fractal Coherence. Properly designed to be fractal in behavior, the proposed family of cache coherence protocols can be formally verified cor ...
Journal Article · IEEE Computer Architecture Letters · July 1, 2010
One of the most challenging problems in developing a multicore processor is verifying that the design is correct, and one of the most difficult aspects of pre-silicon verification is verifying that the memory system obeys the architecture's specified ...
Open Access
Conference · International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS · May 19, 2010
Computer systems with virtual memory are susceptible to design bugs and runtime faults in their address translation (AT) systems. Detecting bugs and faults requires a clear specification of correct behavior. To address this need, we develop a framework for ...
Journal Article · Small (Weinheim an der Bergstrasse, Germany) · April 2010
The self-assembly of molecularly precise nanostructures is widely expected to form the basis of future high-speed integrated circuits, but the technologies suitable for such circuits are not well understood. In this work, DNA self-assembly is used to creat ...
Journal Article · ACM Journal on Emerging Technologies in Computing Systems · March 1, 2010
The integration of novel nanotechnologies onto silicon platforms is likely to increase fabrication defects compared with traditional CMOS technologies. Furthermore, the number of nodes connected with these networks makes acquiring a global defect map impra ...
Journal Article · IEEE Micro · January 1, 2010
The authors explore nanoscale sensor processor (nSP) architectures. Their design includes a simple accumulator-based instruction-set architecture, sensors, limited memory, and instruction-fused sensing. Using nSP technology based on optical resonance energ ...
Open Access
Conference · Proceedings - International Symposium on High-Performance Computer Architecture · January 1, 2010
We propose UNITD, a unified hardware coherence framework that integrates translation coherence into the existing cache coherence protocol. In UNITD coherence protocols, the TLBs participate in the cache coherence protocol just like the instruction and data ...
Open Access
Conference · ACM SIGPLAN Notices · January 1, 2010
Computer systems with virtual memory are susceptible to design bugs and runtime faults in their address translation (AT) systems. Detecting bugs and faults requires a clear specification of correct behavior. To address this need, we develop a framework for ...
Conference · Proceedings - International Conference on Computer Communications and Networks, ICCCN · November 12, 2009
Shrinking CMOS feature sizes and the integration of novel nanotechnologies onto silicon platforms are both likely to increase fabrication defects. As a result, on-chip networks become more and more irregular due to defects and it becomes more challenging t ...
Conference · ACM SIGPLAN Notices · January 1, 2009
This paper explores the architectural implications of integrating computation and molecular probes to form nanoscale sensor processors (nSP). We show how nSPs may enable new computing domains and automate tasks that currently require expert scientific trai ...
Journal Article · International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS · January 1, 2009
This paper explores the architectural implications of integrating computation and molecular probes to form nanoscale sensor processors (nSP). We show how nSPs may enable new computing domains and automate tasks that currently require expert scientific trai ...
Chapter · March 25, 2008
This chapter summarizes current work on DNA-based self-assembly of computing systems. Section 2 presents a technology overview, specifically a discussion of nanoelectronic devices and desirable characteristics. It also describes two forms of DNA-based self ...
Book · 2008
The use of DNA self-assembly in microchip fabrication may well revolutionize computing, and this trail-blazing book is the first to bridge the gap between current chip technology and the molecular-scale circuitries that lie ahead. ...
Journal Article · IEEE Micro · January 1, 2008
Drawing on the nanometer-placement capabilities of self-assembly fabrication methods, the authors propose a new nanoscale device based on a single-molecule optical phenomenon called resonance energy transfer. This device enables a complete integrated techn ...
Journal Article · ACM Journal on Emerging Technologies in Computing Systems · July 1, 2007
The continual decrease in transistor size (through either scaled CMOS or emerging nanotechnologies) promises to usher in an era of tera to peta-scale integration but with increasing defects. Regardless of fabrication methodology (top-down or bottom-up), de ...
Journal Article · International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS · December 1, 2006
The continual decrease in transistor size (through either scaled CMOS or emerging nano-technologies) promises to usher in an era of tera to peta-scale integration. However, this decrease in size is also likely to increase defect densities, contributing to ...
Conference · 2006 1st International Conference on Nano-Networks and Workshops, Nano-Net · December 1, 2006
DNA-based self-assembly of nanoelectronic devices is an emerging technology that has the potential to enable tera- to peta-scale device integration. However, self-assembly currently is limited to manufacturing small computing blocks (nodes) which must then b ...
Other · Proceedings - Fourth ACM and IEEE International Conference on Formal Methods and Models for Co-Design, MEMOCODE'06 · December 1, 2006
With current CMOS technologies reaching beyond the 65-nanometer mark, emerging computing fabrics such as molecular and DNA-guided assemblies, quantum computing, and carbon-nanotube-based transistors are bringing the focus onto nanotechnology. The te ...
Journal Article · IEEE Transactions on Parallel and Distributed Systems · June 1, 2006
Spinning is a synchronization mechanism commonly used in applications and operating systems. Excessive spinning, however, often indicates performance or correctness (e.g., livelock) problems. Detecting if applications and operating systems are spinning is ...
Journal Article · ACM Journal on Emerging Technologies in Computing Systems · January 1, 2006
This article explores the architectural challenges introduced by emerging bottom-up fabrication of nanoelectronic circuits. The specific nanotechnology we explore proposes patterned DNA nanostructures as a scaffold for the placement and interconnection of ...
Journal Article · Proceedings - Design Automation Conference · January 1, 2006
DNA self-assembly is an emerging technology with potential as a future replacement of conventional lithographic fabrication. A key challenge is the specification of appropriate DNA sequences that are optimal according to specified metrics and satisfy vario ...
Conference · 2nd Conference on Foundations of Nanoscience: Self-Assembled Architectures and Devices, FNANO 2005 · December 1, 2005
We have designed and experimentally demonstrated the self-assembly of an addressable DNA lattice (i.e., a unique tile for each position in the lattice) using a two-step tile annealing procedure. Our method can be applied to a variety of systems including a ...
Journal Article · Computer · January 1, 2005
Despite the convenience of clean abstractions, technological trends are blurring the lines between design layers and creating new interactions between previously unrelated architecture layers. For example, virtual machines such as VMWare and Transmeta impl ...
Conference · USENIX 2005 Annual Technical Conference · January 1, 2005
Deadlock can occur wherever multiple processes interact. Most existing static and dynamic deadlock detection tools focus on simple types of deadlock, such as those caused by incorrect ordering of lock acquisitions. In this paper, we propose Pulse, a novel ...
Journal Article · Nanotechnology · September 1, 2004
The shift in technology away from silicon complementary metal-oxide semiconductors (CMOS) to novel nanoscale technologies requires new design tools. In this paper, we explore one particular nanotechnology: carbon nanotube transistors that are self-assemble ...
Conference · 2004 IEEE International Symposium on Performance Analysis of Systems and Software · June 14, 2004
There is increasing concern among developers that future web servers running commercial workloads may be limited by network processing overhead in the CPU as 10Gb ethernet becomes prevalent. We analyze CPU usage of real hardware running popular commercial ...
Journal Article · IEEE Trans. Parallel Distrib. Syst. (USA) · 2004
Network performance in tightly-coupled multiprocessors typically degrades rapidly beyond network saturation. Consequently, designers must keep a network below its saturation point by reducing the load on the network. Congestion control via source throttlin ...
Journal Article · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2004
Energy consumption is becoming a limiting factor in the development of computer systems for a range of application domains. Since processor performance comes with a high power cost, there is increased interest in scaling the CPU voltage and clock frequency ...
Journal Article · ACM Transactions on Architecture and Code Optimization · January 1, 2004
Prefetching is often used to overlap memory latency with computation for array-based applications. However, prefetching for pointer-intensive applications remains a challenge because of the irregular memory access pattern and pointer-chasing problem. In th ...
Conference · Annual ACM Symposium on Parallel Algorithms and Architectures · December 1, 2003
A model was created for determining criticality in MP systems. An algorithm was devised for computing criticality and criticality of real MP workloads was evaluated. A directed acyclic graph (DAG) model for executing: critical path and slack; mapping DAGs ...
Conference · Annual ACM Symposium on Parallel Algorithms and Architectures · January 1, 2003
Recent research on processor microarchitecture suggests using instruction criticality as a metric to guide hardware control policies. Fields et al. [3, 4] have proposed a directed acyclic graph (DAG) model for characterizing program microexecutions on unip ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 2003
Modern DRAM technologies offer power management features for optimization between performance and energy consumption. This paper employs Petri nets to model and evaluate memory controller policies for manipulating multiple power states. The model has been ...
Conference · Proceedings - International Parallel and Distributed Processing Symposium, IPDPS 2003 · January 1, 2003
High performance, freedom from deadlocks, and freedom from livelocks are desirable properties of interconnection networks. Unfortunately, these can be conflicting goals because networks may either devote or under-utilize resources to avoid deadlocks and li ...
Conference · International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS · December 1, 2002
Energy consumption has recently been widely recognized as a major challenge of computer systems design. This paper explores how to support energy as a first-class operating system resource. Energy, because of its global system nature, presents challenges b ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · December 1, 2002
Prefetching is often used to overlap memory latency with computation for array-based applications. However, prefetching for pointer-intensive applications remains a challenge because of the irregular memory access pattern and pointer-chasing problem. In th ...
Conference · Operating Systems Review (ACM) · December 1, 2002
Energy consumption has recently been widely recognized as a major challenge of computer systems design. This paper explores how to support energy as a first-class operating system resource. Energy, because of its global system nature, presents challenges b ...
Journal Article · IEEE Transactions on Parallel and Distributed Systems · November 1, 2002
The performance of both serial and parallel implementations of matrix multiplication is highly sensitive to memory system behavior. False sharing and cache conflicts cause traditional column-major or row-major array layouts to incur high variability in mem ...
Conference · Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA · January 1, 2002
Instruction window size is an important design parameter for many modern processors. Large instruction windows offer the potential advantage of exposing large amounts of instruction level parallelism. Unfortunately, naively scaling conventional window desi ...
Conference · SIGPLAN Notices (ACM Special Interest Group on Programming Languages) · January 1, 2001
We develop from first principles an exact model of the behavior of loop nests executing in a memory hierarchy, by using a nontraditional classification of misses that has the key property of composability. We use Presburger formulas to express various kind ...
Journal Article · Journal of Computer and System Sciences · January 1, 2001
In this paper we construct an analytic model of cache misses during matrix multiplication. The analysis in this paper applies to square matrices of size m where the array layout function is given in terms of a function Θ that interleaves the bits in the bi ...
Conference · Proceedings of the International Symposium on Low Power Electronics and Design, Digest of Technical Papers · January 1, 2001
The increasing importance of energy efficiency has produced a multitude of hardware devices with various power management features. This paper investigates memory controller policies for manipulating DRAM power states in cache-based systems. We develop an ...
Conference · Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA · January 1, 2001
Current memory hierarchies exploit locality of references to reduce load latency and thereby improve processor performance. Locality based schemes aim at reducing the number of cache misses and tend to ignore the nature of misses. This leads to a potential ...
Conference · IEEE High-Performance Computer Architecture Symposium Proceedings · January 1, 2001
Network performance in tightly-coupled multiprocessors typically degrades rapidly beyond network saturation. Consequently, designers must keep a network below its saturation point by reducing the load on the network. Congestion control via source throttlin ...
Conference · Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI) · January 1, 2001
We develop from first principles an exact model of the behavior of loop nests executing in a memory hierarchy, by using a nontraditional classification of misses that has the key property of composability. We use Presburger formulas to express various kind ...
Conference · Proceedings of the 9th Workshop on ACM SIGOPS European Workshop: Beyond the PC: New Challenges for the Operating System, EW 2000 · September 17, 2000
This paper advocates revisiting all aspects of Operating System design and implementation with energy-efficiency as the primary objective rather than the traditional OS metrics of maximizing performance and fairness. Energy is an increasingly important res ...
Journal Article · IEEE Trans. Comput. (USA) · September 2000
Three-dimensional (3D) graphics applications have become very important workloads running on today's computer systems. A cost-effective graphics solution is to perform geometry processing of 3D graphics on the host CPU and have specialized hardware handle ...
Conference · Proceedings of the International Conference on Supercomputing · January 1, 2000
As the performance gap between the CPU and main memory continues to grow, techniques to hide memory latency are essential to deliver a high performance computer system. Prefetching can often overlap memory latency with computation for array-based numeric a ...
Conference · SIGPLAN Notices (ACM Special Interest Group on Programming Languages) · January 1, 2000
One of the major challenges of post-PC computing is the need to reduce energy consumption, thereby extending the lifetime of the batteries that power these mobile devices. Memory is a particularly important target for efforts to improve energy efficiency. ...
Journal Article · Journal of Instruction-Level Parallelism · October 1, 1999
This paper provides a quantitative evaluation of load latency tolerance in a dynamically scheduled processor. To determine the latency tolerance of each memory load operation, our simulations use flexible load completion policies instead of a fixed memory ...
Conference · SIGCSE 1999 - Proceedings of the 13th SIGCSE Technical Symposium on Computer Science Education · March 24, 1999
The wide-spread use of microprocessor based systems that utilize cache memory to alleviate excessively long DRAM access times introduces a new dimension in the quest to obtain good program performance. To fully exploit the performance potential of these fa ...
Conference · SIGCSE Bulletin (Association for Computing Machinery, Special Interest Group on Computer Science Education) · January 1, 1999
The wide-spread use of microprocessor based systems that utilize cache memory to alleviate excessively long DRAM access times introduces a new dimension in the quest to obtain good program performance. To fully exploit the performance potential of these fa ...
Conference · Annual ACM Symposium on Parallel Algorithms and Architectures · January 1, 1999
Matrix multiplication is an important kernel in linear algebra algorithms, and the performance of both serial and parallel implementations is highly dependent on the memory system behavior. Unfortunately, due to false sharing and cache conflicts, tradition ...
Conference · Proceedings of the International Conference on Supercomputing · January 1, 1999
Programming languages that provide multidimensional arrays and a flat linear model of memory must implement a mapping between these two domains to order array elements in memory. This layout function is fixed at language definition time and constitutes an ...
Conference · Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics) · January 1, 1999
As the importance of cache performance increases, allowing software to assist in cache management decisions becomes an attractive alternative. This paper focuses primarily on a mechanism for software to convey information to the memory hierarchy. We introd ...
Conference · Proceedings of the Annual International Symposium on Microarchitecture · December 1, 1998
Three dimensional (3D) graphics applications have become very important workloads running on today's computer systems. A cost-effective graphics solution is to perform geometry processing of 3D graphics on the host CPU and have specialized hardware handle ...
Conference · Proceedings of the Annual International Symposium on Microarchitecture · December 1, 1998
This paper provides quantitative measurements of load latency tolerance in a dynamically scheduled processor. To determine the latency tolerance of each memory load operation, our simulations use flexible load completion policies instead of a fixed memory ...
Conference · Proceedings of the International Conference on Supercomputing · January 1, 1998
Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement the algorithm efficiently on a modern machine with a hierarchical memory system. W ...
Journal Article · ACM Transactions on Modeling and Computer Simulation · January 1, 1997
This article describes the active memory abstraction for memory-system simulation. In this abstraction - designed specifically for on-the-fly simulation - memory references logically invoke a user-specified function depending upon the reference's type and ...
ConferenceIEEE International Symposium on High Performance Distributed Computing, Proceedings · January 1, 1997
New network technology continues to improve both the latency and bandwidth of communication in computer clusters. The fastest high-speed networks approach or exceed the I/O bus bandwidths of 'gigabit-ready' hosts. These advances introduce new consideration ...
ConferenceProceedings of the 1995 ACM SIGMETRICS Joint International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 1995/PERFORMANCE 1995 · May 1, 1995
This paper describes the active memory abstraction for memory-system simulation. In this abstraction - designed specifically for on-the-fly simulation - memory references logically invoke a user-specified function depending upon the reference's type and acces ...
ConferenceACM SIGARCH (Association for Computing Machinery Special Interest Group on Computer Architecture) - Conference Proceedings · January 1, 1995
This paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache blo ...
ConferenceConference Proceedings - Annual International Symposium on Computer Architecture, ISCA · January 1, 1995
This paper introduces dynamic self-invalidation (DSI), a new technique for reducing cache coherence overhead in shared-memory multiprocessors. DSI eliminates invalidation messages by having a processor automatically invalidate its local copy of a cache blo ...
ConferenceInternational Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS · November 1, 1994
This paper discusses implementations of fine-grain memory access control, which selectively restricts reads and writes to cache-block-sized memory regions. Fine-grain access control forms the basis of efficient cache-coherent shared memory. This paper focu ...
Journal ArticleIEEE Transactions on Parallel and Distributed Systems · January 1, 1994
Several techniques have been proposed to allow parallel access to a shared memory location by combining requests. They have one or more of the following attributes: requirements for a priori knowledge of the request to combine, restrictions on the routing o ...
ConferenceProceedings of the ACM/IEEE Supercomputing Conference · January 1, 1994
Recent distributed shared memory (DSM) systems and proposed shared-memory machines have implemented some or all of their cache coherence protocols in software. One way to exploit the flexibility of this software is to tailor a coherence protocol to match a ...
Journal ArticleProceedings of the 1993 ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, SIGMETRICS 1993 · June 1, 1993
We have developed a new technique for evaluating cache coherent, shared-memory computers. The Wisconsin Wind Tunnel (WWT) runs a parallel shared-memory program on a parallel computer (CM-5) and uses execution-driven, distributed, discrete-event simulation ...
Journal ArticleComput. Archit. News (USA) · January 9, 1993
Wisconsin Architectural Research Tool Set (WARTS) is a collection of tools for profiling and tracing programs and analyzing program traces. WARTS currently contains: QPT, a program profiler and tracing system; CPROF, a cache performance profiler; and Tycho ...
ConferenceConference Proceedings - Annual Symposium on Computer Architecture · January 1, 1993
This paper explores the complexity of implementing directory protocols by examining their mechanisms - primitive operations on directories, caches, and network interfaces. We compare the following protocols: Dir1B, Dir4B, Dir4NB, DirnNB, Dir1SW and an impr ...