Conference · Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024 · January 1, 2024
TPUv4 (Tensor Processing Unit) is Google’s 3rd generation accelerator for machine learning training, deployed as a 4096-node supercomputer with a custom 3D torus interconnect. In this paper, we describe our experience designing and operating the software i ...
Conference · Proceedings - International Symposium on Computer Architecture · June 17, 2023
In response to innovations in machine learning (ML) models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) d ...
Conference · Proceedings - International Symposium on High-Performance Computer Architecture · January 1, 2022
Molecular dynamics (MD) simulation, a computationally intensive method that provides invaluable insights into the behavior of biomolecules, typically requires large-scale parallelization. Implementation of fast parallel MD simulation demands both high band ...
Conference · International Conference for High Performance Computing, Networking, Storage and Analysis, SC · November 14, 2021
Anton 3 is the newest member in a family of supercomputers specially designed for atomic-level simulation of molecules relevant to biology (e.g., DNA, proteins, and drug molecules). Anton 3 achieves order-of-magnitude improvements in time-to-solution over ...
Journal Article · The Journal of Chemical Physics · February 2020
The evaluation of electrostatic energy for a set of point charges in a periodic lattice is a computationally expensive part of molecular dynamics simulations (and other applications) because of the long-range nature of the Coulomb interaction. A standard a ...
Conference · Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015 · July 17, 2015
Parallel implementations of molecular dynamics (MD) simulation require significant inter-node communication, but off-chip communication bandwidth is not scaling as quickly as on-chip logic density. We present three network features targeting this problem t ...
Conference · 2014 IEEE Hot Chips 26 Symposium, HCS 2014 · May 25, 2014
This article consists of a collection of slides from the author's conference presentation on the special features, supercomputing capabilities, system design and architectures, processing capabilities, and targeted markets for D.E. Shaw Research's ANTON2 c ...
Conference · International Conference for High Performance Computing, Networking, Storage and Analysis, SC · January 16, 2014
Anton 2 is a second-generation special-purpose supercomputer for molecular dynamics simulations that achieves significant gains in performance, programmability, and capacity compared to its predecessor, Anton 1. The architecture of Anton 2 is tailored for ...
Conference · Proceedings - International Symposium on Computer Architecture · January 1, 2014
The design of network architectures has become increasingly complex as the chips connected by inter-node networks have emerged as distributed systems in their own right, complete with their own on-chip networks. In Anton 2, a massively parallel special-pur ...
Conference · Proceedings - Design Automation Conference · July 12, 2013
Cascade is a cycle-based C++ simulation infrastructure used in the design and verification of two successive versions of Anton, a specialized machine designed for high-speed molecular dynamics computation. Cascade was engineered to address the size and spe ...
Conference · International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS · April 5, 2013
Exploiting parallelism to accelerate a computation typically involves dividing it into many small tasks that can be assigned to different processing elements. An efficient execution schedule for these tasks can be difficult or impossible to determine in ad ...
Conference · ACM SIGPLAN Notices · April 1, 2013
Exploiting parallelism to accelerate a computation typically involves dividing it into many small tasks that can be assigned to different processing elements. An efficient execution schedule for these tasks can be difficult or impossible to determine in ad ...
Conference · ISPASS 2013 - IEEE International Symposium on Performance Analysis of Systems and Software · January 1, 2013
Network-on-Chips (NoCs) are becoming integral parts of modern microprocessors as the number of cores and modules integrated on a single chip continues to increase. Research and development of future NoC technology relies on accurate modeling and simulation ...
Journal Article · IEEE Micro · May 1, 2011
Anton, a massively parallel special-purpose machine that accelerates molecular dynamics simulations by orders of magnitude, uses a combination of specialized hardware mechanisms and restructured software algorithms to reduce and hide communication latency. ...
Conference · 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2010 · December 1, 2010
Strong scaling of scientific applications on parallel architectures is increasingly limited by communication latency. This paper describes the techniques used to mitigate latency in Anton, a massively parallel special-purpose machine that accelerates molec ...
Conference · Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis, SC '09 · December 1, 2009
Anton is a recently completed special-purpose supercomputer designed for molecular dynamics (MD) simulations of biomolecular systems. The machine's specialized hardware dramatically increases the speed of MD calculations, making possible for the first time ...
Conference · 26th IEEE International Conference on Computer Design 2008, ICCD · December 1, 2008
One of the major design verification challenges in the development of Anton, a massively parallel special-purpose machine for molecular dynamics, was to provide evidence that computations spanning more than a quadrillion clock cycles will produce valid sci ...
Journal Article · Communications of the ACM · July 1, 2008
The ability to perform long, accurate molecular dynamics (MD) simulations involving proteins and other biological macro-molecules could in principle provide answers to some of the most important currently outstanding questions in the fields of biology, che ...
Conference · Proceedings - International Symposium on Computer Architecture · October 22, 2007
The ability to perform long, accurate molecular dynamics (MD) simulations involving proteins and other biological macro-molecules could in principle provide answers to some of the most important currently outstanding questions in the fields of biology, che ...
Conference · Proceedings - International Symposium on Computer Architecture · November 10, 2005
Evolving semiconductor and circuit technology has greatly increased the pin bandwidth available to a router chip. In the early 90s, routers were limited to 10Gb/s of pin bandwidth. Today 1Tb/s is feasible, and we expect 20Tb/s of I/O bandwidth by 2010. A h ...
Journal Article · IEEE Computer Architecture Letters · January 1, 2004
We introduce a new method of adaptive routing on k-ary n-cubes, Globally Adaptive Load-Balance (GAL). GAL makes global routing decisions using global information. In contrast, most previous adaptive routing algorithms make local routing decisions using loc ...
Conference · Annual ACM Symposium on Parallel Algorithms and Architectures · January 1, 2004
This paper introduces a new adaptive method, Channel Queue Routing (CQR), for load-balanced routing on k-ary n-cube interconnection networks. CQR estimates global congestion in the network from its channel queues while relying on the implicit network backp ...
Journal Article · IEEE/ACM Transactions on Networking · October 1, 2003
In this paper, we present three algorithms that provide performance guarantees for scheduling switches, such as optical switches, with configuration overhead. Each algorithm emulates an unconstrained (zero overhead) switch by accumulating a batch of config ...
Conference · Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA · July 18, 2003
We introduce a load-balanced adaptive routing algorithm for torus networks, GOAL - Globally Oblivious Adaptive Locally - that provides high throughput on adversarial traffic patterns, matching or exceeding fully randomized routing and exceeding the worst-c ...
Conference · Proceedings - International Symposium on High-Performance Computer Architecture · January 1, 2003
Stream processors are high-performance programmable processors optimized to run media applications. Recent work has shown these processors to be more area- and energy-efficient than conventional programmable architectures. This paper explores the scalabilit ...
Conference · Annual ACM Symposium on Parallel Algorithms and Architectures · January 1, 2003
The increasing application space of interconnection networks now encompasses several applications, such as packet routing and I/O interconnect, where the throughput of a routing algorithm, not just its locality, becomes an important performance metric. We ...
Conference · Proceedings of the SIGGRAPH/Eurographics Workshop on Graphics Hardware · December 1, 2002
The OpenGL and Reyes rendering pipelines each render complex scenes from similar scene descriptions but differ in their internal pipeline organizations. While the OpenGL organization has dominated hardware architectures over the past twenty years, a Reyes ...
Journal Article · IEEE Computer Architecture Letters · January 1, 2002
This paper presents an algorithm to find a worst-case traffic pattern for any oblivious routing algorithm on an arbitrary interconnection network topology. The linearity of channel loading offered by oblivious routing algorithms enables the problem to be m ...
Conference · Proceedings - IEEE International Conference on Computer Design: VLSI in Computers and Processors · January 1, 2002
The Imagine stream processor is a 21 million transistor chip implemented by a collaboration between Stanford University and Texas Instruments in a 1.5V 0.15 μm process with five layers of aluminum metal. The VLSI design, clocking, and verification methodol ...
Conference · Annual ACM Symposium on Parallel Algorithms and Architectures · January 1, 2002
This paper presents an algorithm to find a worst-case traffic pattern for any oblivious routing algorithm on an arbitrary interconnection network topology. The linearity of channel loading offered by oblivious routing algorithms enables the problem to be m ...
Conference · Proceedings - Symposium on the High Performance Interconnects, Hot Interconnects · January 1, 2002
In applications such as processor-memory interconnect, I/O networks, and router switch fabrics, an interconnection network must be scalable to thousands of high-bandwidth terminals while at the same time being economical in small configurations and robust ...
Conference · Proceedings - IEEE INFOCOM · January 1, 2002
In this paper we present three algorithms that provide performance guarantees for scheduling switches, such as optical switches, with configuration overhead. Each algorithm emulates an unconstrained (zero overhead) switch by accumulating a batch of configu ...
Conference · Annual ACM Symposium on Parallel Algorithms and Architectures · January 1, 2002
We introduce Randomized Local Balance (RLB), a routing algorithm that strikes a balance between locality and load balance in torus networks, and analyze RLB's performance for benign and adversarial traffic permutations. Our results show that RLB outperforms ...
Journal Article · Proceedings - IEEE International Conference on Computer Design: VLSI in Computers and Processors · January 1, 2002
Media applications, such as image processing, signal processing, video, and graphics, require high computation rates and data bandwidths. The stream programming model is a natural and powerful way to describe these applications. Expressing media applicatio ...
Journal Article · IEEE Micro · March 1, 2001
The Imagine stream processor is developed to achieve high performance densities for media applications. The Imagine processor consists of a programming model, software tools, and an architecture, all designed to operate on streams. A stream program organizes data a ...
Conference · Proceedings - Design Automation Conference · January 1, 2001
Using on-chip interconnection networks in place of ad-hoc global wiring structures the top level wires on a chip and facilitates modular design. With this approach, system modules (processors, memories, peripherals, etc.) communicate by sending packets t ...