Brian Towles

Conference Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation NSDI 2024 · January 1, 2024 TPUv4 (Tensor Processing Unit) is Google’s 3rd generation accelerator for machine learning training, deployed as a 4096-node supercomputer with a custom 3D torus interconnect. In this paper, we describe our experience designing and operating the software i ...

TPU v4: An Optically Reconfigurable Supercomputer for Machine Learning with Hardware Support for Embeddings

Conference Proceedings International Symposium on Computer Architecture · June 17, 2023 In response to innovations in machine learning (ML) models, production workloads changed radically and rapidly. TPU v4 is the fifth Google domain specific architecture (DSA) and its third supercomputer for such ML models. Optical circuit switches (OCSes) d ...

The Specialized High-Performance Network on Anton 3

Conference Proceedings International Symposium on High Performance Computer Architecture · January 1, 2022 Molecular dynamics (MD) simulation, a computationally intensive method that provides invaluable insights into the behavior of biomolecules, typically requires large-scale parallelization. Implementation of fast parallel MD simulation demands both high band ...

Anton 3: Twenty Microseconds of Molecular Dynamics Simulation before Lunch

Conference International Conference for High Performance Computing Networking Storage and Analysis SC · November 14, 2021 Anton 3 is the newest member in a family of supercomputers specially designed for atomic-level simulation of molecules relevant to biology (e.g., DNA, proteins, and drug molecules). Anton 3 achieves order-of-magnitude improvements in time-to-solution over ...

The ΛnTON 3 ASIC: A fire-breathing monster for molecular dynamics simulations

Conference 2021 IEEE Hot Chips 33 Symposium HCS 2021 · August 22, 2021

The u-series: A separable decomposition for electrostatics computation with improved accuracy

Journal Article The Journal of Chemical Physics · February 2020 The evaluation of electrostatic energy for a set of point charges in a periodic lattice is a computationally expensive part of molecular dynamics simulations (and other applications) because of the long-range nature of the Coulomb interaction. A standard a ...

Filtering, Reductions and Synchronization in the Anton 2 Network

Conference Proceedings 2015 IEEE 29th International Parallel and Distributed Processing Symposium IPDPS 2015 · July 17, 2015 Parallel implementations of molecular dynamics (MD) simulation require significant inter-node communication, but off-chip communication bandwidth is not scaling as quickly as on-chip logic density. We present three network features targeting this problem t ...

The ANTON 2 chip a second-generation ASIC for molecular dynamics

Conference 2014 IEEE Hot Chips 26 Symposium HCS 2014 · May 25, 2014 This article consists of a collection of slides from the author's conference presentation on the special features, supercomputing capabilities, system design and architectures, processing capabilities, and targeted markets for D.E. Shaw Research's ANTON2 c ...

Anton 2: Raising the Bar for Performance and Programmability in a Special-Purpose Molecular Dynamics Supercomputer

Conference International Conference for High Performance Computing Networking Storage and Analysis SC · January 16, 2014 Anton 2 is a second-generation special-purpose supercomputer for molecular dynamics simulations that achieves significant gains in performance, programmability, and capacity compared to its predecessor, Anton 1. The architecture of Anton 2 is tailored for ...

Unifying on-chip and inter-node switching within the Anton 2 network

Conference Proceedings International Symposium on Computer Architecture · January 1, 2014 The design of network architectures has become increasingly complex as the chips connected by inter-node networks have emerged as distributed systems in their own right, complete with their own on-chip networks. In Anton 2, a massively parallel special-pur ...

The role of cascade, a cycle-based simulation infrastructure, in designing the anton special-purpose supercomputers

Conference Proceedings Design Automation Conference · July 12, 2013 Cascade is a cycle-based C++ simulation infrastructure used in the design and verification of two successive versions of Anton, a specialized machine designed for high-speed molecular dynamics computation. Cascade was engineered to address the size and spe ...

Hardware support for fine-grained event-driven computation in anton 2

Conference ACM SIGPLAN Notices · April 1, 2013 Exploiting parallelism to accelerate a computation typically involves dividing it into many small tasks that can be assigned to different processing elements. An efficient execution schedule for these tasks can be difficult or impossible to determine in ad ...

Hardware support for fine-grained event-driven computation in anton 2

Conference International Conference on Architectural Support for Programming Languages and Operating Systems ASPLOS · March 16, 2013 Exploiting parallelism to accelerate a computation typically involves dividing it into many small tasks that can be assigned to different processing elements. An efficient execution schedule for these tasks can be difficult or impossible to determine in ad ...

A detailed and flexible cycle-accurate Network-on-Chip simulator

Conference ISPASS 2013 IEEE International Symposium on Performance Analysis of Systems and Software · January 1, 2013 Network-on-Chips (NoCs) are becoming integral parts of modern microprocessors as the number of cores and modules integrated on a single chip continues to increase. Research and development of future NoC technology relies on accurate modeling and simulation ...

Overcoming communication latency barriers in massively parallel scientific computation

Journal Article IEEE Micro · May 1, 2011 Anton, a massively parallel special-purpose machine that accelerates molecular dynamics simulations by orders of magnitude, uses a combination of specialized hardware mechanisms and restructured software algorithms to reduce and hide communication latency. ...

Exploiting 162-nanosecond end-to-end communication latency on Anton

Conference 2010 ACM IEEE International Conference for High Performance Computing Networking Storage and Analysis SC 2010 · December 1, 2010 Strong scaling of scientific applications on parallel architectures is increasingly limited by communication latency. This paper describes the techniques used to mitigate latency in Anton, a massively parallel special-purpose machine that accelerates molec ...

Millisecond-scale molecular dynamics simulations on Anton

Conference Proceedings of the Conference on High Performance Computing Networking Storage and Analysis SC 09 · December 1, 2009 Anton is a recently completed special-purpose supercomputer designed for molecular dynamics (MD) simulations of biomolecular systems. The machine's specialized hardware dramatically increases the speed of MD calculations, making possible for the first time ...

Hierarchical simulation-based verification of anton, a special-purpose parallel machine

Conference 26th IEEE International Conference on Computer Design 2008 ICCD · December 1, 2008 One of the major design verification challenges in the development of Anton, a massively parallel special-purpose machine for molecular dynamics, was to provide evidence that computations spanning more than a quadrillion clock cycles will produce valid sci ...

Anton, a special-purpose machine for molecular dynamics simulation

Journal Article Communications of the ACM · July 1, 2008 The ability to perform long, accurate molecular dynamics (MD) simulations involving proteins and other biological macro-molecules could in principle provide answers to some of the most important currently outstanding questions in the fields of biology, che ...

Anton, a special-purpose machine for molecular dynamics simulation

Conference Proceedings International Symposium on Computer Architecture · October 22, 2007 The ability to perform long, accurate molecular dynamics (MD) simulations involving proteins and other biological macro-molecules could in principle provide answers to some of the most important currently outstanding questions in the fields of biology, che ...

Microarchitecture of a high-radix router

Conference Proceedings International Symposium on Computer Architecture · November 10, 2005 Evolving semiconductor and circuit technology has greatly increased the pin bandwidth available to a router chip. In the early 90s, routers were limited to 10Gb/s of pin bandwidth. Today 1Tb/s is feasible, and we expect 20Tb/s of I/O bandwidth by 2010. A h ...

Globally Adaptive Load-Balanced Routing on Tori

Journal Article IEEE Computer Architecture Letters · January 1, 2004 We introduce a new method of adaptive routing on k-ary n-cubes, Globally Adaptive Load-Balance (GAL). GAL makes global routing decisions using global information. In contrast, most previous adaptive routing algorithms make local routing decisions using loc ...

Adaptive channel queue routing on k-ary n-cubes

Conference Annual ACM Symposium on Parallel Algorithms and Architectures · January 1, 2004 This paper introduces a new adaptive method, Channel Queue Routing (CQR), for load-balanced routing on k-ary n-cube interconnection networks. CQR estimates global congestion in the network from its channel queues while relying on the implicit network backp ...

Guaranteed scheduling for switches with configuration overhead

Journal Article IEEE ACM Transactions on Networking · October 1, 2003 In this paper, we present three algorithms that provide performance guarantees for scheduling switches, such as optical switches, with configuration overhead. Each algorithm emulates an unconstrained (zero overhead) switch by accumulating a batch of config ...

GOAL: A load-balanced adaptive routing algorithm for torus networks

Conference Conference Proceedings Annual International Symposium on Computer Architecture ISCA · July 18, 2003 We introduce a load-balanced adaptive routing algorithm for torus networks, GOAL - Globally Oblivious Adaptive Locally - that provides high throughput on adversarial traffic patterns, matching or exceeding fully randomized routing and exceeding the worst-c ...

Exploring the VLSI scalability of stream processors

Conference Proceedings International Symposium on High Performance Computer Architecture · January 1, 2003 Stream processors are high-performance programmable processors optimized to run media applications. Recent work has shown these processors to be more area- and energy-efficient than conventional programmable architectures. This paper explores the scalabilit ...

Throughput-centric routing algorithm design

Conference Annual ACM Symposium on Parallel Algorithms and Architectures · January 1, 2003 The increasing application space of interconnection networks now encompasses several applications, such as packet routing and I/O interconnect, where the throughput of a routing algorithm, not just its locality, becomes an important performance metric. We ...

Comparing Reyes and OpenGL on a stream architecture

Conference Proceedings of the SIGGRAPH Eurographics Workshop on Graphics Hardware · December 1, 2002 The OpenGL and Reyes rendering pipelines each render complex scenes from similar scene descriptions but differ in their internal pipeline organizations. While the OpenGL organization has dominated hardware architectures over the past twenty years, a Reyes ...

Worst-case traffic for oblivious routing functions

Journal Article IEEE Computer Architecture Letters · January 1, 2002 This paper presents an algorithm to find a worst-case traffic pattern for any oblivious routing algorithm on an arbitrary interconnection network topology. The linearity of channel loading offered by oblivious routing algorithms enables the problem to be m ...

VLSI design and verification of the imagine processor

Conference Proceedings IEEE International Conference on Computer Design VLSI in Computers and Processors · January 1, 2002 The Imagine stream processor is a 21 million transistor chip implemented by a collaboration between Stanford University and Texas Instruments in a 1.5V 0.15 μm process with five layers of aluminum metal. The VLSI design, clocking, and verification methodol ...

Worst-case traffic for oblivious routing functions

Conference Annual ACM Symposium on Parallel Algorithms and Architectures · January 1, 2002 This paper presents an algorithm to find a worst-case traffic pattern for any oblivious routing algorithm on an arbitrary interconnection network topology. The linearity of channel loading offered by oblivious routing algorithms enables the problem to be m ...

Scalable opto-electronic network (SOENet)

Conference Proceedings Symposium on the High Performance Interconnects Hot Interconnects · January 1, 2002 In applications such as processor-memory interconnect, I/O networks, and router switch fabrics, an interconnection network must be scalable to thousands of high-bandwidth terminals while at the same time being economical in small configurations and robust ...

Guaranteed scheduling for switches with configuration overhead

Conference Proceedings IEEE INFOCOM · January 1, 2002 In this paper we present three algorithms that provide performance guarantees for scheduling switches, such as optical switches, with configuration overhead. Each algorithm emulates an unconstrained (zero overhead) switch by accumulating a batch of configu ...

Locality-preserving randomized oblivious routing on torus networks

Conference Annual ACM Symposium on Parallel Algorithms and Architectures · January 1, 2002 We introduce Randomized Local Balance (RLB), a routing algorithm that strikes a balance between locality and load balance in torus networks, and analyze RLB's performance for benign and adversarial traffic permutations. Our results show that RLB outperforms ...

Media processing applications on the imagine stream processor

Journal Article Proceedings IEEE International Conference on Computer Design VLSI in Computers and Processors · January 1, 2002 Media applications, such as image processing, signal processing, video, and graphics, require high computation rates and data bandwidths. The stream programming model is a natural and powerful way to describe these applications. Expressing media applicatio ...

Imagine: Media processing with streams

Journal Article IEEE Micro · March 1, 2001 The Imagine stream processor was developed to achieve high performance densities for media applications. The Imagine processor consists of a programming model, software tools, and an architecture, all designed to operate on streams. A stream program organizes data a ...

Route packets, not wires: On-chip interconnection networks

Conference Proceedings Design Automation Conference · January 1, 2001 Using on-chip interconnection networks in place of ad-hoc global wiring structures the top level wires on a chip and facilitates modular design. With this approach, system modules (processors, memories, peripherals, etc.) communicate by sending packets t ...