Danyang Zhuo

Conference ACM SIGCOMM 2024 - Proceedings of the 2024 ACM SIGCOMM 2024 Conference · August 4, 2024 Performance of collective communication is critical for distributed systems. Using libraries to implement collective communication algorithms is not a good fit for a multi-tenant cloud environment because the tenant is not aware of the underlying physical ...

Conveyor: Efficient Tool-aware LLM Serving with Tool Partial Execution

Preprint · May 29, 2024

Enoki: High Velocity Linux Kernel Scheduler Development

Conference EuroSys 2024 - Proceedings of the 2024 European Conference on Computer Systems · April 22, 2024 Kernel task scheduling is important for application performance, adaptability to new hardware, and complex user requirements. However, developing, testing, and debugging new scheduling algorithms in Linux, the most widely used cloud operating system, is sl ...

Harmonic: Hardware-assisted RDMA Performance Isolation for Public Clouds

Conference Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation, NSDI 2024 · January 1, 2024 Performance isolation is essential for sharing resources in multi-tenant public clouds. Compared with traditional kernel-based networking, RDMA presents unique challenges especially because RDMA NIC’s complex microarchitecture resources are often hidden fr ...

Fairness in Serving Large Language Models

Conference Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2024 · January 1, 2024 High-demand LLM inference services (e.g., ChatGPT and BARD) support a wide range of requests from short chat conversations to long document reading. To ensure that all client requests are processed fairly, most major LLM inference services have request rat ...

HAL: Hardware-assisted Load Balancing for Energy-efficient SNIC-Host Cooperative Computing

Conference Proceedings - International Symposium on Computer Architecture · January 1, 2024 A typical SmartNIC (SNIC) integrates a processor comprising Arm CPU and accelerators with a conventional NIC. The processor is designed to energy-efficiently execute network functions frequently used by datacenter applications. With such a processor, the S ...

Application Defined Networks

Conference HotNets 2023 - Proceedings of the 22nd ACM Workshop on Hot Topics in Networks · November 28, 2023 With the rise of microservices, the execution environment of many cloud applications has become a set of virtual machines or containers connected by a flexible and feature-rich virtual network. We argue that the implementation of such virtual networks shou ...

Dissecting Overheads of Service Mesh Sidecars

Conference SoCC 2023 - Proceedings of the 2023 ACM Symposium on Cloud Computing · October 30, 2023 Service meshes play a central role in the modern application ecosystem by providing an easy and flexible way to connect microservices of a distributed application. However, because of how they interpose on application traffic, they can substantially increa ...

Punica: Multi-Tenant LoRA Serving

Preprint · October 27, 2023

Towards a Manageable Intra-Host Network

Conference HotOS 2023 - Proceedings of the 19th Workshop on Hot Topics in Operating Systems · June 22, 2023 Intra-host networks, including heterogeneous devices and interconnect fabrics, have become increasingly complex and crucial. However, intra-host networks today do not provide sufficient manageability. This prevents data center operators from running a reli ...

RDMA Congestion Control: It Is Only for the Compliant

Journal Article IEEE Micro · January 1, 2023 Remote direct memory access (RDMA) networks enable low latency and low central processing unit utilization, and their widespread adoption in datacenters enables improved application performance. However, there are performance isolation concerns for RDMA de ...

Understanding RDMA Microarchitecture Resources for Performance Isolation

Conference Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023 · January 1, 2023 Recent years have witnessed the wide adoption of RDMA in the cloud to accelerate first-party workloads and achieve cost savings by freeing up CPU cycles. Now cloud providers are working towards supporting RDMA in general-purpose guest VMs to benefit third- ...

Remote Procedure Call as a Managed System Service

Conference Proceedings of the 20th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2023 · January 1, 2023 Remote Procedure Call (RPC) is a widely used abstraction for cloud computing. The programmer specifies type information for each remote procedure, and a compiler generates stub code linked into each application to marshal and unmarshal arguments into messa ...

An Online and Unified Algorithm for Projection Matrix Vector Multiplication with Application to Empirical Risk Minimization

Conference Proceedings of Machine Learning Research · January 1, 2023 Online matrix vector multiplication is a fundamental step and bottleneck in many machine learning algorithms. It is defined as follows: given a matrix at the pre-processing phase, at each iteration one receives a query vector and needs to form the matrix-v ...

Remote Direct Memory Introspection

Conference 32nd USENIX Security Symposium, USENIX Security 2023 · January 1, 2023 Hypervisors have played a critical role in cloud security, but they introduce a large trusted computing base (TCB) and incur a heavy performance tax. As of late, hypervisor offloading has become an emerging trend, where privileged functions are sunk into s ...

Bypass Exponential Time Preprocessing: Fast Neural Network Training via Weight-Data Correlation Preprocessing

Conference Advances in Neural Information Processing Systems · January 1, 2023 Over the last decade, deep neural networks have transformed our society, and they are already widely applied in various machine learning applications. State-of-the-art deep neural networks are becoming larger in size every year to deliver increasing model ...

Serving and Optimizing Machine Learning Workflows on Heterogeneous Infrastructures

Journal Article Proceedings of the VLDB Endowment · November 1, 2022 With the advent of ubiquitous deployment of smart devices and the Internet of Things, data sources for machine learning inference have increasingly moved to the edge of the network. Existing machine learning inference platforms typically assume a homogeneo ...

Fast Graph Neural Tangent Kernel via Kronecker Sketching

Conference Proceedings of the 36th AAAI Conference on Artificial Intelligence, AAAI 2022 · June 30, 2022 Many deep learning tasks have to deal with graphs (e.g., protein structures, social networks, source code abstract syntax trees). Due to the importance of these tasks, people turned to Graph Neural Networks (GNNs) as the de facto method for learning on gra ...

NetHint: White-Box Networking for Multi-Tenant Data Centers

Conference Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022 · January 1, 2022 A cloud provider today provides its network resources to its tenants as a black box, such that cloud tenants have little knowledge of the underlying network characteristics. Meanwhile, data-intensive applications have increasingly migrated to the cloud, an ...

Collie: Finding Performance Anomalies in RDMA Subsystems

Conference Proceedings of the 19th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2022 · January 1, 2022 High-speed RDMA networks are getting rapidly adopted in the industry for their low latency and reduced CPU overheads. To verify that RDMA can be used in production, system administrators need to understand the set of application workloads that can potentia ...

Alpa: Automating Inter- and Intra-Operator Parallelism for Distributed Deep Learning

Conference Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2022 · January 1, 2022 Alpa automates model-parallel training of large deep learning (DL) models by generating execution plans that unify data, operator, and pipeline parallelism. Existing model-parallel training systems either require users to manually create a parallelization ...

Adore: Differentially Oblivious Relational Database Operators

Journal Article Proceedings of the VLDB Endowment · January 1, 2022 There has been a recent effort in applying differential privacy on memory access patterns to enhance data privacy. This is called differential obliviousness. Differential obliviousness is a promising direction because it provides a principled trade-off bet ...

Adaptive and Dynamic Multi-Resolution Hashing for Pairwise Summations

Conference Proceedings - 2022 IEEE International Conference on Big Data, Big Data 2022 · January 1, 2022 In this paper, we propose Adam-Hash: an adaptive and dynamic multi-resolution hashing data-structure for fast pairwise summation estimation. Given a data-set X ⊂ ℝ^d, a binary function f : ℝ^d × ℝ^d → ℝ, and a point y ∈ ℝ^d, the Pairwise Summation Estimate PSE ...

Hoplite: Efficient and fault-tolerant collective communication for task-based distributed systems

Conference SIGCOMM 2021 - Proceedings of the ACM SIGCOMM 2021 Conference · August 9, 2021 Task-based distributed frameworks (e.g., Ray, Dask, Hydro) have become increasingly popular for distributed applications that contain asynchronous and dynamic workloads, including asynchronous gradient descent, reinforcement learning, and model serving. As ...

Differentially oblivious database joins: Overcoming the worst-case curse of fully oblivious algorithms

Conference Leibniz International Proceedings in Informatics, LIPIcs · July 1, 2021 Numerous high-profile works have shown that access patterns to even encrypted databases can leak secret information and sometimes even lead to reconstruction of the entire database. To thwart access pattern leakage, the literature has focused on oblivious ...

An incremental path towards a safer OS kernel

Conference HotOS 2021 - Proceedings of the 2021 Workshop on Hot Topics in Operating Systems · June 1, 2021 Linux has become the de-facto operating system of our age, but its vulnerabilities are a constant threat to service availability, user privacy, and data integrity. While one might scrap Linux and start over, the cost of that would be prohibitive due to Lin ...

High velocity kernel file systems with Bento

Conference Proceedings of the 19th USENIX Conference on File and Storage Technologies, FAST 2021 · January 1, 2021 High development velocity is critical for modern systems. This is especially true for Linux file systems which are seeing increased pressure from new storage devices and new demands on storage systems. However, high velocity Linux kernel development is cha ...

TeraPipe: Token-Level Pipeline Parallelism for Training Large-Scale Language Models

Conference Proceedings of Machine Learning Research · January 1, 2021 Model parallelism has become a necessity for training modern large-scale deep language models. In this work, we identify a new and orthogonal dimension from existing model parallel approaches: it is possible to perform pipeline parallelism within a single ...

Rearchitecting In-Memory Object Stores for Low Latency

Journal Article Proceedings of the VLDB Endowment · January 1, 2021 Low latency is increasingly critical for modern workloads, to the extent that compute functions are explicitly scheduled to be co-located with their in-memory object stores for faster access. However, the traditional object store architecture mandates that ...

On InstaHide, Phase Retrieval, and Sparse Matrix Factorization

Conference ICLR 2021 - 9th International Conference on Learning Representations · January 1, 2021 In this work, we examine the security of InstaHide, a scheme recently proposed by Huang et al. (2020b) for preserving the security of private datasets in the context of distributed learning. To generate a synthetic training example to be shared among the d ...

Gallium: Automated Software Middlebox Offloading to Programmable Switches

Conference SIGCOMM 2020 - Proceedings of the 2020 Annual Conference of the ACM Special Interest Group on Data Communication on the Applications, Technologies, Architectures, and Protocols for Computer Communication · July 30, 2020 Researchers have shown that offloading software middleboxes (e.g., NAT, firewall, load balancer) to programmable switches can yield orders-of-magnitude performance gains. However, it requires manually selecting the middle-box components to offload and rewr ...

Automated verification of customizable middlebox properties with Gravel

Conference Proceedings of the 17th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2020 · January 1, 2020 Building a formally-verified software middlebox is attractive for network reliability. In this paper, we explore the feasibility of verifying “almost unmodified” software middleboxes. Our key observation is that software middleboxes are already designed an ...

Ansor: Generating high-performance tensor programs for deep learning

Conference Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation, OSDI 2020 · January 1, 2020 High-performance tensor programs are crucial to guarantee efficient execution of deep neural networks. However, obtaining performant tensor programs for different operators on various hardware platforms is notoriously challenging. Currently, deep learning ...