arXiv:2603.02376

CUCo: An Agentic Framework for Compute and Communication Co-design

Automatically generating high-performance CUDA kernels that jointly orchestrate computation and communication for distributed LLM workloads.

University of Texas at Austin  —  *Equal contribution

Abstract

Custom CUDA kernels are essential for maximizing GPU utilization in large-scale distributed LLM training and inference, yet hand-writing kernels that jointly orchestrate computation and communication remains labor-intensive and error-prone. CUCo is a training-free, agent-driven framework that generates such kernels automatically. By co-optimizing these traditionally disjoint components, CUCo unlocks optimization opportunities unavailable to existing approaches and reduces end-to-end latency by up to 1.57× relative to host-driven baselines.

Framework

1. Design Space Specification

A structured, declarative set of communication primitives that grounds agent reasoning in valid collective semantics. Formalizes configuration across five dimensions: backend (GIN/LSA), placement, synchronization scope, issuer granularity, and chunk size. Provides backend-conditioned context including API documentation, strategy knowledge, and hardware properties.

GIN / LSA backends · 5-dimensional space · Hardware-aware
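The five-dimensional space described above can be pictured as a small typed configuration record. The sketch below is illustrative only: the two backend names come from the text, but every other value (the placement, synchronization-scope, and issuer options, and the chunk sizes) is a hypothetical placeholder, as are the names `CommConfig`, `enumerate_space`, and `describe`. It is not CUCo's actual specification format.

```python
from dataclasses import dataclass
from enum import Enum
from itertools import product


class Backend(Enum):
    # The two backends named in the text.
    GIN = "gin"
    LSA = "lsa"


# Hypothetical option sets for the remaining four dimensions (assumptions,
# not taken from the paper).
PLACEMENTS = ("before_compute", "fused_in_compute", "after_compute")
SYNC_SCOPES = ("thread", "warp", "block", "grid")
ISSUERS = ("thread", "warp", "block")
CHUNK_BYTES = (4096, 65536, 1048576)


@dataclass(frozen=True)
class CommConfig:
    """One point in the five-dimensional design space."""
    backend: Backend
    placement: str
    sync_scope: str
    issuer: str
    chunk_bytes: int


def enumerate_space():
    """List every combination of the option sets above."""
    combos = product(Backend, PLACEMENTS, SYNC_SCOPES, ISSUERS, CHUNK_BYTES)
    return [CommConfig(*combo) for combo in combos]


def describe(cfg):
    """Render a configuration as one line of text, e.g. for an agent prompt."""
    return (
        f"backend={cfg.backend.name}, placement={cfg.placement}, "
        f"sync_scope={cfg.sync_scope}, issuer={cfg.issuer}, "
        f"chunk_bytes={cfg.chunk_bytes}"
    )
```

Keeping the space declarative means an agent can only propose configurations that type-check against it, which is one plausible way to ground its reasoning in valid choices.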

2. Fast-Path Agent

A correctness-first pipeline that converts host-driven NCCL code into device-initiated (GIN/LSA) equivalents via CUDA analysis, host-to-device transformation through an LLM-judge loop, and evolve-block annotation.

CUDA analysis · Host-to-device · LLM-judge loop · Evolve-block annotation
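The LLM-judge loop mentioned above follows a familiar generate, judge, retry pattern. The following is a minimal sketch of that control flow, assuming the transformer and the judge are supplied as plain callables; `judge_loop`, `toy_transform`, and `toy_judge` are hypothetical names, and the toy stand-ins do not call any model or compiler.

```python
def judge_loop(source, transform, judge, max_rounds=5):
    """Repeat transform -> judge until the judge accepts or rounds run out.

    transform(source, feedback) returns a candidate rewrite.
    judge(source, candidate) returns (accepted, feedback).
    Returns (candidate, rounds_used), or (None, max_rounds) on failure.
    """
    feedback = ""
    for round_idx in range(1, max_rounds + 1):
        candidate = transform(source, feedback)
        accepted, feedback = judge(source, candidate)
        if accepted:
            return candidate, round_idx
    return None, max_rounds


def toy_transform(source, feedback):
    # Stand-in for an LLM rewrite: the first attempt leaves the code
    # unchanged, later attempts act on the judge's feedback.
    return "device:" + source if feedback else source


def toy_judge(source, candidate):
    # Stand-in for the judge: accept only candidates marked device-initiated.
    if candidate.startswith("device:"):
        return True, ""
    return False, "communication is still host-driven"
```

Calling `judge_loop("kernel", toy_transform, toy_judge)` accepts on the second round, after the first candidate is rejected and its feedback is passed back to the transformer.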

3. Slow-Path Agent

An LLM-driven evolutionary search that optimizes the fast-path baseline through island-based populations, phase-dependent explore/exploit mutation, and cascaded evaluation. Maintains a shared candidate database with meta-summarization to guide future mutations using both successful and failed designs.

Island-based evolution · Explore / exploit phases · Cascaded evaluation · Meta-summarization
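To make the search loop concrete, here is a toy sketch of island-based evolution with an explore phase followed by an exploit phase, a two-stage (cascaded) evaluator, and a database that records failures as well as successes. Everything in it is an assumption made for illustration: candidates are single floats rather than kernels, mutation is random noise rather than an LLM edit, the latency function is synthetic, the migration rule is invented, and `summarize` is a trivial stand-in for meta-summarization.

```python
import random


def cascaded_eval(x):
    """Stage 1 is a cheap validity check; stage 2 runs only if it passes."""
    if not 0.0 <= x <= 10.0:          # stands in for compile/correctness checks
        return None
    return (x - 3.0) ** 2 + 1.0       # stands in for a measured latency


def evolve(seed, islands=3, generations=40, explore_frac=0.5, rng_seed=0):
    """Minimize synthetic latency starting from a valid seed candidate."""
    seed_latency = cascaded_eval(seed)
    if seed_latency is None:
        raise ValueError("seed candidate must pass the validity check")
    rng = random.Random(rng_seed)
    database = []                                  # successes and failures
    best = [(seed, seed_latency)] * islands        # best candidate per island
    for gen in range(generations):
        # Phase-dependent mutation: large steps early, small steps late.
        exploring = gen < explore_frac * generations
        step = 2.0 if exploring else 0.2
        for i in range(islands):
            parent, parent_latency = best[i]
            child = parent + rng.uniform(-step, step)
            latency = cascaded_eval(child)
            database.append((child, latency))
            if latency is not None and latency < parent_latency:
                best[i] = (child, latency)
        # Hypothetical migration rule: every 10 generations the weakest
        # island adopts the globally best candidate.
        if (gen + 1) % 10 == 0:
            global_best = min(best, key=lambda pair: pair[1])
            weakest = max(range(islands), key=lambda i: best[i][1])
            best[weakest] = global_best
    return min(best, key=lambda pair: pair[1]), database


def summarize(database):
    """Count valid and failed candidates recorded so far."""
    failed = sum(1 for _, latency in database if latency is None)
    return {"ok": len(database) - failed, "failed": failed}
```

A call such as `evolve(8.0)` returns the best `(candidate, latency)` pair together with the full database, which `summarize` can then condense. Because a child replaces its parent only when it is strictly faster, the returned latency is never worse than the seed's.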

Overview

Figure 2. Overall workflow of CUCo: a user-provided seed kernel flows through the fast-path agent (correctness) and then the slow-path agent (optimization).

Performance

Figure 3. Flash Attention + Context Parallelism across sequence lengths (4-GPU NVLink).

Figure 4. DeepSeek-V3 MoE layer with expert skewness across inter-node RoCE links.

Figure 5. Intra-node KV cache transfer latency across sequence lengths and KV dimensions.

Figure 6. Intra- and inter-node GEMM + AllGather latency across square matrix sizes.

Citation

@misc{hu2026cucoagenticframeworkcompute,
      title={CUCo: An Agentic Framework for Compute and Communication Co-design},
      author={Bodun Hu and Yoga Sri Varshan V and Saurabh Agarwal and Aditya Akella},
      year={2026},
      eprint={2603.02376},
      archivePrefix={arXiv},
      primaryClass={cs.DC},
      url={https://arxiv.org/abs/2603.02376},
}