Automatically generating high-performance CUDA kernels that jointly orchestrate computation and communication for distributed LLM workloads.
Custom CUDA kernel development is essential for maximizing GPU utilization in large-scale distributed LLM training and inference, yet manually writing kernels that jointly orchestrate computation and communication remains labor-intensive and error-prone. CUCo is a training-free, agent-driven framework that automatically generates high-performance CUDA kernels that jointly orchestrate computation and communication. By co-optimizing these traditionally disjoint components, CUCo unlocks optimization opportunities unavailable to existing approaches, reducing end-to-end latency by up to 1.57x over host-driven baselines.
A structured, declarative set of communication primitives that grounds agent reasoning in valid collective semantics. Formalizes configuration across five dimensions: backend (GIN/LSA), placement, synchronization scope, issuer granularity, and chunk size. Provides backend-conditioned context including API documentation, strategy knowledge, and hardware properties.
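To make the five-dimensional configuration space concrete, the sketch below models it as a declarative structure an agent could enumerate and reason over. The field names, option values, and chunk sizes are illustrative assumptions, not CUCo's actual API; only the five dimensions themselves (backend, placement, synchronization scope, issuer granularity, chunk size) come from the description above.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical model of the five-dimensional communication configuration
# space described above. Option values are assumptions for illustration.
@dataclass(frozen=True)
class CommConfig:
    backend: str      # "GIN" or "LSA", as named in the text
    placement: str    # where communication logic lives (assumed options)
    sync_scope: str   # synchronization scope (assumed options)
    issuer: str       # issuer granularity (assumed options)
    chunk_bytes: int  # chunk size (assumed values)

def enumerate_configs():
    """Enumerate the candidate space an agent would ground its reasoning in."""
    backends = ["GIN", "LSA"]
    placements = ["fused", "separate_kernel"]
    scopes = ["warp", "block", "grid"]
    issuers = ["thread", "warp", "block"]
    chunks = [4096, 65536, 1 << 20]
    return [CommConfig(b, p, s, i, c)
            for b, p, s, i, c in product(backends, placements, scopes, issuers, chunks)]

configs = enumerate_configs()
print(len(configs))  # 2 * 2 * 3 * 3 * 3 = 108 candidate configurations
```

Making each configuration an immutable value object lets the framework attach backend-conditioned context (API documentation, strategy knowledge, hardware properties) to a specific point in the space, rather than to free-form text.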
A correctness-first pipeline that converts host-driven NCCL code into device-initiated (GIN/LSA) equivalents via CUDA analysis, host-to-device transformation through an LLM-judge loop, and evolve-block annotation.
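The correctness-first loop can be sketched as follows. The `transform` and `judge` callables stand in for LLM calls, and `run_tests` for numerical validation against the seed kernel; all names and the control flow are illustrative assumptions, not CUCo's actual interfaces.

```python
# Minimal sketch of a correctness-first host-to-device transformation loop,
# assuming hypothetical transform/judge/run_tests callables.
def fast_path(seed_kernel, transform, judge, run_tests, max_rounds=5):
    """Repeatedly rewrite host-driven NCCL code into a device-initiated
    equivalent, accepting only a candidate that passes both the LLM judge
    and correctness tests against the seed kernel's outputs."""
    candidate = seed_kernel
    feedback = None  # judge feedback steers the next transformation attempt
    for _ in range(max_rounds):
        candidate = transform(candidate, feedback)
        judged_ok, feedback = judge(candidate)
        if judged_ok and run_tests(candidate):
            return candidate  # correct device-initiated baseline for the slow path
    raise RuntimeError("no correct device-initiated transformation found")
```

The key design point is that the judge and the tests gate every candidate: optimization (the slow path) only ever starts from a verified device-initiated baseline.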
An LLM-driven evolutionary search that optimizes the fast-path baseline through island-based populations, phase-dependent explore/exploit mutation, and cascaded evaluation. Maintains a shared candidate database with meta-summarization to guide future mutations using both successful and failed designs.
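A rough sketch of such a search loop appears below: island populations evolve independently, mutation shifts from exploration to exploitation as generations pass, a cheap score gates the expensive benchmark (cascaded evaluation), and a shared database records evaluated candidates. Every name, parameter, and the migration schedule are assumptions for illustration; in CUCo the mutations would be LLM-proposed kernel edits and the evaluations compilation and benchmarking.

```python
import random

# Illustrative sketch of island-based evolutionary search with
# phase-dependent explore/exploit mutation and cascaded evaluation.
def evolve(baseline, mutate_explore, mutate_exploit, cheap_eval, full_eval,
           islands=4, pop=8, generations=20, explore_phase=0.5, seed=0):
    rng = random.Random(seed)
    populations = [[baseline] * pop for _ in range(islands)]
    database = []  # shared database of (candidate, full score)
    for gen in range(generations):
        # Explore broadly early, exploit refinements late.
        exploring = gen < generations * explore_phase
        mutate = mutate_explore if exploring else mutate_exploit
        for isl in range(islands):
            parent = max(populations[isl], key=cheap_eval)
            child = mutate(parent, rng)
            # Cascaded evaluation: cheap filter before the expensive benchmark.
            if cheap_eval(child) >= cheap_eval(parent):
                database.append((child, full_eval(child)))
                populations[isl].append(child)
                populations[isl] = sorted(populations[isl], key=cheap_eval)[-pop:]
        # Periodic migration of the global best candidate across islands.
        if database and gen % 5 == 4:
            best = max(database, key=lambda kv: kv[1])[0]
            for population in populations:
                population.append(best)
    return max(database, key=lambda kv: kv[1])[0] if database else baseline
```

In CUCo the database additionally feeds meta-summarization, so both accepted and rejected candidates (omitted above for brevity) inform the prompts that generate future mutations.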
Figure 2. Overall workflow of CUCo: the user-provided seed kernel flows through the fast-path agent (correctness), then the slow-path agent (optimization).
Figure 3. Flash Attention with context parallelism across sequence lengths (4-GPU NVLink).
Figure 4. DeepSeek-V3 MoE layer with expert skewness across inter-node RoCE links.
Figure 5. Intra-node KV cache transfer latency across sequence lengths and KV dimensions.
Figure 6. Intra- and inter-node GEMM + AllGather latency across square matrix sizes.
@misc{hu2026cucoagenticframeworkcompute,
  title={CUCo: An Agentic Framework for Compute and Communication Co-design},
  author={Bodun Hu and Yoga Sri Varshan V and Saurabh Agarwal and Aditya Akella},
  year={2026},
  eprint={2603.02376},
  archivePrefix={arXiv},
  primaryClass={cs.DC},
  url={https://arxiv.org/abs/2603.02376}
}