AI Infra Daily Digest - 2026-04-22 Wednesday

📄 Featured Papers 7/7

ARGUS: Agentic GPU Optimization Guided by Data-Flow Invariants

Agent arXiv cs.DC · 04-22 12:00 CST · arxiv.org kw 13.0

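ARGUS uses data-flow invariants as compile-time specifications so that a coding agent can jointly optimize tiling, shared memory, and pipelining, compensating for the sparse pass/fail feedback that GPU kernel generation otherwise provides; it gives key operators such as MoE and attention a structured diagnostic signal.

LLM inference · GPU kernel · attention · kernel · compiler · GPU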

UniEP: Unified Expert-Parallel MoE MegaKernel for LLM Training

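Training arXiv cs.DC · 04-22 12:00 CST · arxiv.org kw 11.0

UniEP folds the communication compression and compute-communication overlap that MoE expert parallelism has so far handled piecemeal into a single megakernel while preserving numerical stability; the goal is production-grade integration into Megatron-LM rather than a pile of ad-hoc kernels.

expert parallel · Megatron-LM · veRL · kernel · GPU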

Ocean: Fast Estimation-Based Sparse General Matrix-Matrix Multiplication on GPU

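Inference arXiv cs.DC · 04-22 12:00 CST · arxiv.org kw 4.0

Ocean questions whether the symbolic pass in the two-pass SpGEMM workflow (28% of runtime) is necessary, replaces the exact symbolic phase with estimation, and speeds up sparse matrix multiplication on H100; directly relevant to sparse MoE and GNN kernels.

kernel · GPU · H100 · TPU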
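For context, the two-pass idea in the Ocean summary can be sketched in plain Python over CSR index arrays: an exact symbolic pass that counts the distinct output columns per row of C = A @ B, next to a cheap upper bound that only sums the lengths of the B rows touched. The upper bound is a textbook estimate chosen for illustration and is not Ocean's estimator; the function names and the tiny matrices are hypothetical.

```python
def symbolic_row_nnz(a_indptr, a_indices, b_indptr, b_indices):
    """Exact symbolic pass: distinct output columns per row of C = A @ B."""
    counts = []
    for i in range(len(a_indptr) - 1):
        cols = set()
        for p in range(a_indptr[i], a_indptr[i + 1]):
            k = a_indices[p]
            cols.update(b_indices[b_indptr[k]:b_indptr[k + 1]])
        counts.append(len(cols))
    return counts


def upper_bound_row_nnz(a_indptr, a_indices, b_indptr):
    """Estimate: sum of the touched B row lengths (never below the exact count)."""
    counts = []
    for i in range(len(a_indptr) - 1):
        total = 0
        for p in range(a_indptr[i], a_indptr[i + 1]):
            k = a_indices[p]
            total += b_indptr[k + 1] - b_indptr[k]
        counts.append(total)
    return counts


# Small 3x3 example in CSR form (structure only, values omitted).
A_INDPTR, A_INDICES = [0, 2, 3, 3], [0, 1, 2]
B_INDPTR, B_INDICES = [0, 2, 3, 5], [0, 1, 1, 0, 2]

exact = symbolic_row_nnz(A_INDPTR, A_INDICES, B_INDPTR, B_INDICES)
bound = upper_bound_row_nnz(A_INDPTR, A_INDICES, B_INDPTR)
```

The exact pass has to touch every column index of B, while the bound reads only row pointers; an estimation-based scheme trades some over-allocation for skipping that traversal.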

POLAR-PIC: A Holistic Framework for Matrixized PIC with Co-Designed Compute, Layout, and Communication

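Inference arXiv cs.DC · 04-22 12:00 CST · arxiv.org kw 4.0

POLAR-PIC targets Matrix Processing Units (MPUs): it recasts PIC field interpolation in outer-product form and uses a physically ordered particle layout to reduce irregular memory access; a reference pattern for co-design on MPU-class hardware (a close relative of PIM).

veRL · GPU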
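As a rough illustration of what "outer-product form" can mean for field interpolation, the sketch below assumes 2D linear (cloud-in-cell) weights, which are separable: the 2x2 weight matrix is the outer product of two 1D weight vectors. This is an assumption made for illustration and may differ from the paper's actual formulation; all names are hypothetical.

```python
def cic_weights_1d(frac):
    """Linear (cloud-in-cell) weights for the two nearest grid points."""
    return [1.0 - frac, frac]


def outer(u, v):
    """Outer product of two small vectors as a nested list."""
    return [[ui * vj for vj in v] for ui in u]


def gather(field, x, y):
    """Interpolate a 2D grid field at (x, y) using W = wx (outer) wy."""
    ix, iy = int(x), int(y)
    w = outer(cic_weights_1d(x - ix), cic_weights_1d(y - iy))
    return sum(w[a][b] * field[ix + a][iy + b]
               for a in range(2) for b in range(2))


# Linear test field f(i, j) = i + 2j on a 4x4 grid.
FIELD = [[float(i + 2 * j) for j in range(4)] for i in range(4)]
value = gather(FIELD, 1.25, 2.5)
```

Because bilinear weights reproduce a linear field exactly, `value` matches 1.25 + 2 * 2.5. Writing the weights as a small dense matrix is what makes the operation a candidate for matrix hardware.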

CASS: Nvidia to AMD Transpilation with Data, Models, and Benchmark

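Inference arXiv cs.AR · 04-22 12:00 CST · arxiv.org kw 4.0

CASS releases 60k verified host-device code pairs covering CUDA↔HIP and SASS↔RDNA3 and trains cross-architecture transpilation models on them, reaching 88.2% accuracy on CUDA→HIP and significantly outperforming GPT-5.1, Claude-4.5, and Hipify.

serving · compiler · GPU · CUDA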

YAIFS: Yet (not) Another Intelligent Fog Simulator: A Framework for Agent-Driven Computing Continuum Modeling & Simulation

Agent arXiv cs.DC · 04-22 12:00 CST · arxiv.org kw 2.0

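YAIFS uses MCP (Model Context Protocol) as the standard interaction layer between agents and the simulation environment, so heterogeneous agents can observe and control a distributed simulation through one interface; a useful reference for runtime-side MCP integration patterns.

MCP · Model Context Protocol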
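To make the "standard interaction layer" concrete: MCP messages are JSON-RPC 2.0, and a tool invocation uses the `tools/call` method. The sketch below builds such a request; the tool name `get_node_state` and its `node_id` argument are hypothetical and are not taken from YAIFS.

```python
import json


def make_tool_call(request_id, tool_name, arguments):
    """Build a JSON-RPC 2.0 request for MCP's tools/call method."""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }


# Hypothetical tool exposed by a simulator; the name and argument are made up.
request = make_tool_call(1, "get_node_state", {"node_id": "fog-3"})
wire = json.dumps(request)
```

Any agent that can emit this envelope could, in principle, drive the same simulator, which is the interoperability argument the summary makes.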

ChipLight: Cross-Layer Optimization of Chiplet Design with Optical Interconnects for LLM Training

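Training arXiv cs.AR · 04-22 12:00 CST · arxiv.org kw 0.0

ChipLight co-designs chiplets and optically interconnected clusters: high-bandwidth scale-up inside the package, optical links for scale-out between packages, with architecture, parallelism, and topology optimized jointly to address the communication bottleneck in large-scale LLM training.

🚀 Code Updates 8/8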

trunk/0274ad69c3effaef66b5776db5f752b6cf7d8154: [Inductor] Forward optimize_mem to combo kernel inductor_meta (#180790)

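Inference PyTorch · 04-22 09:41 CST · github.com kw 7.0

PyTorch Inductor fixes a bug where optimize_mem was not forwarded to combo kernels: cached_autotune previously defaulted it to True; it is now passed according to is_inference/is_backward, matching the behavior of standalone kernels.

MLA · triton · kernel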

trunk/89ed986a77847a4cec520920f6d27baa72102995

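Inference PyTorch · 04-22 19:00 CST · github.com kw 4.0

PyTorch Inductor switches the combo kernel jit_line to triton_meta_common() so that the disabled metadata is generated in one place, reducing divergence between the combo and standalone kernel paths for Triton meta generation.

triton · kernel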

trunk/9c406e3429630d5c45ba57d2fd987177e12bb676: Auto-generate fake kernels for Tag.out custom operators (#180987)

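Inference PyTorch · 04-22 19:47 CST · github.com kw 1.0

PyTorch now auto-generates fake kernels for Tag.out custom operators: the fake kernel returns the out= arguments in declaration order, so users no longer hand-write a meta kernel; convenient for anyone wiring custom operators into Inductor/torch.compile.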

kernel

langgraph==1.1.9

Agent LangGraph · 04-21 21:43 CST · github.com kw 1.0

LangGraph 1.1.9 fixes a bug: on a plain resume, ReplayState should not propagate into subgraphs, otherwise state leaks across them. It tightens the checkpoint/resume semantic boundary in the agent runtime.

LangGraph

trunk/50294ed45005ceb3b8669c68b408236e3f06d6b1: [MPS] Flatten 5D tensors to 4D in batch_norm for performance (#180335)

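Inference PyTorch · 04-22 11:08 CST · github.com kw 0.0

PyTorch's MPS backend reshapes the 5D tensors of BatchNorm3d to 4D before calling MPSGraph normalization; on an M4 Pro, fwd+bwd drops from 8.7ms to 3.5ms (a reported 2.4x). The change works around MPSGraph's poor handling of higher-dimensional tensors.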
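Why the reshape is semantically safe can be shown without any GPU: batch norm statistics are per channel, taken over the batch and all spatial positions, so merging spatial dimensions leaves them unchanged. The pure-Python sketch below works on a flat row-major buffer; merging D and H into one axis is an assumed layout for illustration and may not match the exact reshape in the PR.

```python
from math import prod


def channel_stats(data, shape):
    """Per-channel mean and biased variance for a flat row-major (N, C, *spatial) tensor."""
    n, c = shape[0], shape[1]
    spatial = prod(shape[2:])
    stats = []
    for ch in range(c):
        vals = []
        for b in range(n):
            start = (b * c + ch) * spatial
            vals.extend(data[start:start + spatial])
        mean = sum(vals) / len(vals)
        var = sum((v - mean) ** 2 for v in vals) / len(vals)
        stats.append((mean, var))
    return stats


DATA = [float((7 * i) % 11) for i in range(2 * 3 * 2 * 2 * 2)]
stats_5d = channel_stats(DATA, (2, 3, 2, 2, 2))  # (N, C, D, H, W)
stats_4d = channel_stats(DATA, (2, 3, 4, 2))     # (N, C, D*H, W), same memory
```

Both views select the same elements for each channel, so the normalization result is identical and only the kernel's view of the dimensionality changes.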

v0.14.4

Agent OpenAI Agents · 04-22 05:37 CST · github.com kw 0.0

OpenAI Agents SDK 0.14.4 adds BoxMount support and refactors the sandbox's temporary-mount lifecycle, tar exclude arguments, and session helpers; this directly affects sandbox runtime details for computer-use agents.

LLVM 22.1.4

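Inference LLVM/MLIR · 04-21 21:55 CST · github.com kw 0.0

LLVM 22.1.4 is out: a patch update to the compiler foundation that Triton, MLIR, and CUDA-Clang depend on downstream; if you work on GPU kernel DSLs, skim the release notes before upgrading as usual.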

Nightly Release v0.6.8-20260421

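Inference FlashInfer · 04-21 14:54 CST · github.com kw 0.0

FlashInfer v0.6.8 nightly: the attention kernel library used mainly by vLLM and SGLang keeps iterating; watch the performance regression baselines for the latest paged-KV, MLA, and FP8 paths.

📝 Tech Blogs 1/1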

ReasoningBank: Enabling agents to learn from experience

Agent Google Research · 04-22 00:42 CST · research.google kw 0.0

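Google proposes ReasoningBank, which distills an agent's successful and failed trajectories into retrievable reasoning memories so that strategies can be recalled directly the next time a similar task comes up; an engineering-oriented approach to agent memory.

💬 Community Discussions 3/3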
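The store-then-retrieve loop described above can be sketched in a few lines. This is a toy under stated assumptions, not Google's implementation: similarity here is plain token overlap (Jaccard) to keep the example self-contained, where a real system would more likely use embeddings, and every name and memory entry is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MemoryItem:
    task: str
    strategy: str
    succeeded: bool


def _tokens(text):
    return set(text.lower().split())


def retrieve(bank, query, top_k=1):
    """Rank stored items by Jaccard overlap between task descriptions."""
    q = _tokens(query)

    def score(item):
        t = _tokens(item.task)
        return len(q & t) / len(q | t) if q | t else 0.0

    return sorted(bank, key=score, reverse=True)[:top_k]


BANK = [
    MemoryItem("book a flight on a travel site", "filter by date before sorting by price", True),
    MemoryItem("fix failing unit test in python repo", "reproduce locally before editing code", True),
    MemoryItem("fix flaky unit test in python repo", "blind retries hid the real bug; avoid them", False),
]
hits = retrieve(BANK, "fix a failing unit test", top_k=2)
```

Keeping failed trajectories alongside successful ones means a recalled item can also act as a warning about what not to repeat.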

INT3 compression+fused metal kernels [R]

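Inference r/MachineLearning · 04-22 14:54 CST · www.reddit.com kw 9.0

An independent author released INT3 weight compression, an INT2 KV cache, and hand-written fused Metal kernels that run Qwen 7B on-device on M-series Macs; if you do on-device inference on Apple Silicon, watch for the follow-up Triton GPU version.

KV cache · triton · kernel · GPU · Qwen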
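For readers unfamiliar with 3-bit weights, a minimal sketch of symmetric per-group quantization to signed 3-bit codes follows. The scheme (one scale per group, codes in [-4, 3]) is a common baseline assumed here for illustration; the post's actual format, group size, and bit packing are not specified in this digest.

```python
def quantize_int3(weights):
    """Symmetric per-group quantization to signed 3-bit codes in [-4, 3]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 3 if max_abs > 0 else 1.0
    codes = [max(-4, min(3, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_int3(codes, scale):
    """Map integer codes back to approximate float weights."""
    return [c * scale for c in codes]


GROUP = [0.9, -0.3, 0.05, -0.6, 0.45, 0.0, -0.9, 0.3]
codes, scale = quantize_int3(GROUP)
restored = dequantize_int3(codes, scale)
max_err = max(abs(a - b) for a, b in zip(GROUP, restored))
```

With this choice of scale the rounding error per weight stays within half a quantization step; a fused kernel would typically dequantize on the fly inside the matmul instead of materializing `restored`.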

We open-sourced Chaperone-Thinking-LQ-1.0 — a 4-bit GPTQ + QLoRA fine-tuned DeepSeek-R1-32B that hits 84% on MedQA in ~20GB[N]

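Training r/MachineLearning · 04-22 04:07 CST · www.reddit.com kw 7.0

Chaperone-Thinking-LQ-1.0 applies 4-bit GPTQ, QAT calibration, and QLoRA medical fine-tuning to DeepSeek-R1-Distill-Qwen-32B, shrinking it from 60GB to 20GB while scoring 84% on MedQA; an example of a complete quantization training pipeline.

GPTQ · quantization · GPU · Qwen · DeepSeek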
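The 60GB to 20GB figure is roughly consistent with back-of-the-envelope weight arithmetic. The sketch below assumes a nominal 32e9 parameters and counts weight bits only; both are simplifying assumptions, not numbers from the post.

```python
def weight_gib(num_params, bits_per_weight):
    """Weight storage in GiB, ignoring scales, zero-points, and activations."""
    return num_params * bits_per_weight / 8 / 2**30


PARAMS = 32e9  # nominal "32B" parameter count, an assumption
fp16_gib = weight_gib(PARAMS, 16)
int4_gib = weight_gib(PARAMS, 4)
```

FP16 weights come to a little under 60 GiB and 4-bit weights to about 15 GiB; the remaining gap to the reported ~20GB would plausibly be quantization metadata, layers kept at higher precision, and runtime buffers, though the post summary does not break this down.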

Qwen3.6-35B becomes competitive with cloud models when paired with the right agent

Agent r/LocalLLaMA · 04-22 19:22 CST · www.reddit.com kw 2.0

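Empirical result: with the same 9B Qwen model, changing only the agent scaffold (harness) lifts the benchmark score from 19% to 45%; with the right harness, the 35B model enters the Polyglot top ten. The gap for local coding agents may come from harness mismatch rather than the model itself.

Qwen · LLaMA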