[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-AmberLJC--LLMSys-PaperList":3,"tool-AmberLJC--LLMSys-PaperList":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",144730,2,"2026-04-07T23:26:32",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107888,"2026-04-06T11:32:50",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 
效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":77,"owner_email":78,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":78,"stars":81,"forks":82,"last_commit_at":83,"license":78,"difficulty_score":84,"env_os":85,"env_gpu":86,"env_ram":86,"env_deps":87,"category_tags":90,"github_topics":78,"view_count":32,"oss_zip_url":78,"oss_zip_packed_at":78,"status":17,"created_at":92,"updated_at":93,"faqs":94,"releases":126},5502,"AmberLJC\u002FLLMSys-PaperList","LLMSys-PaperList","Large Language Model (LLM) Systems Paper List","LLMSys-PaperList 是一个专注于大语言模型（LLM）系统领域的学术资源聚合库。它精心整理并持续更新了大量关于 LLM 训练、推理服务、多模态系统及工业界技术报告的高质量论文、教程、幻灯片和项目链接。\n\n在大模型技术飞速迭代的背景下，研究人员和开发者往往难以从海量文献中快速定位到与“系统架构”相关的核心成果。LLMSys-PaperList 有效解决了这一信息过载与检索困难的问题，通过清晰的分类体系（如预训练优化、容错机制、边缘端部署、智能体系统等），帮助用户高效追踪该前沿领域的最新进展。\n\n这份清单特别适合 AI 系统研究人员、底层框架开发者以及对大模型工程化落地感兴趣的技术人员使用。其独特亮点在于不仅涵盖了经典的训练并行策略（如 Megatron-LM），还深入收录了针对异构集群调度、混合专家模型（MoE）训练效率以及碳足迹优化等前沿议题的顶会论文（如 SOSP、NSDI、EuroSys 等）。无论是希望复现先进算法的工程师，还是寻求选题灵感的学者，都能从中获得极具价值的参考指引，是探索 LLM 系统全栈技术不可或缺的导航图。","# Awesome LLM Systems Papers\n\nA curated list of Large Language Model systems related academic papers, articles, tutorials, slides and projects. 
Star this repository to keep abreast of the latest developments in this booming research field.\n## Table of Contents\n\n- [LLM Systems](#llm-systems)\n  - [Training](#training)\n    - [Pre-training](#pre-training)\n    - [Post Training](#systems-for-post-training--rlhf)\n    - [Fault Tolerance \u002F Straggler Mitigation](#fault-tolerance--straggler-mitigation)\n  - [Serving](#serving)\n    - [LLM serving](#llm-serving)\n    - [Agent Systems](#agent-systems)\n    - [Serving at the edge](#serving-at-the-edge)\n    - [System Efficiency Optimization - Model Co-design](#system-efficiency-optimization---model-co-design)\n  - [Multi-Modal Training Systems](#multi-modal-training-systems)\n  - [Multi-Modal Serving Systems](#multi-modal-serving-systems)\n- [LLM for Systems](#llm-for-systems)\n- [Industrial LLM Technical Report](#industrial-llm-technical-report)\n- [ML Conferences](#ml-conferences)\n  - [NeurIPS 2025](#neurips-2025)\n- [LLM Frameworks](#llm-frameworks)\n  - [Training](#training-1)\n  - [Post-Training](#post-training)\n  - [Serving](#serving-1)\n- [ML Systems](#ml-systems)\n- [Survey Paper](#survey-paper)\n- [LLM Benchmark \u002F Leaderboard \u002F Traces](#llm-benchmark--leaderboard--traces)\n- [Related ML Readings](#related-ml-readings)\n- [MLSys Courses](#mlsys-courses)\n- [Other Reading](#other-reading)\n\n\n## LLM Systems\n### Training\n#### Pre-training\n- [Megatron-LM](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1909.08053.pdf): Training Multi-Billion Parameter Language Models Using Model Parallelism\n- [Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2104.04473.pdf)\n- [Reducing Activation Recomputation in Large Transformer Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05198.pdf)\n- [Optimized Network Architectures for Large Language Model Training with Billions of Parameters](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.12169.pdf) | MIT\n- [Carbon Emissions and Large Neural Network Training](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2104.10350.pdf?fbclid=IwAR2o0_3HCtTnMxKbXka0OPrHzl8sCzQSSOYp0AOav76-zVWl_pYek2jX8Pk) | Google, UCB\n- [Perseus](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06902v1): Removing Energy Bloat from Large Model Training | SOSP' 24\n- [MegaScale](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.15627): Scaling Large Language Model Training to More Than 10,000 GPUs | ByteDance\n- [DISTMM](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fnsdi24\u002Fpresentation\u002Fhuang): Accelerating distributed multimodal model training | NSDI' 24\n- [A Codesign of Scheduling and Parallelization for Large Model Training in Heterogeneous Clusters](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.16125)\n- [Pipeline Parallelism with Controllable Memory](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.15362) | Sea AI Lab\n- [Boosting Large-scale Parallel Training Efficiency with C4](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.04594): A Communication-Driven Approach\n- [Scaling Beyond the GPU Memory Limit for Large Mixture-of-Experts Model Training](https:\u002F\u002Fopenreview.net\u002Fpdf?id=uLpyWQPyF9) | ICML' 24\n- [Alibaba HPN:](https:\u002F\u002Fennanzhai.github.io\u002Fpub\u002Fsigcomm24-hpn.pdf) A Data Center Network for Large Language Model Training\n- [The Llama 3 Herd of Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21783) (Section 3)\n- Enabling Parallelism Hot Switching for Efficient Training of Large Language Models | SOSP' 24\n- [Revisiting
Reliability in Large-Scale Machine Learning Research Clusters](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.21680)\n- [ScheMoE](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3627703.3650083): An Extensible Mixture-of-Experts Distributed Training System with Tasks Scheduling | EuroSys '24\n- [DynaPipe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.10418): Optimizing Multi-task Training through Dynamic Pipelines | EuroSys '24\n- [HAP](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3627703.3650074): SPMD DNN Training on Heterogeneous GPU Clusters with Automated Program Synthesis | EuroSys'24\n- [Demystifying Workload Imbalances in Large Transformer Model Training over Variable-length Sequences](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.07894) | PKU\n- [Improving training time and GPU utilization in geo-distributed language model training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14458)\n- [DeepSeek-V3 Technical Report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.19437)\n- [Comet](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.19811): Fine-grained Computation-communication Overlapping for Mixture-of-Experts | ByteDance\n- [ByteScale](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.21231): Efficient Scaling of LLM Training with a 2048K Context Length on More Than 12,000 GPUs | ByteDance\n- [Megalodon](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.08801): Efficient LLM Pretraining and Inference with Unlimited Context Length\n- [SPPO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.10377): Efficient Long-sequence LLM Training via Adaptive Sequence Pipeline Parallel Offloading\n- [TileLink: Generating Efficient Compute-Communication Overlapping Kernels using Tile-Centric Primitives](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.20313) | MLSys' 25\n- [Every FLOP Counts](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.05139): Scaling a 300B Mixture-of-Experts LING LLM without Premium GPUs | Ant Group\n- [FlexSP](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3676641.3715998): Accelerating Large Language Model Training via Flexible Sequence Parallelism | ASPLOS '25\n- [WeiPipe](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3710848.3710869): Weight Pipeline Parallelism for Communication-Effective Long-Context Large Model Training | PPoPP' 25\n- [WLB-LLM: Workload-Balanced 4D Parallelism for Large Language Model Training](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.17924) | OSDI' 25\n- [Mixtera](https:\u002F\u002Fmboether.com\u002Fassets\u002Fpdf\u002Fbother2025mixtera.pdf): A Data Plane for Foundation Model Training | ETH\n- [Flex Attention](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.05496): A Programming Model for Generating Optimized Attention Kernels | MLSys' 25\n- [Balancing Pipeline Parallelism with Vocabulary Parallelism](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.05288) | MLSys' 25\n- [SlimPipe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.14519): Memory-Thrifty and Efficient Pipeline Parallelism for Long-Context LLM Training | Kuaishou\n- [Scaling Llama 3 Training with Efficient Parallelism Strategies](https:\u002F\u002Faisystemcodesign.github.io\u002Fpapers\u002FLlama3-ISCA25.pdf) | ISCA' 25\n- [Lumos](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.09307): Efficient Performance Modeling and Estimation for Large-scale LLM Training | MLSys' 25\n- [BurstEngine](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.19836): an Efficient Distributed Framework for Training Transformers on Extremely Long Sequences of over
1M Tokens\n- [Zeppelin](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.21841): Balancing Variable-length Workloads in Data Parallel Large Model Training\n- [Robust LLM Training Infrastructure at ByteDance](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [Sailor: Automating Distributed Training over Dynamic, Heterogeneous, and Geo-distributed Clusters](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [Tempo: Compiled Dynamic Deep Learning with Symbolic Dependence Graphs](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [Mycroft: Tracing Dependencies in Collective Communication Towards Reliable LLM Training](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [DCP: Addressing Input Dynamism In Long-Context Training via Dynamic Context Parallelism](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [TrainVerify: Equivalence-Based Verification for Distributed LLM Training](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [Collective Communication for 100k+ GPUs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.20171): Large-scale collective communication optimization for massive GPU clusters\n- [RDMA Point-to-Point Communication for LLM Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.27656): RDMA-based point-to-point communication optimization for distributed LLM systems\n- [MoEBlaze](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.05296): Breaking the Memory Wall for Efficient MoE Training on Modern GPUs\n- [Kareus](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.17654): Joint Reduction of Dynamic and Static Energy in Large Model Training\n- [AXLearn](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.05411): Modular Large Model Training on Heterogeneous Infrastructure | MLSys' 26\n- [MoSE](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.06154): Mixture of Slimmable Experts for Efficient and Adaptive Language Models\n\n\n#### Systems for Post-training \u002F RLHF \n- [Ymir:](https:\u002F\u002Ftianweiz07.github.io\u002FPapers\u002F24-ics-2.pdf) A Scheduler for Foundation Model Fine-tuning Workloads in Datacenters | ICS' 24\n- [RLHFuse](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.13221): Efficient RLHF Training for Large Language Models with Inter- and Intra-Stage Fusion | NSDI'25\n- [HybridFlow](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.19256): A Flexible and Efficient RLHF Framework\n- [ReaLHF](https:\u002F\u002Farxiv.org\u002Fhtml\u002F2406.14088v1): Optimized RLHF Training for Large Language Models through Parameter Reallocation\n- [NeMo-Aligner](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.01481): Scalable Toolkit for Efficient Model Alignment | Nvidia\n- [An Adaptive Placement and Parallelism Framework for Accelerating RLHF Training](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.11819) | Ant\n- [Systems Opportunities for LLM Fine-Tuning using Reinforcement Learning](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3721146.3721944)\n- [AReaL](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.24298): A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning | [Code](https:\u002F\u002Fgithub.com\u002FinclusionAI\u002FAReaL) | Ant\n- [StreamRL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15930): Scalable, 
Heterogeneous, and Elastic RL for LLMs with Disaggregated Stream Generation\n- [RL-Factory](https:\u002F\u002Fgithub.com\u002FSimple-Efficient\u002FRL-Factory): Train your Agent model via our easy and efficient framework\n- [PLoRA](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.02932): Efficient LoRA Hyperparameter Tuning for Large Models\n- [History Rhymes](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.18588): Accelerating LLM Reinforcement Learning with RhymeRL\n- [APRIL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.18521): Active Partial Rollouts in Reinforcement Learning to tame long-tail generation\n- [Laminar](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.12633): A Scalable Asynchronous RL Post-Training Framework\n- [Seer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.14617): Online Context Learning for Fast Synchronous LLM Reinforcement Learning\n- [SkyRL-Agent](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.16108): Efficient RL Training for Multi-turn LLM Agent\n\n#### Fault Tolerance \u002F Straggler Mitigation\n- [Oobleck:](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.08125) Resilient Distributed Training of Large Models Using Pipeline Templates | SOSP' 23\n- [FALCON](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12588): Pinpointing and Mitigating Stragglers for Large-Scale Hybrid-Parallel Training\n- [Malleus](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13333): Straggler-Resilient Hybrid Parallel Training of Large-scale Models via Malleable Data and Model Parallelization\n- [Fire-Flyer AI-HPC: A Cost-Effective Software-Hardware Co-Design for Deep Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.14158) | DeepSeek SC' 24\n- [Lazarus: Resilient and Elastic Training of Mixture-of-Experts Models with Adaptive Expert Placement](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.04656)\n- [GEMINI:](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3600006.3613145) Fast Failure Recovery in Distributed Training with In-Memory Checkpoints\n- [ByteCheckpoint:](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.20143) A Unified Checkpointing System for LLM Development\n- [ReCycle](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.14009): Resilient Training of Large DNNs using Pipeline Adaptation | SOSP' 24\n- [Minder](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.01791): Faulty Machine Detection for Large-scale Distributed Model Training | THU\n- [The Streaming Batch Model for Efficient and Fault-Tolerant Heterogeneous Execution](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.12407)  \n- [TrainMover](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.12636): Efficient ML Training Live Migration with No Memory Overhead | Alibaba\n- [Characterizing GPU Resilience and Impact on AI\u002FHPC Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11901) | UIUC\n- [Understanding Stragglers in Large Model Training Using What-if Analysis](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.05713) | OSDI' 25\n- [GoCkpt](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.07035): Gradient-Assisted Multi-Step Overlapped Checkpointing for Efficient LLM Training | PPoPP' 26\n- [BitSnap](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.12376): Checkpoint Sparsification and Quantization in LLM Training\n\n\n### Serving\n#### LLM serving\n- [Orca](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fosdi22\u002Fpresentation\u002Fyu): A Distributed Serving System for Transformer-Based Generative Models | OSDI'22\n- [Response Length Perception and Sequence Scheduling: An LLM-Empowered LLM Inference 
Pipeline](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13144) | NUS\n- [Efficiently Scaling Transformer Inference](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05102.pdf) | MLSys' 23\n- [Flover](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.13484.pdf): A Temporal Fusion Framework for Efficient Autoregressive Model Parallel Inference\n- [FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14135.pdf)\n- [FlashAttention-3:](https:\u002F\u002Ftridao.me\u002Fblog\u002F2024\u002Fflash3\u002F) Fast and Accurate Attention with Asynchrony and Low-precision\n- [SageAttention](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.02367): Accurate 8-Bit Attention for Plug-and-play Inference Acceleration | ICLR 2025\n- [SageAttention2](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.10958): Efficient Attention with Thorough Outlier Smoothing and Per-thread INT4 Quantization | ICML 2025\n- [SageAttention3](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.11594): SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training | NeurIPS 2025 spotlight\n- [SageAttention2++](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.21136): SageAttention2++: A More Efficient Implementation of SageAttention2 | ICML ES-FoMo Workshop 2025\n- [DeepSpeed Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.00032): Enabling Efficient Inference of Transformer Models at Unprecedented Scale.\n- [TurboTransformers](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2010.05680.pdf): An Efficient GPU Serving System For Transformer Models\n- [FlexGen](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.06865): High-throughput Generative Inference of Large Language Models with a Single GPU | ICML' 23\n- [MPCFormer](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01452.pdf): fast, performant, and private transformer inference with MPC | ICLR'23\n- [POLCA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.12908): Power Oversubscription in LLM Cloud Providers | Microsoft\n- [SARATHI](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.16369): Efficient LLM Inference by Piggybacking Decodes with Chunked Prefills | Microsoft\n- [AttMemo](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09262.pdf): Accelerating Self-Attention with Memoization on Big Memory Systems\n- [vLLM](https:\u002F\u002Fvllm.ai\u002F): Easy, Fast, and Cheap LLM Serving with PagedAttention | SOSP' 23\n- [Tabi](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3552326.3587438): An Efficient Multi-Level Inference System for Large Language Models | EuroSys' 23\n- [Flash-LLM](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.10285v1.pdf): Enabling Cost-Effective and Highly-Efficient Large Generative Model Inference with Unstructured Sparsity | VLDB' 24\n- [AutoGen](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.08155): Enabling Next-Gen LLM Applications via Multi-Agent Conversation | Microsoft\n- [FlashDecoding++](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.01282.pdf): Faster Large Language Model Inference on GPUs | Tsinghua\n- [DeepSpeed-MII](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed-MII): Model Implementations for Inference (MII) | Microsoft\n- [Punica](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.18547): Multi-Tenant LoRA Serving | MLSys' 24\n- [S-LoRA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.03285): Serving Thousands of Concurrent LoRA Adapters | MLSys' 24\n- [SpotServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.15566): Serving Generative
Large Language Models on Preemptible Instances | CMU\n- [SuperServe:](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.16733.pdf) Fine-Grained Inference Serving for Unpredictable Workloads\n- [Fairness in Serving Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.00588) | OSDI' 24\n- [Infinite-LLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.02669): Efficient LLM Service for Long Context with DistAttention and Distributed KVCache\n- [CaraServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.11240): CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference\n- [DistServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.09670): Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | OSDI' 24\n- [Inference without Interference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.11181): Disaggregate LLM Inference for Mixed Downstream Workloads\n- [APIServe](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.01869.pdf): Efficient API Support for Large-Language Model Inferencing\n- [FlexLLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.18789): A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning\n- [DéjàVu](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.01876): KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving\n- [Optimizing LLM Queries in Relational Workloads](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05821) | UCB\n- [AttentionStore:](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.19708.pdf) Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving | NUS\n- [MuxServe:](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.02015) Flexible Multiplexing for Efficient Multiple LLM Serving\n- [LoongServe:](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.09526.pdf) Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism | SOSP' 24\n- [RAGCache:](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.12457) Efficient Knowledge Caching for Retrieval-Augmented Generation | PKU\n- [Andes:](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.16283) Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services | Umich\n- [BlockLLM:](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.18322) Multi-tenant Finer-grained Serving for Large Language Models\n- [vAttention:](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.04437) Dynamic Memory Management for Serving LLMs without PagedAttention\n- [Helix](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.01566): Distributed Serving of Large Language Models via Max-Flow on Heterogeneous GPUs | CMU\n- [Eloquent](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.12961v2): A More Robust Transmission Scheme for LLM Token Streaming | NAIC' 24\n- [Optimizing Speculative Decoding for Serving Large Language Models Using Goodput](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.14066v1) | UCB\n- [Enabling Elastic Model Serving with MultiWorld](https:\u002F\u002Farxiv.org\u002Fhtml\u002F2407.08980v1) | Cisco Research\n- [Prepacking](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.09529): A Simple Method for Fast Prefilling and Increased Throughput in Large Language Models\n- [NanoFlow](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.12757): Towards Optimal Large Language Model Serving Throughput\n- [Responsive ML inference in multi-tenanted environments using AQUA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21255)\n- [One Queue Is All You Need](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.00047):
Resolving Head-of-Line Blocking in Large Language Model Serving\n- [MemServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.17565): Context Caching for Disaggregated LLM Serving with Elastic Memory Pool\n- [dLoRA: Dynamically Orchestrating Requests and Adapters for LoRA LLM Serving](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fosdi24\u002Fpresentation\u002Fwu-bingyang) | OSDI' 24\n- [Llumnix](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fosdi24\u002Fpresentation\u002Fsun-biao): Dynamic Scheduling for Large Language Model Serving | OSDI' 24\n- [Taming Throughput-Latency Tradeoff in LLM Inference with Sarathi-Serve](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fosdi24\u002Fpresentation\u002Fagrawal) | OSDI' 24\n- [InfiniGen](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fosdi24\u002Fpresentation\u002Flee): Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management\n- [ServerlessLLM: Low-Latency Serverless Inference for Large Language Models](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fosdi24\u002Fpresentation\u002Ffu) | OSDI' 24\n- [CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3651890.3672274) | SIGCOMM' 24\n- [Preble](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.00023): Efficient Distributed Prompt Scheduling for LLM Serving\n- [Mnemosyne](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.17264): Parallelization Strategies for Efficiently Serving Multi-Million Context Length LLM Inference Requests Without Approximations\n- [ConServe](https:\u002F\u002Farxiv.org\u002Fhtml\u002F2410.01228v1): Harvesting GPUs for Low-Latency and High-Throughput Large Language Model Serving\n- [Context Parallelism for Scalable Million-Token Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.01783)\n- [Pie](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.09317): Pooling CPU Memory for LLM Inference\n- [NEO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.01142): Saving GPU Memory Crisis with CPU Offloading for Online LLM Inference\n- [FastSwitch](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18424): Optimizing Context Switching Efficiency in Fairness-aware Large Language Model Serving\n- [Flash Communication](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04964): Reducing Tensor Parallelization Bottleneck for Fast Large Language Model Inference\n- [FlashInfer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.01005): Efficient and Customizable Attention Engine for LLM Inference Serving\n- [Fast Inference for Augmented Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18248)\n- [A System for Microserving of LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.12488) | CMU\n- [iServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.13111): An Intent-based Serving System for LLMs | UT Austin\n- [Locality-aware Fair Scheduling in LLM Serving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.14312) | UCB\n- [Towards Efficient Large Multimodal Model Serving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.00937) | MSFT\n- [DeltaZip: Efficient Serving of Multiple Full-Model-Tuned LLMs](https:\u002F\u002Fanakli.inf.ethz.ch\u002Fpapers\u002Fdeltazip.pdf)\n- [PIM Is All You Need](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.07578): A CXL-Enabled GPU-Free System for Large Language Model Inference | ASPLOS'
25\n- [λScale](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.09922): Enabling Fast Scaling for Serverless Large Language Model Inference\n- [AIBrix: Towards Scalable and Cost-Effective LLM Inference Infrastructure](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Faibrix\u002Fblob\u002Fmain\u002Fdocs\u002Fpaper\u002FAIBrix_White_Paper_0219_2025.pdf) | vLLM\n- [Serving Models, Fast and Slow: Optimizing Heterogeneous LLM Inferencing Workloads at Scale](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14617)\n- [Make LLM Inference Affordable to Everyone: Augmenting GPU Memory with NDP-DIMM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.16963)\n- [Jenga](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.18292): Effective Memory Management for Serving LLM with Heterogeneity\n- [AQUA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21255): Network-Accelerated Memory Offloading for LLMs in Scale-Up GPU Domains | ASPLOS 2025\n- [MegaScale-Infer](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.02263): Serving Mixture-of-Experts at Scale with Disaggregated Expert Parallelism | ByteDance\n- [Towards End-to-End Optimization of LLM-based Applications with Ayo](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3676641.3716278) | ASPLOS '25\n- [CacheBlend](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3689031.3696098): Fast Large Language Model Serving for RAG with Cached Knowledge Fusion | EuroSys' 25 (Best Paper)\n- [ThunderServe](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.09334): High-performance and Cost-efficient LLM Serving in Cloud Environments | MLSys' 25\n- [SLOs-Serve](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.08784): Optimized Serving of Multi-SLO LLMs\n- [Tempo](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.20068): Application-aware LLM Serving with Mixed SLO Requirements\n- [Hogwild!
Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.06261): Parallel LLM Generation via Concurrent Attention\n- [Prism](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.04021): Unleashing GPU Sharing for Cost-Efficient Multi-LLM Serving | UCLA\n- [RetroInfer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.02922): A Vector-Storage Approach for Scalable Long-Context LLM Inference\n- [Efficient Serving of LLM Applications with Probabilistic Demand Modeling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14851)\n- [eLLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14851) : Elastic Memory Management Framework for Efficient LLM Serving\n- [DiSCo](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11417v2): Device-Server Collaborative LLM-Based Text Streaming Services\n- [DynaServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.09285): Unified and Elastic Execution for Dynamic Disaggregated LLM Serving\n- [HyGen](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.14808): Efficient LLM Serving via Elastic Online-Offline Request Co-location\n- [WaferLLM: A Wafer‑Scale LLM Inference System](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04563) | OSDI 25\n- [BlitzScale: Fast and Live Large Model Autoscaling with O(1) Host Caching](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17246) | OSDI 25\n- [TokenWeave: Efficient Compute-Communication Overlap for Distributed LLM Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.11329) | [Code](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftokenweave) | ArXiv'25\n- [Nexus](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.06608v2): Taming Throughput-Latency Tradeoff in LLM Serving via Efficient GPU Sharing\n- [Taming the Chaos](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.19559): Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference | Seed\n- [TokenLake](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.17219): A Unified Segment-level Prefix Cache Pool for Fine-grained Elastic Long-Context LLM Serving\n- [Expert-as-a-Service](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.17863): Towards Efficient, Scalable, and Robust Large-scale MoE Serving\n- [Shift Parallelism](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.16495): Low-Latency, High-Throughput LLM Inference for Dynamic Workloads\n- [Defeating Nondeterminism in LLM Inference](https:\u002F\u002Fthinkingmachines.ai\u002Fblog\u002Fdefeating-nondeterminism-in-llm-inference\u002F)\n- [Deterministic Inference across Tensor Parallel Sizes That Eliminates Training-Inference Mismatch](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.17826): Ensuring deterministic inference across different tensor parallelism configurations\n- [The Cost of Dynamic Reasoning: Demystifying AI Agents and Test-Time Scaling from an AI Infrastructure Perspective](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04301)\n- [Barbarians at the Gate: How AI is Upending Systems Research](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.06189)\n- [Mercury: Unlocking Multi-GPU Operator Optimization for LLMs via Remote Memory Scheduling](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [DiffKV: Differentiated Memory Management for Large Language Models with Parallel KV Compaction](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [Pie: A Programmable Serving System for Emerging LLM Applications](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [Aegaeon: 
Effective GPU Pooling for Concurrent LLM Serving on the Market](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [Jenga: Effective Memory Management for Serving LLM with Heterogeneity](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [IC-Cache: Efficient Large Language Model Serving via In-context Caching](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [PrefillOnly: An Inference Engine for Prefill-only Workloads in Large Language Model Applications](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [KTransformers: Unleashing the Full Potential of CPU\u002FGPU Hybrid Inference for MoE Models](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [The ML.ENERGY Benchmark](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.06371): Toward Automated Inference Energy Measurement and Optimization | NeurIPS' 25\n- [Serve Programs, Not Prompts](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.25412): Efficient LLM serving system for structured program execution\n- [Continuum](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.02230): Efficient and Robust Multi-Turn LLM Agent Scheduling with KV Cache Time-to-Live\n- [AIConfigurator](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.06288): Lightning-Fast Configuration Optimization for Multi-Framework LLM Serving\n- [SuperInfer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.20309): SLO-Aware Rotary Scheduling and Memory Management for LLM Inference on Superchips | MLSys' 26\n- [Scaling Up Efficient Small Language Models Serving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.22101): Serving and Deployment for Semantic Job Search | MLSys' 26\n- [BestServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05871): Serving Strategies with Optimal Goodput in Collocation and Disaggregation Architectures\n- [OptiKIT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.20408): Meeting SLOs, Slashing Hours - Automated Enterprise LLM Optimization | MLSys' 26\n- [BlendServe](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3779212.3790133): Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching | ASPLOS' 26\n- [SwiftSpec](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3779212.3790246): Ultra-Low Latency LLM Decoding by Scaling Asynchronous Speculative Decoding with Disaggregated Pipeline and Fused Kernels | ASPLOS' 26\n- [MuxWise](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3779212.3790236): Towards High-Goodput LLM Serving with Prefill-decode Multiplexing | ASPLOS' 26\n- [MoEless](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.06350): Efficient MoE LLM Serving via Serverless Computing\n- [Online Scheduling for LLM Inference with KV Cache Constraints](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.07115): Optimal Batching and Scheduling for KV Cache-Constrained Inference\n- [BiScale](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.18755): Energy-Efficient Disaggregated LLM Serving via Phase-Aware Placement and DVFS\n- [Harvest](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.00328): Opportunistic Peer-to-Peer GPU Caching for LLM Inference\n- [TokenFlow](files\u002Freport.pdf): Responsive LLM Text Streaming Serving under Request Burst via Preemptive Scheduling | Plagiarism\n\n#### Agent Systems\n- [Supporting Our 
AI Overlords](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.00997): Redesigning Data Systems to be Agent-First | UCB\n- [ALTO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.04311): An Efficient Network Orchestrator for Compound AI Systems | Stanford & UCB\n- [Parrot](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fosdi24\u002Fpresentation\u002Flin-chaofan): Efficient Serving of LLM-based Applications with Semantic Variable | OSDI' 24\n- [Efficiently Serving LLM Reasoning Programs with Certaindex](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.20993) | UCSD\n- [Autellix](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.13965): An Efficient Serving Engine for LLM Agents as General Programs | UCB\n- [RAGO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.14649v2): Systematic Performance Optimization for Retrieval-Augmented Generation Serving | ISCA'25\n- [Circinus](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16397): Efficient Query Planner for Compound ML Serving | UIUC\n- [Patchwork: A Unified Framework for RAG Serving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07833)\n- [DS SERVE](https:\u002F\u002Fberkeley-large-rag.github.io\u002FRAG-DS-Serve\u002F): A Framework for Efficient and Scalable Neural Retrieval | UCB\n- [KVFlow](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.07400): Efficient Prefix Caching for Accelerating LLM-Based Multi-Agent Workflows\n- [DroidSpeak](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02820): KV Cache Sharing for Cross-LLM Communication and Multi-LLM Serving\n- [Murakkab](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.18298): Resource-Efficient Agentic Workflow Orchestration in Cloud Platforms\n- [HedraRAG: Co-Optimizing Generation and Retrieval for Heterogeneous RAG Workflows](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [METIS: Fast Quality-Aware RAG Systems with Configuration Adaptation](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [Aragog: Just-in-Time Model Routing for Scalable Serving of Agentic Workflows](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2511.20975)\n- [DualPath](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.21548): Breaking the Storage Bandwidth Bottleneck in Agentic LLM Inference | DeepSeek\n\n#### Serving at the edge\n- [LLM in a flash: Efficient Large Language Model Inference with Limited Memory](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.11514) | Apple\n- [STI](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.05022): Turbocharge NLP Inference at the Edge via Elastic Pipelining | ASPLOS 23 \n- [PowerInfer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.12456): Fast Large Language Model Serving with a Consumer-grade GPU | SOSP' 24\n- [MoE-Lightning: High-Throughput MoE Inference on Memory-constrained GPUs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11217)\n- [InfiniteHiP](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.08910): Extending Language Model Context Up to 3 Million Tokens on a Single GPU\n- [prima.cpp](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.08791): PRIMA.CPP: Speeding Up 70B-Scale LLM Inference on Low-Resource Everyday Home Clusters\n- [Characterizing Mobile SoC for Accelerating Heterogeneous LLM Inference](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n\n\n#### System Efficiency Optimization - Model Co-design\n- [Sparse-Linear Attention](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2509.24006): SLA: Beyond Sparsity in Diffusion 
Transformers via Fine-Tunable Sparse–Linear Attention | Tsinghua\n- [Fast Distributed Inference Serving for Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.05920) | PKU\n- [FrugalGPT](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.05176.pdf): How to Use Large Language Models While Reducing Cost and Improving Performance | Stanford\n- [H2O](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.14048): Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models | ICML ES-FoMo Workshop 2023\n- [Inference with Reference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.04487): Lossless Acceleration of Large Language Models\n- [SkipDecode](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.02628): Autoregressive Skip Decoding with Batching and Caching for Efficient LLM Inference\n- [Scissorhands](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.17118): Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time\n- [Knowledge-preserving Pruning for Pre-trained Language Models without Retraining](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.03449.pdf) | SNU\n- [Accelerating LLM Inference with Staged Speculative Decoding](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.04623.pdf) | ICML' 23\n- [SpecInfer](https:\u002F\u002Fwww.cs.cmu.edu\u002F~zhihaoj2\u002Fpapers\u002Fspecinfer.pdf): Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification | CMU\n- [Deja Vu](https:\u002F\u002Fproceedings.mlr.press\u002Fv202\u002Fliu23am.html): Contextual Sparsity for Efficient LLMs at Inference Time | ICML' 23\n- [S3](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.06000.pdf): Increasing GPU Utilization during Generative Inference for Higher Throughput | Harvard\n- [LLMCad](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.04255): Fast and Scalable On-device Large Language Model Inference\n- [Skeleton-of-Thought](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.15337): Large Language Models Can Do Parallel Decoding | THU\n- [LoRAShear](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.18356): Efficient Large Language Model Structured Pruning and Knowledge Recovery | Microsoft\n- [Ring Attention](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.01889.pdf) with Blockwise Transformers for Near-Infinite Context | UCB\n- [Learned Best-Effort LLM Serving](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.07886) | UCB\n- [Star Attention](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.17116): Efficient LLM Inference over Long Sequences | NVIDIA\n- [FFN Fusion](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.18908): Rethinking Sequential Computation in Large Language Models\n- [SpargeAttention](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.18137): SpargeAttention: Accurate and Training-free Sparse Attention Accelerating Any Model Inference | ICML' 25\n- [Training Transformers with 4-bit Integers](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.11987) | NeurIPS' 23\n- [Jetfire: Efficient and Accurate Transformer Pretraining with INT8 Data Flow and Per-Block Quantization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.12422) | ICML' 24\n- [COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19313) | ICLR'25\n- [Efficient Mixed-Precision Large Language Model Inference with TurboMind](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.15601v1) | Shanghai AI Lab\n- [Reducing GPU Memory Fragmentation via Spatio-Temporal Allocation
Planning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.16274) | EuroSys' 26\n\n### Multi-Modal Training Systems\n- [DISTMM](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fnsdi24\u002Fpresentation\u002Fhuang): Accelerating distributed multimodal model training | NSDI' 24\n- [Optimus:](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2408.03505) Accelerating Large-Scale Multi-Modal LLM Training by Bubble Exploitation\n- [Addressing Model and Data Heterogeneity in Multimodal Large Language Model Training](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.04275v1) | PKU\n- [Cornstarch](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11367): Distributed Multimodal Training Must Be Multimodality-Aware | UMich\n- [PipeWeaver](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.14145): Addressing Data Dynamicity in Large Multimodal Model Training with Dynamic Interleaved Pipeline | SJTU\n\n### Multi-Modal Serving Systems\n- [xDiT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.01738): an Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism\n- [MOSEL](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.18481.pdf): Inference Serving Using Dynamic Modality Selection\n- [Approximate Caching for Efficiently Serving Diffusion Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.04429) | Adobe Research\n- [Generative AI Beyond LLMs](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.14385): System Implications of Multi-Modal Generation | Meta\n- [Characterizing and Efficiently Accelerating Multimodal Generation Model Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00215) | Meta\n- [DistriFusion:](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.19481) Distributed Parallel Inference for High-Resolution Diffusion Models |  MIT\n- [LongVILA: Scaling Long-Context Visual Language Models for Long Videos](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.10188) | NVIDIA\n- [FlexCache: Flexible Approximate Cache System for Video Diffusion](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04012) | University of Waterloo\n- [DDiT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13497v1): Dynamic Resource Allocation for Diffusion Transformer Model Serving\n- [PATCHEDSERVE](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.09253): A Patch Management Framework for SLO-Optimized Hybrid Resolution Diffusion Serving\n- [ElasticMM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.10069): Efficient Multimodal LLMs Serving with Elastic Multimodal Parallelism\n- [TetriServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.01565): Efficient DiT Serving for Heterogeneous Image Generation\n- [dInfer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.08666): An Efficient Inference Framework for Diffusion Language Models\n- [Fast-dLLM v2](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.26328): Efficient Block-Diffusion LLM\n- [Argus](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.06724): Quality-Aware High-Throughput Text-to-Image Inference Serving System\n- [Cornserve](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.14098): Efficiently Serving Any-to-Any Multimodal Models\n- [HydraInfer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.12658): Hybrid Disaggregated Scheduling for Multimodal Large Language Model Serving\n- [Enabling Disaggregated Multi-Stage MLLM Inference via GPU-Internal Scheduling and Resource Sharing](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.17574)\n- [VoxServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.00269): Streaming-Centric Serving System for Speech Language Models\n- 
[dLLM-Serve](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.17077): Taming the Memory Footprint Crisis for Efficient Diffusion LLM Serving\n- [HADIS](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.00642): Hybrid Adaptive Diffusion Model Serving for Efficient Text-to-Image Generation\n\n\n## LLM for Systems\n- [Large Language Models for Compiler Optimization](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.07062)\n- [The Hitchhiker's Guide to Program Analysis](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.00245): A Journey with Large Language Models\n- [LLM-Assisted Code Cleaning For Training Accurate Code Generators](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.14904) | UCB\n- [Efficient Multi-Task Large Model Training via Data Heterogeneity-aware Model Management](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.03365)\n- [If At First You Don't Succeed, Try, Try, Again...?](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fpublication\u002Fif-at-first-you-dont-succeed-try-try-again-insights-and-llm-informed-tooling-for-detecting-retry-bugs-in-software-systems\u002F) | SOSP' 24\n- [Aceso](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3627703.3629554): Efficient Parallel DNN Training through Iterative Bottleneck Alleviation | EuroSys '24\n- [GMorph](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3627703.3650074): Accelerating Multi-DNN Inference via Model Fusion | EuroSys '24\n- [Automatic Root Cause Analysis via Large Language Models for Cloud Incidents](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3627703.3629553) | EuroSys '24\n- [KNighter: Transforming Static Analysis with LLM-Synthesized Checkers](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [Barbarians at the Gate: How AI is Upending Systems Research](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.06189)\n- [Let the Barbarians In](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.14806): How AI Can Accelerate Systems Performance Research\n- [AI Research Engineering Skills Library](https:\u002F\u002Fgithub.com\u002FzechenzhangAGI\u002Fclaude-ai-research-skills): A collection of AI research engineering skills and best practices\n- [K-Search](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.19128): LLM Kernel Generation via Co-Evolving Intrinsic World Model\n\n## Industrial LLM Technical Report\n\n- [Qwen2.5 Technical Report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15115) - (Dec 2024)\n- [Qwen 3 Technical Report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.09388) - (May 2025)\n- [LLaMA: Open and Efficient Foundation Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.13971) - (Feb 2023)\n- [Llama 2: Open Foundation and Fine‑Tuned Chat Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.09288) - (Jul 2023)\n- [The Llama 3 Herd of Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21783) - (Aug 2024)\n- [Gemini: A Family of Highly Capable Multimodal Models](https:\u002F\u002Fassets.bwbx.io\u002Fdocuments\u002Fusers\u002FiqjWHBFdfxIU\u002Fr7G7RrtT6rnM\u002Fv0) - (Dec 2023)\n- [Gemini 1.5: Unlocking multimodal understanding across millions of tokens](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002Fgemini\u002Fgemini_v1_5_report.pdf) - (Feb 2024)\n- [Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next‑Generation Agentic Capabilities](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002Fgemini\u002Fgemini_v2_5_report.pdf) -
(Jun 2025)\n- [Phi‑4‑reasoning Technical Report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.21318) - (Apr 2025)\n- [Phi‑4 Technical Report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.08905) - (Dec 2024)\n- [Kimi‑VL Technical Report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.07491) - (Apr 2025)\n- [Kimi k1.5: Scaling Reinforcement Learning with LLMs](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12599) - (Jan 2025)\n- [DeepSeek-LLM Technical Report](https:\u002F\u002Fnairl.kr\u002Fwp-content\u002Fuploads\u002F2025\u002F02\u002Fdeepseek_r1_techreport.pdf) - (Jan 2024)\n- [DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.04434) - (May 2024)\n- [DeepSeek-V3 Technical Report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.19437) - (Dec 2024)\n- [DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12948) - (Jan 2025)\n- [Kimi-K2: Open Agentic Intelligence](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.20534) - (Jul 2025)\n- [GPT-oss-120b & GPT-oss-20b](https:\u002F\u002Fcdn.openai.com\u002Fpdf\u002F419b6906-9da6-406c-a19d-1bb078ac7637\u002Foai_gpt-oss_model_card.pdf) - (Aug 2025)\n\n## ML Conferences\n### NeurIPS 2025\n\nA curated collection of **[NeurIPS 2025 papers](neurips25-mlsys\u002F)** focused on efficient systems for generative AI models.
The collection includes papers on:\n- [Architecture & Efficient Mechanisms](neurips25-mlsys\u002Farchitecture.md) - Efficient attention, KV-cache systems, speculative decoding\n- [Model Compression & Quantization](neurips25-mlsys\u002Fcompression.md) - Quantization, pruning, KV cache compression\n- [Inference & Serving](neurips25-mlsys\u002Finference.md) - LLM serving, scheduling, distributed inference\n- [Multi-Modal & Diffusion](neurips25-mlsys\u002Fmulti-modality.md) - VLM efficiency, diffusion optimization\n- [Reinforcement Learning](neurips25-mlsys\u002Frl.md) - RL training infrastructure, policy optimization\n- [Training Systems](neurips25-mlsys\u002Ftraining.md) - Distributed training, memory efficiency\n\nSee the **[full NeurIPS 2025 collection](neurips25-mlsys\u002F)** for detailed categorization and paper summaries.\n\n## LLM Frameworks\n### Training\n- [DeepSpeed](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed): a deep learning optimization library that makes distributed training and inference easy, efficient, and effective | Microsoft\n- [Accelerate](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Faccelerate\u002Findex) | Hugging Face\n- [LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA)\n- [Megatron](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) | Nvidia\n- [NeMo](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo) | Nvidia\n- [torchtitan](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan) | PyTorch\n- [veScale](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fvescale) | ByteDance\n- [DeepSeek Open Infra](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002Fopen-infra-index)\n- [VeOmni](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FVeOmni): Scaling any Modality Model Training\n- [Cornstarch](https:\u002F\u002Fgithub.com\u002Fcornstarch-org\u002FCornstarch): Distributed Multimodal Training Must Be Multimodality-Aware | UMich\n\n### Post-Training\n- [TRL](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl): Transformers Reinforcement Learning\n- [OpenRLHF](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF): An Easy-to-use, Scalable and High-performance RLHF Framework based on Ray\n- [VeRL](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl): Volcano Engine Reinforcement Learning for LLMs\n- [rLLM](https:\u002F\u002Fgithub.com\u002Fagentica-project\u002Frllm): Reinforcement Learning for Language Agents\n- [SkyRL](https:\u002F\u002Fgithub.com\u002FNovaSky-AI\u002FSkyRL): A Modular Full-stack RL Library for LLMs\n- [AReal](https:\u002F\u002Fgithub.com\u002FinclusionAI\u002FAReaL): Distributed RL System for LLM Reasoning\n- [ROLL](https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL): Reinforcement Learning Optimization for Large-Scale Learning\n- [slime](https:\u002F\u002Fgithub.com\u002FTHUDM\u002Fslime): an LLM post-training framework aiming for RL Scaling\n- [RAGEN](https:\u002F\u002Fgithub.com\u002FRAGEN-AI\u002FRAGEN): Training Agents by Reinforcing Reasoning\n- [Agent Lightning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.03680): Train ANY AI Agents with Reinforcement Learning\n\n### Serving\n- [TensorRT-LLM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM) | Nvidia\n- [Ray-LLM](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fray-llm) | Ray\n- [TGI](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftext-generation-inference\u002Fen\u002Findex) | Hugging Face\n- [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) | UCB\n-
[SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang) | UCB\n- [KV Transformers](https:\u002F\u002Fgithub.com\u002Fkvcache-ai\u002Fktransformers)\n- [Dynamo](https:\u002F\u002Fgithub.com\u002Fai-dynamo\u002Fdynamo): A Datacenter Scale Distributed Inference Serving Framework | NVIDIA\n- [LMCache](https:\u002F\u002Fgithub.com\u002FLMCache\u002FLMCache): Supercharge Your LLM with the Fastest KV Cache Layer\n\n## [ML Systems](mlsystems.md)\n\n## Survey Paper\n- [Efficient Large Language Models: A Survey](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.03863)\n- [Challenges and Applications of Large Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.10169.pdf)\n- [Beyond Efficiency: A Systematic Survey of Resource-Efficient Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.00625)\n- [Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.15234)\n\n## LLM Benchmark \u002F Leaderboard \u002F Traces\n- [LLM Energy Leaderboard](https:\u002F\u002Fml.energy\u002Fleaderboard) | Umich\n- [LLM-Perf Leaderboard](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Foptimum\u002Fllm-perf-leaderboard) | HuggingFace\n- [Aviary Explorer](https:\u002F\u002Faviary.anyscale.com\u002F) | Anyscale\n- [Open LLM Leaderboard](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FHuggingFaceH4\u002Fopen_llm_leaderboard) | HuggingFace\n- [HELM](https:\u002F\u002Fcrfm.stanford.edu\u002Fhelm\u002Flatest\u002F) | Stanford\n- [LMSYS](https:\u002F\u002Fchat.lmsys.org) | UCB\n- [Towards Efficient and Reliable LLM Serving: A Real-World Workload Study](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.17644)\n\n## Related ML Readings\n- [Large Transformer Model Inference Optimization](https:\u002F\u002Flilianweng.github.io\u002Fposts\u002F2023-01-10-inference-optimization\u002F)\n- [Transformer Inference Arithmetic](https:\u002F\u002Fkipp.ly\u002Ftransformer-inference-arithmetic\u002F)\n- [The Transformer Family Version 2.0](https:\u002F\u002Flilianweng.github.io\u002Fposts\u002F2023-01-27-the-transformer-family-v2\u002F)\n- [Full Stack Optimization of Transformer Inference](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.14017.pdf): a Survey | UCB\n- [The Smol Training Playbook](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FHuggingFaceTB\u002Fsmol-training-playbook): The Secrets to Building World-Class LLMs | Hugging Face\n- [The Ultra-Scale Playbook](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fnanotron\u002Fultrascale-playbook): Training LLMs on GPU Clusters | Hugging Face\n\n## MLSys Courses\n- Systems for Machine Learning | [Stanford](https:\u002F\u002Fcs229s.stanford.edu\u002Ffall2023\u002F)\n- Systems for Generative AI | [Umich](https:\u002F\u002Fgithub.com\u002Fmosharaf\u002Feecs598\u002Ftree\u002Fw24-genai)\n- Systems for AI - LLMs | [GT](https:\u002F\u002Fcs8803-sp24.anand-iyer.com\u002F)\n\n## Other Reading\n- [A curated list of Large Language Models](https:\u002F\u002Fgithub.com\u002FHannibal046\u002FAwesome-LLM)\n- [AI systems paper list](https:\u002F\u002Fgithub.com\u002Flambda7xx\u002Fawesome-AI-system)\n- [A baseline repository of Auto-Parallelism in Training Neural Networks](https:\u002F\u002Fgithub.com\u002FConnollyLeon\u002Fawesome-Auto-Parallelism)\n- [Numbers every LLM Developer should know](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fllm-numbers)\n- [100,000 H100 
Clusters:](https:\u002F\u002Fwww.semianalysis.com\u002Fp\u002F100000-h100-clusters-power-network) Power, Network Topology, Ethernet vs InfiniBand, Reliability, Failures, Checkpointing\n- [OpenAI Keynote on Building Scalable AI Infrastructure](https:\u002F\u002Fwww.servethehome.com\u002Fopenai-keynote-on-building-scalable-ai-infrastructure\u002F)\n- [Awesome ML SYS Tutorial](https:\u002F\u002Fgithub.com\u002Fzhaochenyang20\u002FAwesome-ML-SYS-Tutorial\u002Ftree\u002Fmain)\n","# 令人惊叹的大语言模型系统论文\n\n一份精选的大语言模型系统相关学术论文、文章、教程、幻灯片和项目的列表。请给本仓库标星，以便及时了解这一蓬勃发展的研究领域的最新进展。\n## 目录\n\n- [LLM 系统](#llm-systems)\n  - [训练](#training)\n    - [预训练](#pre-training)\n    - [后训练](#systems-for-post-training--rlhf)\n    - [容错\u002F长尾任务缓解](#fault-tolerance--straggler-mitigation)\n  - [推理服务](#serving)\n    - [LLM 推理服务](#llm-serving)\n    - [智能体系统](#agent-systems)\n    - [边缘端推理服务](#serving-at-the-edge)\n    - [系统效率优化——模型协同设计](#system-efficiency-optimization---model-co-design)\n  - [多模态训练系统](#multi-modal-training-systems)\n  - [多模态推理服务系统](#multi-modal-serving-systems)\n- [用于系统的 LLM](#llm-for-systems)\n- [工业级 LLM 技术报告](#industrial-llm-technical-report)\n- [机器学习会议](#ml-conferences)\n  - [NeurIPS 2025](#neurips-2025)\n- [LLM 框架](#llm-frameworks)\n  - [训练](#training-1)\n  - [后训练](#post-training)\n  - [推理服务](#serving-1)\n- [机器学习系统](#ml-systems)\n- [综述论文](#survey-paper)\n- [LLM 基准测试\u002F排行榜\u002F追踪数据](#llm-benchmark--leaderboard--traces)\n- [相关机器学习阅读材料](#related-ml-readings)\n- [MLSys 课程](#mlsys-courses)\n- [其他阅读材料](#other-reading)\n\n\n## LLM 系统\n\n### 训练\n#### 预训练\n- [Megatron-LM](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1909.08053.pdf)：使用模型并行训练数十亿参数的语言模型\n- [使用 Megatron-LM 在 GPU 集群上高效训练大规模语言模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2104.04473.pdf)\n- [减少大型 Transformer 模型中的激活重计算](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.05198.pdf)\n- [面向数十亿参数大型语言模型训练的优化网络架构](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.12169.pdf) | MIT\n- [碳排放与大型神经网络训练](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2104.10350.pdf?fbclid=IwAR2o0_3HCtTnMxKbXka0OPrHzl8sCzQSSOYp0AOav76-zVWl_pYek2jX8Pk) | Google、UCB\n- [Perseus](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.06902v1)：消除大型模型训练中的能耗膨胀 | SOSP' 24\n- [MegaScale](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.15627)：将大型语言模型训练扩展到超过 10,000 张 GPU 上 | 字节跳动\n- [DISTMM](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fnsdi24\u002Fpresentation\u002Fhuang)：加速分布式多模态模型训练 | NSDI' 24\n- [异构集群中大型模型训练的调度与并行化协同设计](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.16125)\n- [可控内存的流水线并行](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.15362) | Sea AI Lab\n- [通过 C4 提升大规模并行训练效率](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.04594)：一种通信驱动的方法\n- [突破 GPU 显存限制以训练大型专家混合模型](https:\u002F\u002Fopenreview.net\u002Fpdf?id=uLpyWQPyF9) | ICML' 24\n- [阿里巴巴 HPN：](https:\u002F\u002Fennanzhai.github.io\u002Fpub\u002Fsigcomm24-hpn.pdf) 用于大型语言模型训练的数据中心网络\n- [Llama 3 模型群](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21783)（第 3 节）\n- 实现并行性热切换以高效训练大型语言模型 | SOSP' 24\n- [重新审视大规模机器学习研究集群中的可靠性问题](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.21680)\n- [ScheMoE](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3627703.3650083)：一个可扩展的专家混合分布式训练系统，支持任务调度 | EuroSys '24\n- [DynaPipe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.10418)：通过动态流水线优化多任务训练 | EuroSys '24\n- [HAP](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3627703.3650074)：在异构 GPU 集群上进行 SPMD DNN 训练，并结合自动化程序合成 | EuroSys'24\n- [揭秘变长序列下大型 Transformer 模型训练中的工作负载不均衡问题](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.07894) | 北京大学\n- [提升地理分布式语言模型训练的训练时间和 GPU 
利用率](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14458)\n- [DeepSeek-V3 技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.19437)\n- [Comet](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.19811)：面向专家混合模型的细粒度计算-通信重叠 | 字节跳动\n- [ByteScale](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.21231)：在超过 12,000 张 GPU 上高效扩展 LLM 训练，上下文长度达 2048K | 字节跳动\n- [Megalodon](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.08801)：实现无限上下文长度的高效 LLM 预训练和推理\n- [SPPO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.10377)：通过自适应序列流水线并行卸载实现高效的长序列 LLM 训练\n- [TileLink：利用以 Tile 为中心的原语生成高效的计算-通信重叠核函数](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.20313) | MLSys' 25\n- [每一个 FLOP 都至关重要](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.05139)：在不使用高端 GPU 的情况下扩展 3000 亿参数的专家混合 LING LLM | 蚂蚁集团\n- [FlexSP](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3676641.3715998)：通过灵活的序列并行化加速大型语言模型训练 | ASPLOS '25\n- [WeiPipe](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3710848.3710869)：面向通信高效的长上下文大型模型训练的权重流水线并行 | PPoPP ’25\n- [WLB-LLM：面向大型语言模型训练的工作负载平衡 4D 并行](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2503.17924) | OSDI' 25\n- [Mixtera](https:\u002F\u002Fmboether.com\u002Fassets\u002Fpdf\u002Fbother2025mixtera.pdf)：用于基础模型训练的数据平面 | ETH\n- [Flex Attention](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.05496)：一种用于生成优化注意力核函数的编程模型 | MLSys' 25\n- [平衡流水线并行与词汇表并行](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.05288) | MLSys' 25\n- [SlimPipe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.14519)：面向长上下文 LLM 训练的省显存高效流水线并行 | 快手\n- [利用高效并行策略扩展 Llama 3 训练](https:\u002F\u002Faisystemcodesign.github.io\u002Fpapers\u002FLlama3-ISCA25.pdf) | ISCA' 25\n- [Lumos](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.09307)：面向大规模 LLM 训练的高效性能建模与估算 | MLSys' 25\n- [BurstEngine](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.19836)：一个高效的分布式框架，用于训练超长序列（超过 100 万 tokens）的 Transformer 模型\n- [Zeppelin](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.21841)：在数据并行的大模型训练中平衡变长工作负载\n- [字节跳动稳健的 LLM 训练基础设施](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [Sailor：自动化在动态、异构且地理分布式的集群上进行分布式训练](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [Tempo：基于符号依赖图的编译型动态深度学习](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [Mycroft：追踪集体通信中的依赖关系，以实现可靠的 LLM 训练](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [DCP：通过动态上下文并行解决长上下文训练中的输入动态性问题](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [TrainVerify：面向分布式 LLM 训练的等价性验证](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [面向 10 万+ 张 GPU 的集体通信](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.20171)：针对大规模 GPU 集群的大型集体通信优化\n- [面向 LLM 系统的 RDMA 点对点通信](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.27656)：为分布式 LLM 系统优化基于 RDMA 的点对点通信\n- [MoEBlaze](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.05296)：突破现代 GPU 上高效 MoE 训练的内存墙\n- [Kareus](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.17654)：同时降低大型模型训练中的动态与静态能耗\n- [AXLearn](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.05411)：在异构基础设施上进行模块化的大型模型训练 | MLSys' 26\n- [MoSE](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.06154)：面向高效且自适应语言模型的可裁剪专家混合\n\n#### 微调\u002FRLHF 系统\n- [Ymir:](https:\u002F\u002Ftianweiz07.github.io\u002FPapers\u002F24-ics-2.pdf) 数据中心中基础模型微调工作负载的调度器 | ICS' 24\n- 
[RLHFuse](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.13221): 基于阶段间与阶段内融合的大语言模型高效 RLHF 训练 | NSDI'25\n- [HybridFlow](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2409.19256): 一种灵活高效的 RLHF 框架\n- [ReaLHF](https:\u002F\u002Farxiv.org\u002Fhtml\u002F2406.14088v1): 通过参数重分配优化大语言模型的 RLHF 训练\n- [NeMo-Aligner](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.01481): 面向高效模型对齐的可扩展工具包 | Nvidia\n- [用于加速 RLHF 训练的自适应放置与并行化框架](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.11819) | Ant\n- [利用强化学习进行 LLM 微调的系统机遇](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3721146.3721944)\n- [AReaL](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.24298): 用于语言推理的大规模异步强化学习系统 | [代码](https:\u002F\u002Fgithub.com\u002FinclusionAI\u002FAReaL) | Ant\n- [StreamRL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15930): 具有解耦流式生成的、可扩展、异构且弹性的 LLM 强化学习\n- [RL-Factory](https:\u002F\u002Fgithub.com\u002FSimple-Efficient\u002FRL-Factory): 通过我们简单高效的框架训练您的 Agent 模型\n- [PLoRA](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.02932): 大模型的高效 LoRA 超参数调优\n- [History Rhymes](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.18588): 利用 RhymeRL 加速 LLM 强化学习\n- [APRIL](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.18521): 在强化学习中采用主动部分回放以抑制长尾生成\n- [Laminar](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.12633): 一种可扩展的异步 RL 后训练框架\n- [Seer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.14617): 用于快速同步 LLM 强化学习的在线上下文学习\n- [SkyRL-Agent](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.16108): 面向多轮 LLM Agent 的高效 RL 训练\n\n#### 容错性 \u002F 拖后腿节点缓解\n- [Oobleck:](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.08125) 基于流水线模板的大模型弹性分布式训练 | SOSP' 23\n- [FALCON](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12588): 针对大规模混合并行训练中的拖后腿节点进行精准定位与缓解\n- [Malleus](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13333): 通过可塑的数据与模型并行化实现抗拖后腿的大规模模型混合并行训练\n- [Fire-Flyer AI-HPC：面向深度学习的成本效益软硬件协同设计](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.14158) | DeepSeek SC' 24\n- [Lazarus：基于自适应专家放置的混合专家模型弹性训练](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.04656)\n- [GEMINI:](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3600006.3613145) 基于内存内检查点的分布式训练快速故障恢复\n- [ByteCheckpoint:](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.20143) 一种面向 LLM 开发的统一检查点系统\n- [ReCycle](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2405.14009): 利用流水线自适应实现大型 DNN 的弹性训练 | SOSP' 24\n- [Minder](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.01791): 大规模分布式模型训练中的故障机器检测 | THU\n- [面向高效且容错的异构执行的流式批处理模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.12407)\n- [TrainMover](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.12636): 无内存开销的高效 ML 训练实时迁移 | Alibaba\n- [GPU 韧性和对 AI\u002FHPC 系统的影响分析](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11901) | UIUC\n- [利用假设分析理解大模型训练中的拖后腿现象](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.05713) | OSDI' 25\n- [GoCkpt](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.07035): 基于梯度辅助的多步重叠式检查点技术，用于高效 LLM 训练 | PPoPP' 26\n- [BitSnap](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.12376): LLM 训练中的检查点稀疏化与量化\n\n### 服务\n#### 大语言模型服务\n- [Orca](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fosdi22\u002Fpresentation\u002Fyu)：基于Transformer的生成式模型分布式服务系统 | OSDI'22\n- [响应长度感知与序列调度：一种由大语言模型驱动的大语言模型推理流水线](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13144) | 新加坡国立大学\n- [高效扩展Transformer推理](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.05102.pdf) | MLSys' 23\n- [Flover](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.13484.pdf)：用于高效自回归模型并行推理的时间融合框架\n- [FlashAttention：具有IO感知的快速且内存高效的精确注意力机制](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2205.14135.pdf)\n- 
[FlashAttention-3：](https:\u002F\u002Ftridao.me\u002Fblog\u002F2024\u002Fflash3\u002F) 具有异步和低精度的快速准确注意力机制\n- [SageAttention](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.02367)：适用于即插即用推理加速的精准8位注意力机制 | ICLR 2025\n- [SageAttention2](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.10958)：通过彻底的异常值平滑处理和线程级INT4量化实现的高效注意力机制 | ICML 2025\n- [SageAttention3](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.11594)：SageAttention3：面向推理的微尺度FP4注意力机制及8位训练探索 | NeurIPS 2025亮点论文\n- [SageAttention2++](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.21136)：SageAttention2++：SageAttention2的更高效实现 | ICML ES-FoMo研讨会2025\n- [DeepSpeed Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.00032)：在空前规模下实现Transformer模型的高效推理\n- [TurboTransformers](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2010.05680.pdf)：面向Transformer模型的高效GPU推理系统\n- [FlexGen](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.06865)：仅用单个GPU即可实现高吞吐量的大语言模型生成式推理 | ICML' 23\n- [MPCFormer](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2211.01452.pdf)：利用多方计算实现快速、高性能且私密的Transformer推理 | ICLR'23\n- [POLCA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.12908)：大语言模型云服务商中的电力超分配 | 微软\n- [SARATHI](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.16369)：通过分块预填充搭便车解码实现高效的大语言模型推理 | 微软\n- [AttMemo](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2301.09262.pdf)：在大型内存系统上利用记忆化技术加速自注意力计算\n- [vLLM](https:\u002F\u002Fvllm.ai\u002F)：采用分页注意力机制的简单、快速且廉价的大语言模型服务 | SOSP' 23\n- [Tabi](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3552326.3587438)：面向大语言模型的高效多级推理系统 | EuroSys' 23\n- [Flash-LLM](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.10285v1.pdf)：利用非结构化稀疏性实现经济高效且高度优化的大规模生成式模型推理 | VLDB' 24\n- [AutoGen](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.08155)：通过多智能体对话实现下一代大语言模型应用 | 微软\n- [FlashDecoding++](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2311.01282.pdf)：在GPU上加速大语言模型推理 | 清华大学\n- [DeepSpeed-MII](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed-MII)：用于推理的模型实现（MII）｜ 微软\n- [Punica](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.18547)：多租户LoRA服务 | MLSys' 24\n- [S-LoRA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.03285)：服务于数千个并发LoRA适配器 | MLSys' 24\n- [SpotServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.15566)：在抢占式实例上服务生成式大语言模型 | 卡内基梅隆大学\n- [SuperServe](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.16733.pdf)：针对不可预测工作负载的细粒度推理服务\n- [大语言模型服务中的公平性](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.00588) | OSDI' 24\n- [Infinite-LLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.02669)：结合DistAttention和分布式KV缓存，为长上下文提供高效的大语言模型服务\n- [CaraServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.11240)：面向生成式大语言模型推理的CPU辅助且排名感知的LoRA服务\n- [DistServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.09670)：将预填充与解码分离，以优化大语言模型服务的吞吐量 | OSDI' 24\n- [无干扰推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.11181)：将大语言模型推理解耦，以支持混合下游工作负载\n- [APIServe](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.01869.pdf)：为大语言模型推理提供高效的API支持\n- [FlexLLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.18789)：一个用于同时进行大语言模型推理和参数高效微调的系统\n- [DéjàVu](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.01876)：用于快速、容错的生成式大语言模型服务的KV缓存流式传输\n- [优化关系型工作负载中的大语言模型查询](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.05821) | 加州大学伯克利分校\n- [AttentionStore](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2403.19708.pdf)：在大语言模型服务中跨多轮对话实现低成本注意力复用 | 新加坡国立大学\n- [MuxServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.02015)：灵活的多路复用技术，用于高效的服务多个大语言模型\n- [LoongServe](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.09526.pdf)：利用弹性序列并行性高效地服务长上下文大语言模型 | SOSP' 24\n- 
[RAGCache](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.12457)：用于检索增强生成的高效知识缓存 | 北京大学\n- [Andes](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.16283)：定义并提升基于大语言模型的文本流媒体服务质量 | 密歇根大学\n- [BlockLLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.18322)：面向大语言模型的多租户细粒度服务\n- [vAttention](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.04437)：无需分页注意力机制即可实现大语言模型服务的动态内存管理\n- [Helix](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.01566)：通过异构GPU上的最大流算法实现大语言模型的分布式服务 | 卡内基梅隆大学\n- [Eloquent](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2401.12961v2)：一种更为鲁棒的大语言模型令牌流传输方案 | NAIC' 24\n- [优化基于良好吞吐量的大语言模型服务中的推测解码](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.14066v1) | 加州大学伯克利分校\n- [利用MultiWorld实现弹性模型服务](https:\u002F\u002Farxiv.org\u002Fhtml\u002F2407.08980v1) | 思科研究院\n- [Prepacking](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.09529)：一种用于快速预填充并提高大语言模型吞吐量的简单方法\n- [NanoFlow](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.12757)：迈向最优的大语言模型服务吞吐量\n- [使用AQUA在多租户环境中实现响应迅速的机器学习推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21255)\n- [只需一个队列就够了](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.00047)：解决大语言模型服务中的队头阻塞问题\n- [MemServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.17565)：带有弹性内存池的解聚式大语言模型服务中的上下文缓存\n- [dLoRA：动态编排请求和适配器以服务LoRA大语言模型](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fosdi24\u002Fpresentation\u002Fwu-bingyang) | OSDI' 24\n- [Llumnix](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fosdi24\u002Fpresentation\u002Fsun-biao)：面向大语言模型服务的动态调度 | OSDI' 24\n- [利用Sarathi-Serve驯服大语言模型推理中的吞吐量-延迟权衡](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fosdi24\u002Fpresentation\u002Fagrawal) | OSDI' 24\n- [InfiniGen](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fosdi24\u002Fpresentation\u002Flee)：通过动态KV缓存管理实现大语言模型的高效生成式推理\n- [ServerlessLLM：面向大语言模型的低延迟无服务器推理](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fosdi24\u002Fpresentation\u002Ffu) | OSDI' 24\n- [CacheGen：用于快速大语言模型服务的KV缓存压缩与流式传输](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3651890.3672274) | SIGCOMM' 24\n- [Preble](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.00023)：高效的大语言模型服务分布式提示调度\n- [Mnemosyne](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.17264)：用于高效服务不含近似值的数百万上下文长度大语言模型推理请求的并行化策略\n- [ConServe](https:\u002F\u002Farxiv.org\u002Fhtml\u002F2410.01228v1)：挖掘GPU潜力以实现低延迟、高吞吐量的大语言模型服务\n- [BlockLLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.18322)：面向大语言模型的多租户细粒度服务\n- [面向可扩展百万标记推理的上下文并行性](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.01783)\n- [Pie](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.09317)：为大语言模型推理汇集CPU内存\n- [NEO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.01142)：通过在线大语言模型推理中的CPU卸载来缓解GPU内存危机\n- [FastSwitch](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18424)：在注重公平性的大语言模型服务中优化上下文切换效率\n- [Flash Communication](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04964)：减少张量并行化瓶颈，以实现快速的大语言模型推理\n- [FlashInfer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.01005)：面向大语言模型推理服务的高效且可定制的注意力引擎\n- [面向增强型大语言模型的快速推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18248)\n- [一个用于大语言模型微型服务的系统](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.12488) | 卡内基梅隆大学\n- [iServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.13111)：一个基于意图的大语言模型服务系统 | 德克萨斯大学奥斯汀分校\n- [大语言模型服务中的局部性感知公平调度](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.14312) | 加州大学伯克利分校\n- [迈向高效的大规模多模态模型服务](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.00937) | 微软\n- [DeltaZip：高效服务多个全模型调优的大语言模型](https:\u002F\u002Fanakli.inf.ethz.ch\u002Fpapers\u002Fdeltazip.pdf)\n- [PIM就是你所需要的](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.07578)：一个支持CXL的无GPU大语言模型推理系统 | 
ASPLOS' 25\n- [λScale](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.09922)：实现无服务器大语言模型推理的快速扩展\n- [AIBrix：迈向可扩展且经济高效的大语言模型推理基础设施](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Faibrix\u002Fblob\u002Fmain\u002Fdocs\u002Fpaper\u002FAIBrix_White_Paper_0219_2025.pdf) | vLLM\n- [快速与慢速模型服务：优化大规模异构大语言模型推理工作负载](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14617)\n- [让每个人都能负担得起大语言模型推理：通过NDP-DIMM扩充GPU内存](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.16963)\n- [Jenga](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.18292)：有效管理异构环境下的大语言模型服务内存\n- [AQUA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21255)：在网络加速下对大规模GPU集群中的大语言模型进行内存卸载 | ASPLOS 2025\n- [MegaScale-Infer](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.02263)：通过解聚专家并行化以规模化方式服务混合专家模型 | 字节跳动\n- [借助Ayo实现基于大语言模型的应用端到端优化](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3676641.3716278) | ASPLOS '25\n- [CacheBlend](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3689031.3696098)：通过缓存知识融合实现快速的RAG大语言模型服务 | EuroSys' 25（最佳论文）\n- [ThunderServe](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.09334)：在云环境中提供高性能且经济高效的大语言模型服务 | MLSys' 25\n- [SLOs-Serve](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.08784)：优化多SLO大语言模型的服务\n- [Tempo](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.20068)：具有混合SLO要求的应用感知大语言模型服务\n- [Hogwild! Inference](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.06261)：通过并发注意力实现并行大语言模型生成\n- [Prism](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2505.04021)：释放GPU共享潜力，以实现经济高效的多大语言模型服务 | UCLA\n- [RetroInfer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.02922)：一种面向可扩展长上下文大语言模型推理的向量存储方法\n- [基于概率需求建模的高效大语言模型服务](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14851)\n- [eLLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14851)：用于高效大语言模型服务的弹性内存管理框架\n- [DiSCo](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11417v2)：设备-服务器协作的大语言模型文本流媒体服务\n- [DynaServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.09285)：为动态解聚式大语言模型服务提供统一且弹性化的执行\n- [HyGen](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.14808)：通过在线-离线请求的弹性共处实现高效的大语言模型服务\n- [WaferLLM：一种晶圆级大语言模型推理系统](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04563) | OSDI 25\n- [BlitzScale：以O(1)主机缓存实现快速实时的大模型自动缩放](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17246) | OSDI 25\n- [TokenWeave：用于分布式大语言模型推理的高效计算-通信重叠](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.11329) | [代码](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Ftokenweave) | ArXiv'25\n- [Nexus](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.06608v2)：通过高效GPU共享驯服大语言模型服务中的吞吐量-延迟权衡\n- [驯服混乱](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.19559)：协调异构且解聚式的大语言模型推理的自动缩放 | Seed\n- [TokenLake](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.17219)：一个统一的细分级别前缀缓存池，用于细粒度的弹性长上下文大语言模型服务\n- [专家即服务](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.17863)：迈向高效、可扩展且稳健的大规模MoE服务\n- [Shift Parallelism](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.16495)：面向动态工作负载的低延迟、高吞吐量大语言模型推理\n- [击败大语言模型推理中的非确定性](https:\u002F\u002Fthinkingmachines.ai\u002Fblog\u002Fdefeating-nondeterminism-in-llm-inference\u002F)\n- [消除训练-推理不匹配的跨张量并行规模的确定性推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.17826)：确保在不同张量并行配置下的一致性推理\n- [动态推理的成本：从AI基础设施视角解读AI代理与测试时缩放](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04301)\n- [城门外的蛮族：AI如何颠覆系统研究](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2510.06189)\n- [Mercury：通过远程内存调度解锁面向大语言模型的多GPU算子优化](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [DiffKV：面向大语言模型的差异化内存管理，支持并行KV压缩](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- 
[Pie：一个面向新兴大语言模型应用的可编程服务系统](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [Aegaeon：在市场上实现并发大语言模型服务的有效GPU池](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [Jenga：有效管理异构环境下大语言模型服务的内存](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [IC-Cache：通过上下文内缓存实现高效的大语言模型服务](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [PrefillOnly：一个用于大语言模型应用中仅预填充工作负载的推理引擎](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [KTransformers：释放CPU\u002FGPU混合推理在MoE模型中的全部潜力](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [ML.ENERGY基准测试](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.06371)：迈向自动化的大语言模型推理能量测量与优化 | NeurIPS' 25\n- [服务程序，而非提示](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.25412)：用于结构化程序执行的高效大语言模型服务系统\n- [Continuum](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.02230)：高效且稳健的多轮大语言模型代理调度，结合KV缓存的存活时间\n- [AIConfigurator](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.06288)：面向多框架大语言模型服务的闪电般快速配置优化\n- [SuperInfer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.20309)：面向超级芯片的大语言模型推理，具备SLO感知的轮换调度和内存管理 | MLSys' 26\n- [扩大高效小型语言模型服务规模](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.22101)：面向语义求职的大语言模型服务与部署 | MLSys' 26\n- [BestServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05871)：在共置与解聚架构中实现最佳良好吞吐量的服务策略\n- [OptiKIT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.20408)：满足SLO要求、大幅缩短时间——企业级大语言模型自动化优化 | MLSys' 26\n- [BlendServe](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3779212.3790133)：通过资源感知批处理优化自回归大型模型的离线推理 | ASPLOS' 26\n- [SwiftSpec](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3779212.3790246)：通过解聚式管道和融合核，在扩展异步推测解码的同时实现超低延迟的大语言模型解码 | ASPLOS' 26\n- [MuxWise](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fabs\u002F10.1145\u002F3779212.3790236)：迈向高良好吞吐量的大语言模型服务，采用预填充-解码多路复用 | ASPLOS' 26\n- [MoEless](https:\u002F\u002Farxiv.org\u002Fabs\u002F2603.06350)：通过无服务器计算实现高效MoE大语言模型服务\n- [基于KV缓存约束的大语言模型推理在线调度](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.07115)：针对KV缓存限制的推理的最佳批处理与调度\n- [BiScale](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.18755)：通过相位感知放置和DVFS实现节能的解聚式大语言模型服务\n- [Harvest](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.00328)：面向大语言模型推理的机会性点对点GPU缓存\n- [TokenFlow](files\u002Freport.pdf)：通过抢占式调度应对请求突发，实现响应迅速的大语言模型文本流媒体服务 | 抄袭\n\n#### 代理系统\n- [支持我们的AI霸主](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2509.00997)：重新设计数据系统，以代理为先 | UCB\n- [ALTO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.04311)：用于复合AI系统的高效网络编排器 | 斯坦福大学 & UCB\n- [Parrot](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fosdi24\u002Fpresentation\u002Flin-chaofan)：基于语义变量的LLM应用高效服务 | OSDI' 24\n- [使用Certaindex高效服务LLM推理程序](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2412.20993) | UCSD\n- [Autellix](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.13965)：作为通用程序的LLM代理的高效服务引擎 | UCB\n- [RAGO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.14649v2)：检索增强生成服务的系统性性能优化 | ISCA'25\n- [Circinus](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16397)：用于复合ML服务的高效查询计划器 | UIUC\n- [Patchwork：RAG服务的统一框架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07833)\n- [DS SERVE](https:\u002F\u002Fberkeley-large-rag.github.io\u002FRAG-DS-Serve\u002F)：用于高效可扩展神经检索的框架 | UCB\n- [KVFlow](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.07400)：用于加速基于LLM的多代理工作流的高效前缀缓存\n- 
[DroidSpeak](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02820)：跨LLM通信与多LLM服务中的KV缓存共享\n- [Murakkab](https:\u002F\u002Farxiv.org\u002Fabs\u002F2508.18298)：云平台中资源高效的代理式工作流编排\n- [HedraRAG：针对异构RAG工作流的生成与检索协同优化](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [METIS：具有配置自适应功能的快速质量感知RAG系统](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [Aragog：面向代理式工作流的可扩展服务的即时模型路由](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2511.20975)\n- [DualPath](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.21548)：突破代理式LLM推理中的存储带宽瓶颈 | DeepSeek\n\n#### 边缘计算服务\n- [闪存中的LLM：有限内存下的高效大型语言模型推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.11514) | 苹果公司\n- [STI](https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.05022)：通过弹性流水线加速边缘端NLP推理 | ASPLOS 23\n- [PowerInfer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.12456)：使用消费级GPU实现快速大型语言模型服务 | SOSP' 24\n- [MoE-Lightning：在内存受限的GPU上进行高吞吐量的MoE推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11217)\n- [InfiniteHiP](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.08910)：在单个GPU上将语言模型上下文扩展至300万 tokens\n- [prima.cpp](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2504.08791)：PRIMA.CPP：在低资源的家用集群上加速70B规模LLM推理\n- [面向加速异构LLM推理的移动SoC特性研究](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n\n\n#### 系统效率优化——模型协同设计\n- [稀疏线性注意力](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2509.24006)：SLA：通过可微调的稀疏线性注意力超越扩散Transformer中的稀疏性 | 清华大学\n- [大型语言模型的快速分布式推理服务](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.05920) | 北京大学\n- [FrugalGPT](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.05176.pdf)：如何在降低成本并提升性能的同时使用大型语言模型 | 斯坦福大学\n- [H2O](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.14048)：大型语言模型高效生成式推理的重采样Oracle | ICML ES-FoMo Workshop 2023\n- [参考推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.04487)：无损加速大型语言模型\n- [SkipDecode](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.02628)：利用批处理和缓存实现自回归跳过解码，以高效进行LLM推理\n- [Scissorhands](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.17118)：利用重要性假设的持久性，在测试时压缩LLM KV缓存\n- [无需再训练的预训练语言模型知识保留型剪枝](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.03449.pdf) | 首尔国立大学\n- [分阶段推测解码加速LLM推理](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2308.04623.pdf) | ICML' 23\n- [SpecInfer](https:\u002F\u002Fwww.cs.cmu.edu\u002F~zhihaoj2\u002Fpapers\u002Fspecinfer.pdf)：通过推测推理和标记树验证加速生成式LLM服务 | CMU\n- [Deja Vu](https:\u002F\u002Fproceedings.mlr.press\u002Fv202\u002Fliu23am.html)：推理时的上下文稀疏性以提高LLM效率 | ICML' 23\n- [S3](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2306.06000.pdf)：在生成式推理过程中提高GPU利用率以获得更高吞吐量 | 哈佛大学\n- [LLMCad](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.04255)：快速且可扩展的设备端大型语言模型推理\n- [思维骨架](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.15337)：大型语言模型可以进行并行解码 | 清华大学\n- [LoRAShear](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.18356)：高效的大规模语言模型结构化剪枝与知识恢复｜微软\n- [环形注意力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.01889.pdf)：采用分块Transformer实现近乎无限的上下文 | UCB\n- [学习型尽力而为LLM服务](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.07886) | UCB\n- [星形注意力](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2411.17116)：长序列上的高效LLM推理| NVIDIA\n- [FFN融合](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.18908)：重新思考大型语言模型中的顺序计算\n- [SpargeAttention](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2502.18137)：SpargeAttention：准确且无需训练的稀疏注意力，可加速任何模型的推理 | ICML' 25\n- [使用4位整数训练Transformer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.11987) | NeurIPS' 23\n- [Jetfire：采用INT8数据流和逐块量化实现高效准确的Transformer预训练](https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.12422) | ICML' 24\n- 
[COAT：压缩优化器状态和激活以实现内存高效的FP8训练](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19313) | ICLR'25\n- [使用TurboMind实现高效混合精度大型语言模型推理](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.15601v1) | 上海人工智能实验室\n- [通过时空分配规划减少GPU内存碎片](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.16274) | EuroSys' 26\n\n\n\n### 多模态训练系统\n- [DISTMM](https:\u002F\u002Fwww.usenix.org\u002Fconference\u002Fnsdi24\u002Fpresentation\u002Fhuang)：加速分布式多模态模型训练 | NSDI' 24\n- [Optimus：](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2408.03505)：通过气泡效应加速大规模多模态LLM训练\n- [解决多模态大型语言模型训练中的模型与数据异质性问题](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.04275v1) | 北京大学\n- [Cornstarch](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11367)：分布式多模态训练必须具备多模态意识 | 密歇根大学\n- [PipeWeaver](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.14145)：采用动态交错流水线应对大型多模态模型训练中的数据动态性 | 上海交通大学\n\n### 多模态推理服务系统\n- [xDiT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.01738)：具有大规模并行性的扩散Transformer（DiT）推理引擎\n- [MOSEL](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2310.18481.pdf)：基于动态模态选择的推理服务\n- [用于高效服务扩散模型的近似缓存](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.04429) | Adobe Research\n- [超越大语言模型的生成式AI](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2312.14385)：多模态生成的系统影响 | Meta\n- [多模态生成模型推理的特性分析与高效加速](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00215) | Meta\n- [DistriFusion：](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.19481) 高分辨率扩散模型的分布式并行推理 | MIT\n- [LongVILA：面向长视频的长上下文视觉语言模型扩展](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.10188) | NVIDIA\n- [FlexCache：用于视频扩散的灵活近似缓存系统](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.04012) | 滑铁卢大学\n- [DDiT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13497v1)：扩散Transformer模型服务的动态资源分配\n- [PATCHEDSERVE](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.09253)：面向SLO优化的混合分辨率扩散服务的补丁管理框架\n- [ElasticMM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.10069)：采用弹性多模态并行性的高效多模态大语言模型服务\n- [TetriServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.01565)：面向异构图像生成的高效DiT服务\n- [dInfer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.08666)：扩散语言模型的高效推理框架\n- [Fast-dLLM v2](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.26328)：高效的块扩散大语言模型\n- [Argus](https:\u002F\u002Farxiv.org\u002Fabs\u002F2511.06724)：质量感知的高吞吐量文本到图像推理服务系统\n- [Cornserve](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.14098)：高效服务任意模态之间的多模态模型\n- [HydraInfer](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.12658)：用于多模态大型语言模型服务的混合解聚调度\n- [通过GPU内部调度与资源共享实现解聚式多阶段MLLM推理](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.17574)\n- [VoxServe](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.00269)：以流媒体为中心的语音语言模型服务系统\n- [dLLM-Serve](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.17077)：缓解内存占用危机，实现高效扩散语言模型服务\n- [HADIS](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.00642)：用于高效文本到图像生成的混合适应性扩散模型服务\n\n\n## 用于系统研究的大语言模型\n- [用于编译器优化的大语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.07062)\n- [程序员分析指南](https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.00245)：一场与大语言模型同行的旅程\n- [LLM辅助代码清理以训练准确的代码生成器](https:\u002F\u002Farxiv.org\u002Fabs\u002F2311.14904) | UCB\n- [通过数据异质性感知的模型管理实现高效多任务大型模型训练](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.03365)\n- [如果初次失败，再试、再试、再试……？](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fpublication\u002Fif-at-first-you-dont-succeed-try-try-again-insights-and-llm-informed-tooling-for-detecting-retry-bugs-in-software-systems\u002F) | SOSP' 24\n- [Aceso](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002Fpdf\u002F10.1145\u002F3627703.3629554)：通过迭代缓解瓶颈实现高效并行DNN训练 | EuroSys '24\n- [GMorph](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3627703.3650074)：通过模型融合加速多DNN推理 | EuroSys 
'24\n- [利用大语言模型对云事件进行自动根因分析](https:\u002F\u002Fdl.acm.org\u002Fdoi\u002F10.1145\u002F3627703.3629553) | EuroSys '24\n- [KNighter：用LLM合成的检查器重塑静态分析](https:\u002F\u002Fsigops.org\u002Fs\u002Fconferences\u002Fsosp\u002F2025\u002Faccepted.html) | SOSP' 25\n- [城门外的蛮族：AI如何颠覆系统研究](https:\u002F\u002Farxiv.org\u002Fabs\u002F2510.06189)\n- [让蛮族进来](https:\u002F\u002Farxiv.org\u002Fabs\u002F2512.14806)：AI如何加速系统性能研究\n- [AI研究工程技能库](https:\u002F\u002Fgithub.com\u002FzechenzhangAGI\u002Fclaude-ai-research-skills)：AI研究工程技能与最佳实践合集\n- [K-Search](https:\u002F\u002Farxiv.org\u002Fabs\u002F2602.19128)：通过协同进化的内在世界模型生成LLM内核\n\n## 工业级大语言模型技术报告\n\n- [Qwen2.5技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15115) - (2024年12月)\n- [Qwen 3技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.09388) – (2025年5月)\n- [LLaMA：开放且高效的基座语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.13971) - (2023年2月)\n- [Llama 2：开放的基座模型与微调后的聊天模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.09288) - (2023年7月)\n- [Llama 3模型家族](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21783) - (2024年8月)\n- [Gemini：高能力的多模态模型家族](https:\u002F\u002Fassets.bwbx.io\u002Fdocuments\u002Fusers\u002FiqjWHBFdfxIU\u002Fr7G7RrtT6rnM\u002Fv0) - (2023年12月)\n- [Gemini 1.5：解锁跨越数百万标记的多模态理解](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002Fgemini\u002Fgemini_v1_5_report.pdf) - (2024年2月)\n- [Gemini 2.5：以先进推理、多模态、长上下文及下一代代理能力推动前沿](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002Fgemini\u002Fgemini_v2_5_report.pdf) - (2025年6月)\n- [Phi‑4‑reasoning技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.21318) – (2025年4月)\n- [Phi‑4技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.08905) – (2024年12月)\n- [Kimi‑VL技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.07491) – (2025年4月)\n- [Kimi k1.5：利用LLM扩展强化学习](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12599) – (2025年1月)\n- [DeepSeek-LLM技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.02954) - (2024年1月)\n- [DeepSeek-V2：强大、经济且高效的专家混合语言模型](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.04434) - (2024年5月)\n- [DeepSeek-V3技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.19437) - (2024年12月)\n- [DeepSeek-R1：通过强化学习激励LLM的推理能力](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12948) - (2025年1月)\n- [Kimi-K2：开放的代理智能](https:\u002F\u002Farxiv.org\u002Fabs\u002F2507.20534) – (2025年7月)\n- [GPT-oss-120b & GPT-oss-20b](https:\u002F\u002Fcdn.openai.com\u002Fpdf\u002F419b6906-9da6-406c-a19d-1bb078ac7637\u002Foai_gpt-oss_model_card.pdf) – (2025年8月)\n\n## 机器学习会议\n\n### NeurIPS 2025\n\n一个精选的 **[NeurIPS 2025 论文集](neurips25-mlsys\u002F)**，专注于生成式 AI 模型的高效系统。该集合包括以下主题的论文：\n- [架构与高效机制](neurips25-mlsys\u002Farchitecture.md) - 高效注意力机制、KV 缓存系统、推测解码\n- [模型压缩与量化](neurips25-mlsys\u002Fcompression.md) - 量化、剪枝、KV 缓存压缩\n- [推理与服务](neurips25-mlsys\u002Finference.md) - LLM 服务、调度、分布式推理\n- [多模态与扩散模型](neurips25-mlsys\u002Fmulti-modality.md) - VLM 效率、扩散模型优化\n- [强化学习](neurips25-mlsys\u002Frl.md) - RL 训练基础设施、策略优化\n- [训练系统](neurips25-mlsys\u002Ftraining.md) - 分布式训练、内存效率\n\n请参阅 **[完整的 NeurIPS 2025 论文集](neurips25-mlsys\u002F)**，以获取详细的分类和论文摘要。\n\n## LLM 框架\n### 训练\n- [DeepSpeed](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed)：一个深度学习优化库，使分布式训练和推理变得简单、高效且有效 | 微软\n- 
[Accelerate](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Faccelerate\u002Findex) | Hugging Face\n- [LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA)\n- [Megatron](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM) | Nvidia\n- [NeMo](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo) | Nvidia\n- [torchtitan](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchtitan) | PyTorch\n- [veScale](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fvescale) | 字节跳动\n- [DeepSeek Open Infra](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002Fopen-infra-index)\n- [VeOmni](https:\u002F\u002Fgithub.com\u002FByteDance-Seed\u002FVeOmni)：扩展任意模态模型的训练\n- [Cornstarch](https:\u002F\u002Fgithub.com\u002Fcornstarch-org\u002FCornstarch)：分布式多模态训练必须具备多模态感知能力 | UMich\n\n### 后训练\n- [TRL](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftrl)：Transformers 强化学习\n- [OpenRLHF](https:\u002F\u002Fgithub.com\u002FOpenRLHF\u002FOpenRLHF)：基于 Ray 的易用、可扩展且高性能的 RLHF 框架\n- [VeRL](https:\u002F\u002Fgithub.com\u002Fvolcengine\u002Fverl)：火山引擎针对 LLM 的强化学习\n- [rLLM](https:\u002F\u002Fgithub.com\u002Fagentica-project\u002Frllm)：面向语言智能体的强化学习\n- [SkyRL](https:\u002F\u002Fgithub.com\u002FNovaSky-AI\u002FSkyRL)：面向 LLM 的模块化全栈强化学习库\n- [AReal](https:\u002F\u002Fgithub.com\u002FinclusionAI\u002FAReaL)：用于 LLM 推理的分布式 RL 系统\n- [ROLL](https:\u002F\u002Fgithub.com\u002Falibaba\u002FROLL)：大规模学习的强化学习优化\n- [slime](https:\u002F\u002Fgithub.com\u002FTHUDM\u002Fslime)：旨在实现 RL 扩展的 LLM 后训练框架\n- [RAGEN](https:\u002F\u002Fgithub.com\u002FRAGEN-AI\u002FRAGEN)：通过强化推理来训练智能体\n- [Agent Lightning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2508.03680)：使用强化学习训练任何 AI 智能体\n\n### 服务\n- [TensorRT-LLM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM) | Nvidia\n- [Ray-LLM](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fray-llm) | Ray\n- [TGI](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftext-generation-inference\u002Fen\u002Findex) | Hugging Face\n- [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm) | UCB\n- [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang) | UCB\n- [KV Transformers](https:\u002F\u002Fgithub.com\u002Fkvcache-ai\u002Fktransformers)\n- [Dynamo](https:\u002F\u002Fgithub.com\u002Fai-dynamo\u002Fdynamo)：一个数据中心规模的分布式推理服务框架 | NVIDIA\n- [LMCache](https:\u002F\u002Fgithub.com\u002FLMCache\u002FLMCache)：用最快的 KV 缓存层加速你的 LLM\n\n## [ML 系统](mlsystems.md)\n\n## 综述论文\n- [高效大型语言模型：综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.03863)\n- [大型语言模型的挑战与应用](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2307.10169.pdf)\n- [超越效率：资源高效大型语言模型的系统性综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.00625)\n- [迈向高效的生成式大型语言模型服务：从算法到系统的综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.15234)\n\n## LLM 基准测试 \u002F 排行榜 \u002F 追踪数据\n- [LLM 能源排行榜](https:\u002F\u002Fml.energy\u002Fleaderboard) | Umich\n- [LLM-Perf 排行榜](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Foptimum\u002Fllm-perf-leaderboard) | HuggingFace\n- [Aviary Explorer](https:\u002F\u002Faviary.anyscale.com\u002F) | Anyscale\n- [开放 LLM 排行榜](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FHuggingFaceH4\u002Fopen_llm_leaderboard) | HuggingFace\n- [HELM](https:\u002F\u002Fcrfm.stanford.edu\u002Fhelm\u002Flatest\u002F) | 斯坦福\n- [LMSYS](https:\u002F\u002Fchat.lmsys.org) | UCB\n- [迈向高效可靠的 LLM 服务：一项真实工作负载研究](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.17644)\n\n## 相关 ML 阅读材料\n- [大型 Transformer 模型推理优化](https:\u002F\u002Flilianweng.github.io\u002Fposts\u002F2023-01-10-inference-optimization\u002F)\n- [Transformer 
推理中的算术运算](https:\u002F\u002Fkipp.ly\u002Ftransformer-inference-arithmetic\u002F)\n- [Transformer 家族 2.0 版本](https:\u002F\u002Flilianweng.github.io\u002Fposts\u002F2023-01-27-the-transformer-family-v2\u002F)\n- [Transformer 推理的全栈优化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.14017.pdf)：一份综述 | UCB\n- [小型训练手册](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FHuggingFaceTB\u002Fsmol-training-playbook)：构建世界级 LLM 的秘诀 | Hugging Face\n- [超大规模训练手册](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fnanotron\u002Fultrascale-playbook)：在 GPU 集群上训练 LLM | Hugging Face\n\n## MLSys 课程\n- 机器学习系统 | [斯坦福](https:\u002F\u002Fcs229s.stanford.edu\u002Ffall2023\u002F)\n- 生成式 AI 系统 | [Umich](https:\u002F\u002Fgithub.com\u002Fmosharaf\u002Feecs598\u002Ftree\u002Fw24-genai)\n- AI 系统 - LLMs | [GT](https:\u002F\u002Fcs8803-sp24.anand-iyer.com\u002F)\n\n## 其他阅读材料\n- [大型语言模型精选列表](https:\u002F\u002Fgithub.com\u002FHannibal046\u002FAwesome-LLM)\n- [AI 系统论文列表](https:\u002F\u002Fgithub.com\u002Flambda7xx\u002Fawesome-AI-system)\n- [神经网络训练中自动并行化的基准仓库](https:\u002F\u002Fgithub.com\u002FConnollyLeon\u002Fawesome-Auto-Parallelism)\n- [每个 LLM 开发者都应该知道的数字](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fllm-numbers)\n- [10 万个 H100 集群：](https:\u002F\u002Fwww.semianalysis.com\u002Fp\u002F100000-h100-clusters-power-network) 功率、网络拓扑、以太网 vs InfiniBand、可靠性、故障、检查点\n- [OpenAI 关于构建可扩展 AI 基础设施的主旨演讲](https:\u002F\u002Fwww.servethehome.com\u002Fopenai-keynote-on-building-scalable-ai-infrastructure\u002F)\n- [Awesome ML SYS 教程](https:\u002F\u002Fgithub.com\u002Fzhaochenyang20\u002FAwesome-ML-SYS-Tutorial\u002Ftree\u002Fmain)","# LLMSys-PaperList 快速上手指南\n\n**LLMSys-PaperList** 并非一个需要编译或运行的软件工具，而是一个**精选的学术资源列表仓库**。它汇集了大语言模型（LLM）系统领域（包括训练、推理、容错、多模态等方向）的最新论文、技术报告、教程和项目。\n\n本指南旨在帮助开发者快速获取该列表资源，并高效利用其中的文献进行学习与研究。\n\n## 环境准备\n\n由于本项目本质是一个托管在 GitHub 上的文档列表，对环境要求极低：\n\n*   **操作系统**：Windows, macOS, Linux 均可。\n*   **前置依赖**：\n    *   **Git**：用于克隆仓库到本地（推荐）。\n    *   **浏览器**：用于直接在线阅读或访问论文链接。\n    *   **Markdown 阅读器**（可选）：如 VS Code、Typora 或 Obsidian，用于在本地获得更好的阅读体验。\n\n> **国内加速建议**：\n> 如果直接访问 GitHub 速度较慢，推荐使用国内镜像源克隆，或使用代理加速。\n> *   **Gitee 镜像**（如有同步）：`https:\u002F\u002Fgitee.com\u002Fmirrors\u002FLLMSys-PaperList` (需确认具体镜像地址)\n> *   **FastGit 加速**：将 `github.com` 替换为 `fastgit.org` 进行克隆（该服务是否仍可用需自行确认）。\n\n## 安装步骤（获取资源）\n\n你可以通过以下两种方式获取该论文列表：\n\n### 方式一：克隆到本地（推荐，便于离线浏览和更新）\n\n打开终端（Terminal 或 CMD），执行以下命令：\n\n```bash\n# 原始仓库\ngit clone https:\u002F\u002Fgithub.com\u002FAmberLJC\u002FLLMSys-PaperList.git\n\n# 或者使用国内加速地址示例 (如果可用)\n# git clone https:\u002F\u002Ffastgit.org\u002FAmberLJC\u002FLLMSys-PaperList.git\n```\n\n进入目录：\n```bash\ncd LLMSys-PaperList\n```\n\n### 方式二：在线浏览\n\n直接访问 GitHub 仓库页面查看 `README.md` 文件，无需安装任何软件：\n*   地址：`https:\u002F\u002Fgithub.com\u002FAmberLJC\u002FLLMSys-PaperList`\n\n## 基本使用\n\n获取列表后，你可以根据研究方向快速定位所需资源。以下是几种典型的使用场景：\n\n### 1. 按主题查找论文\n打开本地的 `README.md` 文件或在线页面，利用目录（Table of Contents）跳转。例如：\n\n*   **想学习大模型训练并行策略**：\n    找到 `LLM Systems` -> `Training` -> `Pre-training` 章节。\n    *   推荐阅读：*Megatron-LM*, *MegaScale*, *DeepSeek-V3 Technical Report*。\n*   **想研究 RLHF 系统优化**：\n    找到 `LLM Systems` -> `Training` -> `Systems for Post-training \u002F RLHF` 章节。\n    *   推荐阅读：*HybridFlow*, *NeMo-Aligner*, *RLHFuse*。\n*   **关注容错与长尾延迟优化**：\n    找到 `Fault Tolerance \u002F Straggler Mitigation` 章节。\n    *   推荐阅读：*Oobleck*, *FALCON*, *ByteCheckpoint*。\n\n### 2. 获取论文原文\n列表中每个条目都附带了链接（通常为 arXiv PDF、会议主页或代码仓库）。\n*   **操作**：直接点击链接下载 PDF。\n*   **国内访问提示**：arXiv 在国内访问可能不稳定，建议配置科研加速器或使用学校\u002F机构提供的镜像服务下载论文。\n\n
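下面是一个把“按主题查找”与“获取原文”串联起来的最小命令行示例。注意：本仓库本身不包含任何脚本，以下命令只是演示性的组合，假设你已按“方式一”克隆仓库，且系统中装有 grep 与 curl；关键词 straggler 与 arXiv 编号 2410.12588（对应上文提到的 *FALCON*）均为演示用占位，可自行替换。\n\n```bash\n# 进入已克隆的仓库目录\ncd LLMSys-PaperList\n\n# 1) 按关键词在 README 中定位论文条目（-i 忽略大小写，-n 显示行号）\ngrep -in "straggler" README.md\n\n# 2) 从匹配条目中提取 arXiv 链接并去重（条目格式为 [标题](链接)，故排除括号字符即可截出 URL）\ngrep -i "straggler" README.md | grep -oE "https:[^()]*arxiv.org[^()]*" | sort -u\n\n# 3) 下载其中一篇的 PDF（以 FALCON 为例；编号为演示占位）\ncurl -L -o falcon.pdf "https:\u002F\u002Farxiv.org\u002Fpdf\u002F2410.12588"\n```\n\n若 grep 未命中，多半是标题使用了别的说法，换一个同义关键词（如 "fault tolerance"）再试即可。\n\n### 3. 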
追踪最新动态\n该仓库持续更新顶级会议（如 NeurIPS, SOSP, OSDI, MLSys 等）的最新录用论文。\n*   **更新方法**：如果你已克隆仓库，只需在目录下运行：\n    ```bash\n    git pull\n    ```\n    即可同步最新的论文列表。\n\n### 4. 结合代码实践\n部分条目不仅包含论文，还附带了 `[Code]` 标签或指向 GitHub 项目的链接。\n*   **示例**：在 `RLHF` 章节找到 *AReaL* 项目。\n*   **操作**：点击代码链接，按照该项目独立的 README 指示进行环境配置和运行，以复现论文中的系统效果。\n\n---\n**提示**：本列表是进入 LLM 系统领域的绝佳入口，建议结合具体的工程需求（如“如何在千卡集群上优化 MoE 训练”）定向检索相关论文，而非通读全文。","某初创公司的大模型系统工程师正致力于优化千卡集群上的训练效率，急需寻找解决长序列训练负载不均和异构集群调度问题的最新学术方案。\n\n### 没有 LLMSys-PaperList 时\n- **信息检索如大海捞针**：在 arXiv 和各大会议网站手动筛选\"LLM Systems\"相关论文，耗时数天仍难以区分哪些是纯理论推导，哪些是已落地的系统工程实践。\n- **关键技术点遗漏**：容易错过像 `DynaPipe`（动态流水线优化多任务训练）或 `HAP`（异构集群自动程序合成）这类针对特定痛点的最新 SOSP\u002FEuroSys 顶会成果，导致重复造轮子。\n- **缺乏系统化分类**：面对预训练、后训练、推理服务、多模态等混杂的技术栈，难以快速定位到“故障容错”或“边缘端服务”等细分领域的权威资料。\n- **工业界与学术界脱节**：很难同时获取字节跳动 `MegaScale` 万卡训练经验与学术界前沿算法的对比视角，导致技术方案选型缺乏实战数据支撑。\n\n### 使用 LLMSys-PaperList 后\n- **精准直达核心文献**：通过清晰的目录结构，工程师在几分钟内直接定位到\"Fault Tolerance\"和\"Heterogeneous Clusters\"板块，迅速锁定 `ScheMoE` 和 `Alibaba HPN` 等关键论文。\n- **全覆盖技术演进链**：借助从预训练到边缘推理的完整分类，团队快速构建了包含 `Megatron-LM` 基础架构至最新 `C4` 通信优化方案的技术图谱，确保无盲区。\n- **理论与实战双向验证**：并列查看学术界的 `Perseus` 能耗优化研究与工业界的 Llama 3 技术报告，为集群降本能效比提供了兼具创新性与可行性的双重依据。\n- **紧跟前沿会议动态**：直接追踪 NeurIPS 2025 及 MLSys 课程资源，让团队能提前布局下一代混合专家模型（MoE）的分布式训练策略。\n\nLLMSys-PaperList 将原本数周的技术调研工作压缩至数小时，成为大模型系统研发者把握领域脉搏、规避技术弯路的高效导航仪。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAmberLJC_LLMSys-PaperList_d32538d3.png","AmberLJC","Jiachen LIU","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FAmberLJC_dbc0f706.jpg","Building AI Co-Scientist for Everyone.\r\nCS PhD @ UMich | Systems for LLM | Meta (MSL), SJTU, MIT\r\n","Meta","Palo Alto",null,"https:\u002F\u002Famberljc.github.io\u002F","https:\u002F\u002Fgithub.com\u002FAmberLJC",1907,98,"2026-04-08T03:43:53",1,"","未说明",{"notes":88,"python":86,"dependencies":89},"LLMSys-PaperList 不是一个可执行的软件工具或框架，而是一个 curated list（精选列表）仓库，用于收集和大语言模型系统相关的学术论文、文章、教程和项目链接。因此，该仓库本身没有运行环境、GPU、内存或依赖库的需求。用户只需浏览 README 中的链接即可获取相关论文资源。",[],[35,14,91],"其他","2026-03-27T02:49:30.150509","2026-04-08T19:00:22.069319",[95,100,105,109,113,117,122],{"id":96,"question_zh":97,"answer_zh":98,"source_url":99},24953,"如何向项目添加新的论文或博客文章？","用户可以通过创建 GitHub Issue 来请求添加新论文。在 Issue 正文中提供论文的 arXiv 链接或博客 URL，并明确指定希望将其添加到的分类部分（如 'LLM Serving'、'LLM Pretraining' 等）。维护者或自动化机器人（如 @claude）会读取 README.md 了解现有格式，提取标题，并将条目按格式 `[Title](link)` 添加到对应的章节中。","https:\u002F\u002Fgithub.com\u002FAmberLJC\u002FLLMSys-PaperList\u002Fissues\u002F18",{"id":101,"question_zh":102,"answer_zh":103,"source_url":104},24954,"项目中的论文是如何分类和组织的？","论文存储在 `README.md` 文件中，并按照特定的系统类别进行组织。常见的分类包括：LLM Serving（推理服务）、LLM Training（训练）、Agent Systems（代理系统）、Multi-Modal Serving Systems（多模态服务系统）、RAG Systems（检索增强生成系统）以及 Edge serving（边缘服务）等。添加新论文时，必须将其放入最相关的类别部分。","https:\u002F\u002Fgithub.com\u002FAmberLJC\u002FLLMSys-PaperList\u002Fissues\u002F22",{"id":106,"question_zh":107,"answer_zh":108,"source_url":104},24955,"添加论文时需要遵循什么格式规范？","所有论文条目必须严格遵循 Markdown 链接格式：`[论文标题](链接)`。标题应准确反映论文内容，链接通常为 arXiv 摘要页或 PDF 地址。例如：`[BurstEngine: an Efficient Distributed Framework...](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.19836)`。不符合此格式的提交可能会被拒绝或要求修正。",{"id":110,"question_zh":111,"answer_zh":112,"source_url":104},24956,"如何处理来自特定会议（如 SOSP、MLSys）的论文批量添加？","用户可以发起 Issue 请求添加特定会议的论文，需提供会议录用论文列表的官方链接（如 SOSP 2025 accepted papers 页面）。自动化脚本或维护者会筛选出与 MLSys、LLM、GenAI 相关的论文，按类别（如 8 篇推理、6 篇训练等）整理，并批量添加到 README.md 中。如果自动化工具无法访问外部链接，可能需要用户手动提供论文列表。",{"id":114,"question_zh":115,"answer_zh":116,"source_url":104},24957,"如果提供的论文链接无法被自动化工具访问怎么办？","如果自动化机器人（如 Claude 
Code）因权限限制无法抓取外部网页（如会议录用列表），它会通知用户。此时有两种解决方案：1. 授予工具访问外部 URL 的权限；2. 采用手动工作流，即用户自行访问目标网页，复制论文标题和链接列表粘贴到 Issue 评论中，再由机器人或维护者进行处理。",{"id":118,"question_zh":119,"answer_zh":120,"source_url":121},24958,"多模态大模型（MLLM）相关的论文应该放在哪里？","涉及多模态大模型推理和服务的论文（如 HydraInfer, Enabling Disaggregated Multi-Stage MLLM Inference）应添加到 `README.md` 中的 'Multi-Modal Serving Systems'（多模态服务系统）部分，而不是普通的 LLM Serving 部分，以确保分类的准确性。","https:\u002F\u002Fgithub.com\u002FAmberLJC\u002FLLMSys-PaperList\u002Fissues\u002F46",{"id":123,"question_zh":124,"answer_zh":125,"source_url":99},24959,"如何区分并添加关于“非确定性（Nondeterminism）”的技术博客？","技术博客（如 'Defeating Nondeterminism in LLM Inference'）如果包含重要的系统优化见解，也可以被收录。这类资源通常根据其主题内容添加到相应的技术板块，例如关于推理非确定性的博客会被添加到 'LLM Serving' 部分，格式与学术论文类似，但会注明是博客文章。",[]]