[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-OpenBMB--MiniCPM":3,"tool-OpenBMB--MiniCPM":65},[4,17,27,35,48,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",150720,2,"2026-04-11T11:33:10",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":10,"last_commit_at":33,"category_tags":34,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":10,"last_commit_at":41,"category_tags":42,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,"2026-04-10T11:13:16",[26,43,44,45,14,46,15,13,47],"数据工具","视频","插件","其他","音频",{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":54,"last_commit_at":55,"category_tags":56,"status":16},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 
提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[15,43,46],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":54,"last_commit_at":63,"category_tags":64,"status":16},6590,"gpt4all","nomic-ai\u002Fgpt4all","GPT4All 是一款让普通电脑也能轻松运行大型语言模型（LLM）的开源工具。它的核心目标是打破算力壁垒，让用户无需依赖昂贵的显卡（GPU）或云端 API，即可在普通的笔记本电脑和台式机上私密、离线地部署和使用大模型。\n\n对于担心数据隐私、希望完全掌控本地数据的企业用户、研究人员以及技术爱好者来说，GPT4All 提供了理想的解决方案。它解决了传统大模型必须联网调用或需要高端硬件才能运行的痛点，让日常设备也能成为强大的 AI 助手。无论是希望构建本地知识库的开发者，还是单纯想体验私有化 AI 聊天的普通用户，都能从中受益。\n\n技术上，GPT4All 基于高效的 `llama.cpp` 后端，支持多种主流模型架构（包括最新的 DeepSeek R1 蒸馏模型），并采用 GGUF 格式优化推理速度。它不仅提供界面友好的桌面客户端，支持 Windows、macOS 和 Linux 等多平台一键安装，还为开发者提供了便捷的 Python 库，可轻松集成到 LangChain 等生态中。通过简单的下载和配置，用户即可立即开始探索本地大模型的无限可能。",77307,"2026-04-11T06:52:37",[15,13],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":80,"owner_twitter":76,"owner_website":81,"owner_url":82,"languages":83,"stars":96,"forks":97,"last_commit_at":98,"license":99,"difficulty_score":100,"env_os":101,"env_gpu":102,"env_ram":103,"env_deps":104,"category_tags":116,"github_topics":79,"view_count":10,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":117,"updated_at":118,"faqs":119,"releases":149},6603,"OpenBMB\u002FMiniCPM","MiniCPM","MiniCPM4 & MiniCPM4.1: Ultra-Efficient LLMs on End Devices, achieving 3+ generation speedup on reasoning tasks","MiniCPM 是由 OpenBMB 团队打造的一系列超高效大型语言模型，专为在手机、笔记本等终端设备上流畅运行而设计。它主要解决了传统大模型对硬件要求高、推理速度慢的难题，让用户无需依赖昂贵的云端服务器，即可在本地设备享受强大的 AI 能力。\n\n无论是希望在移动端部署 AI 应用的开发者、追求极致效率的研究人员，还是希望保护数据隐私的企业用户，MiniCPM 都是理想选择。其最新发布的 MiniCPM4 及 4.1 系列更是实现了质的飞跃：在保持与更大参数模型相当性能的同时，生成速度提升了 3 倍以上，特别适合处理复杂的逻辑推理任务。\n\nMiniCPM 的核心技术亮点在于其创新的架构设计。例如，MiniCPM-SALA 架构巧妙融合了稀疏注意力与线性注意力机制，能够高效处理百万级 token 的超长上下文；而混合推理模式则让模型既能进行深度思考，也能快速响应简单指令。此外，该系列模型在中文理解、数学计算及代码生成方面表现优异，部分小参数版本甚至能媲美主流的大尺寸开源模型。通过深度的系统级优化，MiniCPM 真正推动了高性能大模型从云端走向端侧的普及。","\u003Cdiv align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_ba52a6d2c048.png\" width=\"500em\" >\u003C\u002Fimg> \n\u003C\u002Fdiv>\n\n\u003Ch4 align=\"center\">\n    \u003Cp>\n        \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM\u002Fblob\u002Fmain\u002FREADME-cn.md\">中文\u003C\u002Fa> | \u003Cb>English\u003C\u002Fb>\n    \u003Cp>\n\u003C\u002Fh4>\n\n\u003Cp align=\"center\">\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.07900\" target=\"_blank\">MiniCPM Paper\u003C\u002Fa> |\n\u003Ca href=\"https:\u002F\u002Fmodelbest.feishu.cn\u002Fwiki\u002FD2tFw8Pcsi5CIzkaHNacLK64npg\" target=\"_blank\">MiniCPM Wiki (in Chinese)\u003C\u002Fa> |\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM-V\u002F\" target=\"_blank\">MiniCPM-V Repo\u003C\u002Fa> |\nJoin our \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002F3cGQn9b3YM\" target=\"_blank\">discord\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM\u002Fblob\u002Fmain\u002Fassets\u002Fwechat.jpg\" 
target=\"_blank\">WeChat\u003C\u002Fa> |\n\u003Ca href=\"https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002FKIhH2nCURBXuFXAtYRpuXg?poc_token=HBIsUWijxino8oJ5s6HcjcfXFRi0Xj2LJlxPYD9c\">Join Us\u003C\u002Fa>\n\u003C\u002Fp>\n\n> [!NOTE]\n> ### 🏆 2026 Sparse Operator Acceleration & Race (SOAR) is Now Live!\n>\n> **The MiniCPM-SALA architecture is just the beginning. Realizing its full potential requires deep system-level synergy and cross-layer compilation optimization.**\n>\n> OpenBMB, in collaboration with **SGLang** and **NVIDIA**, invites global geeks to tackle the limits of **9B-scale, 1M-token inference** on a dedicated **NVIDIA 6000D** environment.\n>\n> * 💰 **Prize Pool:** >$100,000 USD (Top Prize: **$89,000**)\n> * 🚀 **Goal:** Optimize single and multi-batch performance via cross-layer compilation.\n>\n> 👉 **[Learn more and Register](https:\u002F\u002Fsoar.openbmb.cn\u002F)**\n\n## Changelog🔥\n- [2026.02.11] **[MiniCPM-SALA](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-SALA)** is released! This is the first large-scale hybrid model effectively integrating sparse and linear attention for million-token context modeling. 🔥🔥🔥\n- [2025.09.29] **[InfLLM-V2 paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.24663) is released!** We can train a sparse attention model with only 5B long-text tokens. 🔥🔥🔥\n- [2025.09.05] **[MiniCPM4.1 series](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fopenbmb\u002Fminicpm-4-6841ab29d180257e940baa9b)** are released! This series is a hybrid reasoning model with trainable sparse attention, which can be used in both deep reasoning mode and non-reasoning mode. 🔥🔥🔥\n- [2025.06.06] Released [**MiniCPM4**](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fopenbmb\u002Fminicpm-4-6841ab29d180257e940baa9b)! This model achieves ultimate efficiency improvements while maintaining optimal performance at the same scale! It can achieve over 5x generation acceleration on typical end-side chips!\n- [2024.09.05] We release [**MiniCPM3-4B**](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM3-4B)! This model outperforms Phi-3.5-mini-instruct and GPT-3.5-Turbo-0125 and is comparable to several models with 7B-9B parameters like Llama3.1-8B-Instruct, Qwen2-7B-Instruct, and GLM-4-9B-Chat.\n- [2024.07.05] Released [**MiniCPM-S-1B**](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-S-1B-sft)! This model achieves an average sparsity of 87.89% in the FFN layer, reducing FFN FLOPs by 84%, while maintaining downstream task performance.\n- [2024.04.11] Released [**MiniCPM-2B-128k**](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-2B-128k), [**MiniCPM-MoE-8x2B**](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-MoE-8x2B) and [**MiniCPM-1B**](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-1B-sft-bf16)! Click [here](https:\u002F\u002Fopenbmb.vercel.app\u002F) to read our technical blog.\n- [2024.02.01] Released [**MiniCPM-2B**](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-2B-sft-bf16)! 
This model performs similarly to Mistral-7B on public benchmarks (with better performance in Chinese, math, and code abilities) and overall outperforms models like Llama2-13B, MPT-30B, and Falcon-40B.\n\n## Quick Links\n\n- [Changelog🔥](#changelog)\n- [Quick Links](#quick-links)\n- [Model Downloads](#model-downloads)\n- [MiniCPM-SALA](#minicpm-sala)\n- [MiniCPM4 and MiniCPM4.1 Series](#minicpm4-and-minicpm41-series)\n    - [Highlights](#highlights)\n    - [Introduction](#introduction)\n  - [Evaluation Results](#evaluation-results)\n    - [Efficiency Evaluation](#efficiency-evaluation)\n    - [Comprehensive Evaluation](#comprehensive-evaluation)\n    - [Long Text Evaluation](#long-text-evaluation)\n  - [Inference](#inference)\n    - [Hybird Reasoning Mode](#hybird-reasoning-mode)\n    - [HuggingFace](#huggingface)\n    - [vLLM](#vllm)\n      - [Speculative Decoding](#speculative-decoding)\n        - [1. Download MiniCPM4.1 Draft Model](#1-download-minicpm41-draft-model)\n        - [2. Install EAGLE3-Compatible vLLM](#2-install-eagle3-compatible-vllm)\n        - [3. Launch vLLM Server with Speculative Decoding](#3-launch-vllm-server-with-speculative-decoding)\n        - [4. Client Usage Example](#4-client-usage-example)\n        - [vLLM Configuration Parameters](#vllm-configuration-parameters)\n      - [Standard Inference (Without Speculative Decoding)](#standard-inference-without-speculative-decoding)\n    - [SGLang](#sglang)\n      - [Speculative Decoding](#speculative-decoding-1)\n        - [1. Download MiniCPM4.1 Draft Model](#1-download-minicpm41-draft-model-1)\n        - [2. Install EAGLE3-Compatible SGLang](#2-install-eagle3-compatible-sglang)\n        - [3. Launch SGLang Server with Speculative Decoding](#3-launch-sglang-server-with-speculative-decoding)\n        - [4. 
Client Usage](#4-client-usage)\n        - [Configuration Parameters](#configuration-parameters)\n      - [Standard Inference (Without Speculative Decoding)](#standard-inference-without-speculative-decoding-1)\n    - [CPM.cu](#cpmcu)\n    - [llama.cpp and Ollama](#llamacpp-and-ollama)\n      - [llama.cpp](#llamacpp)\n      - [Ollama](#ollama)\n  - [BitCPM4: Quantization](#bitcpm4-quantization)\n    - [BitCPM4 Evaluation](#bitcpm4-evaluation)\n    - [BitCPM4 Inference](#bitcpm4-inference)\n  - [MiniCPM4 Application](#minicpm4-application)\n    - [MiniCPM4-Survey: Trustworthy Survey Generation](#minicpm4-survey-trustworthy-survey-generation)\n      - [Demo and Quick Start](#demo-and-quick-start)\n      - [Performance Evaluation](#performance-evaluation)\n    - [MiniCPM4-MCP: Tool Use with Model Context Protocol](#minicpm4-mcp-tool-use-with-model-context-protocol)\n      - [Demo](#demo)\n      - [Performance Evaluation](#performance-evaluation-1)\n    - [MiniCPM Intel AIPC Client: A New Edge Large Model Powerhouse](#minicpm-intel-aipc-client-a-new-edge-large-model-powerhouse)\n      - [Key Features](#key-features)\n      - [System Requirements](#system-requirements)\n      - [Download](#download)\n- [LICENSE](#license)\n    - [Model LICENSE](#model-license)\n    - [Statement](#statement)\n- [Institutions](#institutions)\n- [Citation](#citation)\n\n\n## Model Downloads\n\n  | HuggingFace | ModelScope |\n  |-------------|------------|\n  | [MiniCPM-SALA](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-SALA) | [MiniCPM-SALA](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM-SALA) |\n  | [MiniCPM4.1-8B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B) | [MiniCPM4.1-8B](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4.1-8B) |\n  | [MiniCPM4.1-8B-GPTQ](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B-GPTQ) | [MiniCPM4.1-8B-GPTQ](https:\u002F\u002Fwww.modelscope.cn\u002Fopenbmb\u002FMiniCPM4.1-8B-GPTQ) | \n  | [MiniCPM4.1-8B-AutoAWQ](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B-AutoAWQ) | [MiniCPM4.1-8B-AutoAWQ](https:\u002F\u002Fwww.modelscope.cn\u002Fopenbmb\u002FMiniCPM4.1-8B-AutoAWQ) | \n  | [MiniCPM-4.1-8B-Marlin](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-4.1-8B-Marlin) | [MiniCPM-4.1-8B-Marlin](https:\u002F\u002Fwww.modelscope.cn\u002Fopenbmb\u002FMiniCPM-4.1-8B-Marlin) | \n  | [MiniCPM4.1-8B-GGUF](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B-GGUF) | [MiniCPM4.1-8B-GGUF](https:\u002F\u002Fwww.modelscope.cn\u002Fopenbmb\u002FMiniCPM4.1-8B-GGUF) | \n  | [MiniCPM4.1-8B-MLX](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B-MLX) | [MiniCPM4.1-8B-MLX](https:\u002F\u002Fwww.modelscope.cn\u002Fopenbmb\u002FMiniCPM4.1-8B-MLX) | \n  | [MiniCPM4.1-8B-Eagle3](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B-Eagle3) | [MiniCPM4.1-8B-Eagle3](https:\u002F\u002Fwww.modelscope.cn\u002Fopenbmb\u002FMiniCPM4.1-8B-Eagle3) | \n  | [MiniCPM4-8B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-8B)    | [MiniCPM4-8B](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-8B) |\n  | [MiniCPM4-0.5B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-0.5B) | [MiniCPM4-0.5B](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-0.5B) |\n  | [BitCPM4-1B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FBitCPM4-1B)        | 
[BitCPM4-1B](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FBitCPM4-1B) |\n  | [BitCPM4-0.5B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FBitCPM4-0.5B)    | [BitCPM4-0.5B](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FBitCPM4-0.5B) |\n  | [MiniCPM4-Survey](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-Survey) | [MiniCPM4-Survey](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-Survey) |\n  | [MiniCPM4-MCP](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-MCP)  | [MiniCPM4-MCP](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-MCP) |\n\n\n\u003Cdetails>\n\u003Csummary>📋 Click to view all MiniCPM series models\u003C\u002Fsummary>\n\n  | HuggingFace | ModelScope |\n  |-------------|------------|\n  | [MiniCPM4-8B-Eagle-FRSpec](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-8B-Eagle-FRSpec) | [MiniCPM4-8B-Eagle-FRSpec](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-8B-Eagle-FRSpec) |\n  | [MiniCPM4-8B-Eagle-FRSpec-QAT](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-8B-Eagle-FRSpec-QAT) | [MiniCPM4-8B-Eagle-FRSpec-QAT](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-8B-Eagle-FRSpec-QAT) |\n  | [MiniCPM4-8B-Eagle-vLLM](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-8B-Eagle-vLLM) | [MiniCPM4-8B-Eagle-vLLM](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-8B-Eagle-vLLM) |\n  | [MiniCPM4-8B-marlin-Eagle-vLLM](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-8B-marlin-Eagle-vLLM) | [MiniCPM4-8B-marlin-Eagle-vLLM](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-8B-marlin-Eagle-vLLM) |\n  | [MiniCPM4-0.5B-QAT-Int4-unquantized](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-0.5B-QAT-Int4-unquantized) | [MiniCPM4-0.5B-QAT-Int4-unquantized](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-0.5B-QAT-Int4-unquantized) |\n  | [MiniCPM4-0.5B-QAT-Int4-GPTQ-format](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-0.5B-QAT-Int4-GPTQ-format) | [MiniCPM4-0.5B-QAT-Int4-GPTQ-format](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-0.5B-QAT-Int4-GPTQ-format) |\n  | [MiniCPM3-4B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM3-4B) | [MiniCPM3-4B](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM3-4B) |\n  | [MiniCPM-2B-sft](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-2B-sft-bf16) | [MiniCPM-2B-sft](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FminiCPM-bf16)|\n  | [MiniCPM-2B-dpo](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-2B-dpo-bf16) | [MiniCPM-2B-dpo](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM-2B-dpo-bf16\u002Fsummary) |\n  | [MiniCPM-2B-128k](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-2B-128k) | [MiniCPM-2B-128k](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenbmb\u002FMiniCPM-2B-128k\u002Fsummary) |\n  | [MiniCPM-MoE-8x2B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-MoE-8x2B) | [MiniCPM-MoE-8x2B](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM-MoE-8x2B) |\n  | [MiniCPM-1B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-1B-sft-bf16) | [MiniCPM-1B](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM-1B-sft-bf16) |\n  | 
[MiniCPM-S-1B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-S-1B-sft) | [MiniCPM-S-1B](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM-S-1B-sft) |\n\u003C\u002Fdetails>\n\n## MiniCPM-SALA\n#### Highlights\n\nMiniCPM-SALA (Sparse Attention and Linear Attention) is the first large-scale hybrid model effectively integrating sparse and linear attention for million-token context modeling\n\n✅ Innovative Hybrid Architecture: Synergizes 25% Sparse Attention (InfLLM-v2) for high-fidelity long context modeling with 75% Linear Attention (Lightning Attention) for global efficiency.\n\n✅ Shattering Efficiency Walls: Breaks the \"Compute Wall\" and the \"Memory Wall,\" achieving 3.5× inference speed and significantly lower KV-cache overhead compared to dense baselines. \n\n✅ Million-Token Context: Empowered by HyPE (Hybrid Positional Embedding), it scales to 1M+ tokens while maintaining strong length generalization. \n\n✅ HALO Adaptation: Utilizes Hybrid Attention via Layer Optimization (HALO), a novel distillation recipe that effectively transfers dense attention capabilities to the hybrid architecture, avoiding the severe performance degradation typical of pure linear models.\n\n#### Introduction\n\nMiniCPM-SALA is an efficient hybrid model in which 25% of the layers adopt [InfLLM-V2](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.24663) and the remaining 75% utilize Lightning Attention. This architecture enables inference of one million tokens on consumer GPUs such as the NVIDIA RTX 5090.\n\n- **SALA Hybrid Attention Mechanism**\n  - Integrates 25% InfLLM-V2 and 75% Lightning Attention, effectively leveraging the granular focus of sparse attention for local details and the high efficiency of linear attention for broad context.\n\n- **Transformer-to-Hybrid Continue Training**\n  - Circumvents the inefficiencies of cold-start training by performing an architectural transformation on the pre-trained weights, thereby reducing the total training budget to approximately 25% relative to training a comparable model from scratch.\n\n- **[HyPE](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.22156) (Hybrid Positional Encoding)**\n  - Harmonizes the performance across both short and long contexts, which can maintain general capabilities (e.g., knowledge, mathematics, and coding) comparable to modern full-attention models like Qwen3-8B and achieve substantial advantages across multiple long-context benchmarks.\n\n- **Efficient Inference on Long Sequences**\n  - Achieves up to 3.5x the inference speed of Qwen3-8B at a sequence length of 256K tokens on A6000D, supports inference at context lengths of up to 1M tokens on both NVIDIA A6000D and 5090 GPUs, whereas Qwen3-8B fails at this length due to out-of-memory (OOM) errors.\n\n### Evaluation Results\n\n#### Efficiency Evaluation\n\nWe benchmarked MiniCPM-SALA (9B) against Qwen3-8B on NVIDIA A6000D and RTX 5090 GPUs to evaluate inference speed and memory efficiency. The results demonstrate a significant performance leap: MiniCPM-SALA not only achieves up to a 2.5x speedup in time-to-first-token (TTFT) but also overcomes the memory bottlenecks of full-attention architectures. 
While Qwen3-8B suffers from OOM errors at extended lengths, MiniCPM-SALA successfully scales to 1M-token contexts on a single consumer-grade RTX 5090, effectively democratizing ultra-long context inference on edge hardware.\n\n![inference_speed_a6000d](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_a0fe509948d6.png)\n\n![inference_speed_5090](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_f873b9c606f6.png)\n\n#### Long-Context Evaluation\n\nMiniCPM-SALA consistently outperforms other open-source LLMs of similar scale across most involved long-context benchmarks. Specifically, it achieves the highest scores in the RULER and NoLiMa tests at all context lengths (up to 128K) and maintains the highest overall average score of 38.97, suggesting superior performance in handling long-context information processing.\n\n![long_text_evaluation](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_e0ff6f36c237.png)\n\n#### Ultra-long Context Evaluation\n\nThe evaluation demonstrates that MiniCPM-SALA exhibits effective length extrapolation capabilities, maintaining a score of 81.6 at a 2048K context length despite being trained on only 520K tokens. The model achieves this without auxiliary techniques like YaRN, likely due to its NoPE configuration in sparse attention layers.\n\n![ultra_long_text_evaluation](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_8ea1f8948235.png)\n\n#### Standard Evaluation\n\nMiniCPM-SALA achieves an average score of 76.53 across standard benchmarks, outperforming comparable models such as Qwen3-8B and Falcon-H1R-7B. The architecture maintains robust performance in Knowledge, Code, and Math.\n\n![benchmark](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_6329cd82b749.png)\n\n### Inference\n\nTo achieve optimal performance, we recommend using the `Temperature=0.9`.\n\n#### HuggingFace\n\nOur model is readily compatible with 🤗 Hugging Face transformers. You can perform inference with our model as follows:\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nmodel_path = \"openbmb\u002FMiniCPM-SALA\"\ntokenizer = AutoTokenizer.from_pretrained(model_path)\nmodel = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, device_map=\"auto\")\nmodel.eval()\n\nprompts = [\"My name is\", \"The capital of China is\"]\nwith torch.no_grad():\n    inputs = tokenizer(prompts, return_tensors=\"pt\").to(model.device)\n    outputs = model.generate(**inputs)\noutput_texts = tokenizer.batch_decode(outputs)\nprint(output_texts)\n```\n\n#### SGLang\n\n##### Requirements\n\n- CUDA 12.x or higher\n- `gcc` \u002F `g++` compiler\n- `uv` package manager (script will check)\n\n##### Installation\n\n```bash\n# Clone repository\ngit clone -b minicpm_sala https:\u002F\u002Fgithub.com\u002FOpenBMB\u002Fsglang.git\ncd sglang\n\n# One-click installation (creates venv and compiles all dependencies)\nbash install_minicpm_sala.sh\n\n# Or specify PyPI mirror\nbash install_minicpm_sala.sh https:\u002F\u002Fmirrors.tuna.tsinghua.edu.cn\u002Fpypi\u002Fweb\u002Fsimple\n```\n\nThe installation script performs the following steps:\n\n1. Creates `sglang_minicpm_sala_env` virtual environment (Python 3.12)\n2. Clones dependencies to `3rdparty\u002F` (infllmv2) and initializes submodules (sparse_kernel)\n3. Installs MiniCPM-SALA (current repo)\n4. Compiles and installs `infllmv2_cuda_impl`\n5. 
Compiles and installs `sparse_kernel`\n6. Installs `tilelang` & `flash-linear-attention`\n\n##### Usage\n\n```bash\n# Activate environment\nsource sglang_minicpm_sala_env\u002Fbin\u002Factivate\n\n# Launch Inference Server (Replace MODEL_PATH with actual path)\nMODEL_PATH=\u002Fpath\u002Fto\u002Fyour\u002FMiniCPM-SALA\n\npython3 -m sglang.launch_server \\\n    --model ${MODEL_PATH} \\\n    --trust-remote-code \\\n    --disable-radix-cache \\\n    --attention-backend minicpm_flashinfer \\\n    --chunked-prefill-size 8192 \\\n    --max-running-requests 32 \\\n    --skip-server-warmup \\\n    --port 31111 \\\n    --dense-as-sparse\n```\n\n| Parameter | Description |\n|-----------|-------------|\n| `--trust-remote-code` | Allow custom code in model |\n| `--disable-radix-cache` | Disable RadixAttention prefix cache |\n| `--attention-backend minicpm_flashinfer` | Use MiniCPM FlashInfer backend |\n| `--chunked-prefill-size 8192` | Chunked prefill size |\n| `--max-running-requests 32` | Max concurrent requests |\n| `--skip-server-warmup` | Skip server warmup |\n| `--port 31111` | Server port |\n| `--dense-as-sparse` | Use dense-as-sparse mode |\n\n##### Manual Installation\n\nIf the script doesn't work for you, follow these steps:\n\n```bash\n# 0. Ensure uv is installed\npip install uv\n\n# 1. Create venv\nuv venv --python 3.12 sglang_minicpm_sala_env\nsource sglang_minicpm_sala_env\u002Fbin\u002Factivate\n\n# 2. Install SGLang\nuv pip install --upgrade pip setuptools wheel\nuv pip install -e .\u002Fpython[all]\n\n# 3. Compile CUDA Extensions\n# (Ensure dependencies are cloned to 3rdparty\u002F)\ncd 3rdparty\u002Finfllmv2_cuda_impl && python setup.py install && cd ..\u002F..\ncd 3rdparty\u002Fsparse_kernel && python setup.py install && cd ..\u002F..\n\n# 4. 
Install extra deps\nuv pip install tilelang flash-linear-attention\n```\n\n##### Q&A\n\n**Q: CUDA extension compilation failed?**\n\n- Ensure CUDA 12+ is installed (`nvcc --version`).\n- Ensure `gcc` \u002F `g++` are available.\n- If `CXX` is set to `clang++ -pthread`, manually `export CXX=g++`.\n\n## MiniCPM4 and MiniCPM4.1 Series\n\n\u003Cdiv align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=VouXjLHKDUY\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_6ae0f6f9cd42.jpg\", width=70%>\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n#### Highlights\nMiniCPM 4.1-8B is the first open-source reasoning LLM with trainable sparse attention:\n\n✅ Strong Reasoning Capability: Surpasses similar-sized models on 15 tasks!\n\n✅ Fast Generation: 3x decoding speedup for reasoning\n\n✅ Efficient Architecture: Trainable sparse attention, frequency-ranked speculative decoding\n\n#### Introduction\nMiniCPM4 and MiniCPM4.1 series are highly efficient large language models (LLMs) designed explicitly for end-side devices, which achieves this efficiency through systematic innovation in four key dimensions: model architecture, training data, training algorithms, and inference systems.\n\n- 🏗️ **Efficient Model Architecture:**\n  - InfLLM-V2 -- Trainable Sparse Attention Mechanism: Adopts a trainable sparse attention mechanism architecture where each token only needs to compute relevance with less than 5% of tokens in 128K long text processing, significantly reducing computational overhead for long texts ([InfLLM-V2 Training Kernels](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002Finfllmv2_cuda_impl))\n\n- 🧠 **Efficient Learning Algorithms:**\n  - Model Wind Tunnel 2.0 -- Efficient Predictable Scaling: Introduces scaling prediction methods for performance of downstream tasks, enabling more precise model training configuration search\n  - BitCPM -- Ultimate Ternary Quantization: Compresses model parameter bit-width to 3 values, achieving 90% extreme model bit-width reduction\n  - Efficient Training Engineering Optimization: Adopts FP8 low-precision computing technology combined with Multi-token Prediction training strategy\n\n- 📚 **High-Quality Training Data:**\n  - UltraClean -- High-quality Pre-training Data Filtering and Generation: Builds iterative data cleaning strategies based on efficient data verification, open-sourcing high-quality Chinese and English pre-training dataset [UltraFinweb](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopenbmb\u002FUltra-FineWeb)\n  - UltraChat v2 -- High-quality Supervised Fine-tuning Data Generation: Constructs large-scale high-quality supervised fine-tuning datasets covering multiple dimensions including knowledge-intensive data, reasoning-intensive data, instruction-following data, long text understanding data, and tool calling data\n\n- ⚡ **Efficient Inference and Deployment System:**\n  - CPM.cu -- Lightweight and Efficient CUDA Inference Framework: Integrates sparse attention, model quantization, and speculative sampling to achieve efficient prefilling and decoding ([Inference Kernels and Framework](https:\u002F\u002Fgithub.com\u002Fopenbmb\u002Fcpm.cu))\n  - ArkInfer -- Cross-platform Deployment System: Supports efficient deployment across multiple backend environments, providing flexible cross-platform adaptation capabilities\n\n### Evaluation Results\n\n#### Efficiency Evaluation\nOn two typical end-side chips, Jetson AGX Orin and RTX 4090, MiniCPM4 and MiniCPM4.1 demonstrate significantly faster 
processing speed compared to similar-size models in long text processing tasks. As text length increases, MiniCPM4 and MiniCPM4.1's efficiency advantage becomes more pronounced. On the Jetson AGX Orin platform, compared to Qwen3-8B, MiniCPM4 and MiniCPM4.1 achieves approximately 7x decoding speed improvement.\n\n![benchmark](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_f81787316507.png)\n\nMiniCPM4.1 achieves 3x decoding speed improvement in reasoning.\n\n![benchmark](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_2f353ff93814.png)\n\n#### Comprehensive Evaluation\nMiniCPM4 launches end-side versions with 8B and 0.5B parameter scales, both achieving best-in-class performance in their respective categories.\n\n![benchmark](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_2d05de343736.png)\n\nMiniCPM4.1 launches end-side versions with 8B parameter scale, achieving best-in-class performance in deep reasoning mode.\n\n![benchmark](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_200804218fab.png)\n\n#### Long Text Evaluation\nMiniCPM4 is pre-trained on 32K long texts and achieves length extension through YaRN technology. In the 128K long text needle-in-a-haystack task, MiniCPM4 demonstrates outstanding performance. MiniCPM4.1 is pre-trained on 64K long texts and achieves length extension through YaRN technology. In the 128K long text needle-in-a-haystack task, MiniCPM4.1 demonstrates outstanding performance.\n\n![long-niah](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_12a1fce44bc4.png)\n\n### Inference\nMiniCPM 4.1 can be used with following frameworks: Huggingface Transformers, SGLang, vLLM, and CPM.cu. For the ultimate inference speed, we highly recommend CPM.cu.\n\nMiniCPM4\u002FMiniCPM4.1 supports both dense attention inference and sparse attention inference modes, where vLLM and SGLang currently only support dense inference mode. If you want to use sparse inference mode, please use Huggingface Transformers and CPM.cu.\n\n- Dense attention inference: vLLM, SGLang, Huggingface Transformers\n- Sparse attention inference: Huggingface Transformers, CPM.cu\n\n#### Hybird Reasoning Mode\n\nMiniCPM4.1 supports hybrid reasoning mode, which can be used in both deep reasoning mode and non-reasoning mode. To enable hybrid reasoning mode. User can set `enable_thinking=True` in `tokenizer.apply_chat_template` to enable hybrid reasoning mode, and set `enable_thinking=False` to enable non-reasoning mode. Similarly, user can directly add `\u002Fno_think` at the end of the query to enable non-reasoning mode. 
If not add any special token or add `\u002Fthink` at the end of the query, the model will enable reasoning mode.\n\n```python\n# Enable reasoning mode\nprompt_text = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True,\n    enable_thinking=True\n)\n# Enable non-reasoning mode\nprompt_text = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True,\n    enable_thinking=False\n)\n```\n\n#### HuggingFace\n\n- **Inference with Dense Attention**\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nimport torch\ntorch.manual_seed(0)\n\npath = 'openbmb\u002FMiniCPM4.1-8B'\ndevice = \"cuda\"\ntokenizer = AutoTokenizer.from_pretrained(path)\nmodel = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)\n\n# User can directly use the chat interface\n# responds, history = model.chat(tokenizer, \"Write an article about Artificial Intelligence.\", temperature=0.7, top_p=0.7)\n# print(responds)\n\n# User can also use the generate interface\nmessages = [\n    {\"role\": \"user\", \"content\": \"Write an article about Artificial Intelligence.\"},\n]\nprompt_text = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True,\n)\nmodel_inputs = tokenizer([prompt_text], return_tensors=\"pt\").to(device)\n\nmodel_outputs = model.generate(\n    **model_inputs,\n    max_new_tokens=32768,\n    top_p=0.95,\n    temperature=0.6\n)\noutput_token_ids = [\n    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs['input_ids']))\n]\n\nresponses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]\nprint(responses)\n```\n\n- **Inference with Sparse Attention**\nThis model supports InfLLM v2, a sparse attention mechanism designed for efficient long-sequence inference. It requires the [infllmv2_cuda_impl](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002Finfllmv2_cuda_impl) library.\n\nYou can install it by running the following command:\n\n```bash\ngit clone -b feature_infer https:\u002F\u002Fgithub.com\u002FOpenBMB\u002Finfllmv2_cuda_impl.git\ncd infllmv2_cuda_impl\ngit submodule update --init --recursive\npip install -e . # or python setup.py install \n```\n\nTo enable InfLLM v2, you need to add the `sparse_config` field in `config.json`:\n\n```json\n{\n    ...,\n    \"sparse_config\": {\n        \"kernel_size\": 32,\n        \"kernel_stride\": 16,\n        \"init_blocks\": 1,\n        \"block_size\": 64,\n        \"window_size\": 2048,\n        \"topk\": 64,\n        \"use_nope\": false,\n        \"dense_len\": 8192\n    }\n}\n```\n\nThese parameters control the behavior of InfLLM v2:\n\n* `kernel_size` (default: 32): The size of semantic kernels.\n* `kernel_stride` (default: 16): The stride between adjacent kernels.\n* `init_blocks` (default: 1): The number of initial blocks that every query token attends to. This ensures attention to the beginning of the sequence.\n* `block_size` (default: 64): The block size for key-value blocks.\n* `window_size` (default: 2048): The size of the local sliding window. 
\n* `topk` (default: 64): The specifies that each token computes attention with only the top-k most relevant key-value blocks.\n* `use_nope` (default: false): Whether to use the NOPE technique in block selection for improved performance.\n* `dense_len` (default: 8192): Since Sparse Attention offers limited benefits for short sequences, the model can use standard (dense) attention for shorter texts. The model will use dense attention for sequences with a token length below `dense_len` and switch to sparse attention for sequences exceeding this length. Set this to `-1` to always use sparse attention regardless of sequence length.\n\n- **Long Context Extension**\nMiniCPM4.1 natively supports context lengths of up to 65,536(64k) tokens. For conversations where the total length (including both input and output) significantly exceeds this limit, we recommend using RoPE scaling techniques for effective handling of long texts. We have validated the model's performance on context lengths of up to 131,072 tokens by modifying the LongRoPE factor.\n\nYou can apply the LongRoPE factor modification by modifying the model files. Specifically, in the `config.json` file, adjust the `rope_scaling` fields.\n\n```json\n{\n    ...,\n    \"rope_scaling\": {\n        \"rope_type\": \"longrope\", \n        \"long_factor\": [0.9982316082870437, 1.033048153422584, 1.0749920956484724, 1.1255096879436193, 1.1863348602111476, 1.259543828902579, 1.3476188888731149, 1.4535223827776373, 1.5807816745852985, 1.7335856049489526, 1.9168922912975785, 2.1365471404135326, 2.3994084200118646, 2.713475511863602, 3.0880118452194134, 3.533650295140154, 4.062463396503134, 4.687974098908333, 5.425075306704039, 6.289818967956352, 7.29902962722721, 8.6357018163639, 10.210822723989212, 12.053807765671676, 14.193944598909404, 16.65780676784363, 19.463620727694074, 22.628311203524586, 26.150106147261315, 30.02526691405111, 34.23183327975347, 38.73811934094828, 43.502489489729555, 48.47627117965394, 53.61139491762471, 58.857366522037935, 64.16798299215064, 69.51359464319125, 74.86555458220285, 80.21497790341579, 85.55322183307433, 90.89611806932027, 96.26245306514224, 101.68269304046481, 107.18619510219668, 112.82253283014026, 118.63764063163615, 119.88866203644656, 120.9462882391725, 121.837565139014, 122.58663780572562, 123.2147719894291, 123.74049454862576, 124.17980424685767, 124.54641761955492, 124.85202548028222, 125.10654406389756, 125.31835105170659, 125.49450117164764, 125.64091910903052, 125.76256945356558, 125.86360463815589, 125.94749252260765, 126.01712561287873],\n        \"short_factor\": [0.9982316082870437, 1.033048153422584, 1.0749920956484724, 1.1255096879436193, 1.1863348602111476, 1.259543828902579, 1.3476188888731149, 1.4535223827776373, 1.5807816745852985, 1.7335856049489526, 1.9168922912975785, 2.1365471404135326, 2.3994084200118646, 2.713475511863602, 3.0880118452194134, 3.533650295140154, 4.062463396503134, 4.687974098908333, 5.425075306704039, 6.289818967956352, 7.29902962722721, 8.6357018163639, 10.210822723989212, 12.053807765671676, 14.193944598909404, 16.65780676784363, 19.463620727694074, 22.628311203524586, 26.150106147261315, 30.02526691405111, 34.23183327975347, 38.73811934094828, 43.502489489729555, 48.47627117965394, 53.61139491762471, 58.857366522037935, 64.16798299215064, 69.51359464319125, 74.86555458220285, 80.21497790341579, 85.55322183307433, 90.89611806932027, 96.26245306514224, 101.68269304046481, 107.18619510219668, 112.82253283014026, 118.63764063163615, 119.88866203644656, 120.9462882391725, 
121.837565139014, 122.58663780572562, 123.2147719894291, 123.74049454862576, 124.17980424685767, 124.54641761955492, 124.85202548028222, 125.10654406389756, 125.31835105170659, 125.49450117164764, 125.64091910903052, 125.76256945356558, 125.86360463815589, 125.94749252260765, 126.01712561287873],\n        \"original_max_position_embeddings\": 65536\n    }\n}\n```\n\n#### vLLM\n\n##### Speculative Decoding\n\nFor accelerated inference with speculative decoding using vLLM, follow these steps:\n\n###### 1. Download MiniCPM4.1 Draft Model\n\nFirst, download the MiniCPM4.1 draft model:\n\n```bash\ncd \u002Fyour_path\ngit clone https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B-Eagle3\n```\n\n###### 2. Install EAGLE3-Compatible vLLM\n\nThe EAGLE3 vLLM PR has been submitted. For now, use our repository for installation:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FLDLINGLINGLING\u002Fvllm.git\ncd vllm \npip install -e .\n```\n\n###### 3. Launch vLLM Server with Speculative Decoding\n\nStart the vLLM inference server with speculative decoding enabled. Make sure to update the model path in the speculative-config to point to your downloaded MiniCPM4_1-8B-Eagle3-bf16 folder:\n\n```bash\nVLLM_USE_V1=1 \\\nvllm serve openbmb\u002FMiniCPM4.1-8B \\\n--seed 42 \\\n--trust-remote-code \\\n--speculative-config '{\n  \"model\": \"your\u002Fpath\u002FMiniCPM4_1-8B-Eagle3-bf16\",\n  \"num_speculative_tokens\": 3,\n  \"method\": \"eagle3\",\n  \"draft_tensor_parallel_size\": 1\n}'\n```\n\n###### 4. Client Usage Example\n\nThe client usage remains the same for both standard and speculative decoding:\n\n```python\nimport openai\n\nclient = openai.Client(base_url=\"http:\u002F\u002Flocalhost:8000\u002Fv1\", api_key=\"EMPTY\")\n\nresponse = client.chat.completions.create(\n    model=\"openbmb\u002FMiniCPM4.1-8B\",\n    messages=[\n        {\"role\": \"user\", \"content\": \"Write an article about Artificial Intelligence.\"},\n    ],\n    temperature=0.6,\n    max_tokens=32768,\n    extra_body=dict(add_special_tokens=True),  # Ensures special tokens are added for chat template\n    \n)\n\nprint(response.choices[0].message.content)\n```\n\n###### vLLM Configuration Parameters\n\n- `VLLM_USE_V1=1`: Enables vLLM v1 API\n- `--speculative-config`: JSON configuration for speculative decoding\n  - `model`: Path to the draft model for speculation\n  - `num_speculative_tokens`: Number of speculative tokens (default: 3)\n  - `method`: Speculative decoding method (eagle3)\n  - `draft_tensor_parallel_size`: Tensor parallel size for draft model (default: 1)\n- `--seed`: Random seed for reproducibility\n- `--trust-remote-code`: Allow execution of remote code for custom models\n\n##### Standard Inference (Without Speculative Decoding)\n\nFor now, you need to install the latest version of vLLM.\n\n```bash\npip install -U vllm \\\n    --pre \\\n    --extra-index-url https:\u002F\u002Fwheels.vllm.ai\u002Fnightly\n```\n\nThen you can inference MiniCPM4.1-8B with vLLM:\n```python\nfrom transformers import AutoTokenizer\nfrom vllm import LLM, SamplingParams\n\nmodel_name = \"openbmb\u002FMiniCPM4.1-8B\"\nprompt = [{\"role\": \"user\", \"content\": \"Write an article about Artificial Intelligence.\"}]\n\ntokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\ninput_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)\n\nllm = LLM(\n    model=model_name,\n    trust_remote_code=True,\n    max_num_batched_tokens=65536,\n    dtype=\"bfloat16\", \n    
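# gpu_memory_utilization limits the fraction of GPU memory vLLM may reserve for weights and KV cache; lower it if the GPU is shared with other processes.\n    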
gpu_memory_utilization=0.8, \n)\nsampling_params = SamplingParams(top_p=0.95, temperature=0.6, max_tokens=32768)\n\noutputs = llm.generate(prompts=input_text, sampling_params=sampling_params)\n\nprint(outputs[0].outputs[0].text)\n```\n\nAlso, you can start the inference server by running the following command:\n> **Note**: In vLLM's chat API, `add_special_tokens` is `False` by default. This means important special tokens—such as the beginning-of-sequence (BOS) token—will not be added automatically. To ensure the input prompt is correctly formatted for the model, you should explicitly set `extra_body={\"add_special_tokens\": True}`.\n\n```bash\nvllm serve openbmb\u002FMiniCPM4.1-8B --trust-remote-code\n```\n\nThen you can use the chat interface by running the following code:\n\n```python\nimport openai\n\nclient = openai.Client(base_url=\"http:\u002F\u002Flocalhost:8000\u002Fv1\", api_key=\"EMPTY\")\n\nresponse = client.chat.completions.create(\n    model=\"openbmb\u002FMiniCPM4.1-8B\",\n    messages=[\n        {\"role\": \"user\", \"content\": \"Write an article about Artificial Intelligence.\"},\n    ],\n    temperature=0.6,\n    max_tokens=32768,\n    extra_body=dict(add_special_tokens=True),  # Ensures special tokens are added for chat template\n)\n\nprint(response.choices[0].message.content)\n```\n\n#### SGLang\n\n##### Speculative Decoding\n\nFor accelerated inference with speculative decoding, follow these steps:\n\n###### 1. Download MiniCPM4.1 Draft Model\n\nFirst, download the MiniCPM4.1 draft model:\n\n```bash\ncd \u002Fyour_path\ngit clone https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B-Eagle3\n```\n\n###### 2. Install EAGLE3-Compatible SGLang\n\nThe EAGLE3 adaptation PR has been submitted. For now, use our repository for installation:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FLDLINGLINGLING\u002Fsglang.git\ncd sglang\npip install -e .\n```\n\n###### 3. Launch SGLang Server with Speculative Decoding\n\nStart the SGLang server with speculative decoding enabled:\n\n```bash\npython -m sglang.launch_server \\\n  --model-path \"openbmb\u002FMiniCPM4.1-8B\" \\\n  --host \"127.0.0.1\" \\\n  --port 30002 \\\n  --mem-fraction-static 0.9 \\\n  --speculative-algorithm EAGLE3 \\\n  --speculative-draft-model-path \"your\u002Fpath\u002FMiniCPM4_1-8B-Eagle3-bf16\" \\\n  --speculative-num-steps 3 \\\n  --speculative-eagle-topk 1 \\\n  --speculative-num-draft-tokens 32 \\\n  --temperature 0.7\n```\n\n###### 4. 
Client Usage\n\nThe client usage remains the same for both standard and speculative decoding:\n\n```python\nimport openai\n\nclient = openai.Client(base_url=f\"http:\u002F\u002Flocalhost:30002\u002Fv1\", api_key=\"None\")\n\nresponse = client.chat.completions.create(\n    model=\"openbmb\u002FMiniCPM4.1-8B\",\n    messages=[\n        {\"role\": \"user\", \"content\": \"Write an article about Artificial Intelligence.\"},\n    ],\n    temperature=0.6,\n    max_tokens=32768,\n)\n\nprint(response.choices[0].message.content)\n```\n\nNote: Make sure to update the port number in the client code to match the server port (30002 in the speculative decoding example).\n\n###### Configuration Parameters\n\n- `--speculative-algorithm EAGLE3`: Enables EAGLE3 speculative decoding\n- `--speculative-draft-model-path`: Path to the draft model for speculation\n- `--speculative-num-steps`: Number of speculative steps (default: 3)\n- `--speculative-eagle-topk`: Top-k parameter for EAGLE (default: 1)\n- `--speculative-num-draft-tokens`: Number of draft tokens (default: 32)\n- `--mem-fraction-static`: Memory fraction for static allocation (default: 0.9)\n\n##### Standard Inference (Without Speculative Decoding)\n\nFor now, you need to install our forked version of SGLang.\n\n```bash\ngit clone -b openbmb https:\u002F\u002Fgithub.com\u002FOpenBMB\u002Fsglang.git\ncd sglang\n\npip install --upgrade pip\npip install -e \"python[all]\"\n```\n\nYou can start the inference server by running the following command:\n\n```bash\npython -m sglang.launch_server --model openbmb\u002FMiniCPM4.1-8B --trust-remote-code --port 30000 --chat-template chatml\n```\n\nThen you can use the chat interface by running the following command:\n\n```python\nimport openai\n\nclient = openai.Client(base_url=f\"http:\u002F\u002Flocalhost:30000\u002Fv1\", api_key=\"None\")\n\nresponse = client.chat.completions.create(\n    model=\"openbmb\u002FMiniCPM4.1-8B\",\n    messages=[\n        {\"role\": \"user\", \"content\": \"Write an article about Artificial Intelligence.\"},\n    ],\n    temperature=0.6,\n    max_tokens=32768,\n)\n\nprint(response.choices[0].message.content)\n```\n\n\n#### CPM.cu\n\nWe **recommend** using [CPM.cu](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FCPM.cu) for the inference of MiniCPM4 and MiniCPM4.1. 
CPM.cu is a CUDA inference framework developed by OpenBMB, which integrates efficient sparse, speculative sampling, and quantization techniques, fully leveraging the efficiency advantages of MiniCPM4 and MiniCPM4.1.\n\nYou can install CPM.cu by running the following command:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FCPM.cu.git --recursive\ncd CPM.cu\npython3 setup.py install\n```\n\nYou can run the following command to test the speed of the model.\n\n```bash\npython3 tests\u002Flong_prompt_gen.py # generate prompt.txt\npython3 tests\u002Ftest_generate.py --prompt-file prompt.txt\n```\n\nYou can run the following command to infer with EAGLE3 speculative decoding algorithm.\n\n```bash\npython3 -m cpmcu.cli \\\n    --model-path $BASE_MODEL_PATH \\\n    --draft-model-path $EAGLE3_DRAFT_MODEL_PATH \\\n    --prompt-text \"Tell me about Tsinghua University\" \\\n    --use-eagle3 true\n```\n\nFor more details about CPM.cu, please refer to the repo of [CPM.cu](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FCPM.cu).\n\n\n#### llama.cpp and Ollama\n\nWe also support inference with [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggml-org\u002Fllama.cpp) and [Ollama](https:\u002F\u002Follama.com\u002F).\n\n##### llama.cpp\n\nYou can download the GGUF format of MiniCPM4.1-8B model from [huggingface](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B-GGUF) and run it with llama.cpp for efficient CPU or GPU inference.\n```\n# case 1: main-cli\n.\u002Fbuild\u002Fbin\u002Fllama-cli -m MiniCPM4.1-8B-Q4_K_M.gguf -p \"Write an article about Artificial Intelligence.\" -n 1500\n\n# case 2: server\n## launch server\n.\u002Fbuild\u002Fbin\u002Fllama-server -m MiniCPM4.1-8B-Q4_K_M.gguf --host 127.0.0.1 --port 8080 -c 4096 -fa on &\n\n## send request\ncurl -X POST http:\u002F\u002F127.0.0.1:8080\u002Fv1\u002Fchat\u002Fcompletions \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"gpt-3.5-turbo\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"Write an article about Artificial Intelligence.\"}],\n    \"max_tokens\": 1500\n  }'\n```\n\n##### Ollama\nPlease refer to [model hub](https:\u002F\u002Follama.com\u002Fopenbmb\u002Fminicpm4.1) for model download. After installing ollama package, you can use MiniCPM4.1 with following commands:\n```\nollama run openbmb\u002Fminicpm4.1\n```\n\n### BitCPM4: Quantization\n\nBitCPM4 are ternary quantized models derived from the MiniCPM series models through quantization-aware training (QAT), achieving significant improvements in both training efficiency and model parameter efficiency.\n- Improvements of the training method\n  - Searching hyperparameters with a wind-tunnel on a small model.\n  - Using a two-stage training method: training in high-precision first and then QAT, making the best of the trained high-precision models and significantly reducing the computational resources required for the QAT phase.\n- High parameter efficiency\n  - Achieving comparable performance to full-precision models of similar parameter models with a bit width of only 1.58 bits, demonstrating high parameter efficiency. 
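\n\nAs a rough illustration of the idea (a generic absmean ternary sketch, not necessarily the exact BitCPM4 QAT recipe), ternary weights can be viewed as values in {-1, 0, +1} multiplied by a per-tensor scale:\n\n```python\nimport torch\n\ndef ternary_fake_quant(w: torch.Tensor) -> torch.Tensor:\n    # Illustrative absmean ternary quantization (~1.58 bits per weight).\n    # Returns a fake-quantized tensor: scale * {-1, 0, +1}, same shape as w.\n    scale = w.abs().mean()\n    q = torch.clamp(torch.round(w \u002F (scale + 1e-8)), -1, 1)\n    return q * scale\n\nw = torch.randn(4, 4)\nprint(ternary_fake_quant(w))\n```\n\nIn this fake-quantized form the ternary values are materialized back into ordinary floating-point tensors, matching the storage format described under BitCPM4 Inference below.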
\n\n#### BitCPM4 Evaluation\n\nBitCPM4's performance is comparable with other full-precision models in same model size.\n![bitcpm-benchmark](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_18dd13583f1a.png)\n\n#### BitCPM4 Inference\n\nBitCPM4's parameters are stored in a fake-quantized format, which supports direct inference within the Huggingface framework.\n\n### MiniCPM4 Application\n\u003Cdetails>\n\u003Csummary>Click to view details about MiniCPM4 Application\u003C\u002Fsummary>\n\n#### MiniCPM4-Survey: Trustworthy Survey Generation\n\n**MiniCPM4-Survey** is an open-source LLM agent model jointly developed by [THUNLP](https:\u002F\u002Fnlp.csai.tsinghua.edu.cn), Renmin University of China and [ModelBest](https:\u002F\u002Fmodelbest.cn\u002Fen). Built on MiniCPM4-8B, it accepts users' quiries as input and autonomously generate trustworthy, long-form survey papers.\n\nKey features include:\n\n- **Plan-Retrieve-Write Survey Generation Framework** — We propose a multi-agent generation framework, which operates through three core stages: planning (defining the overall structure of the survey), retrieval (generating appropriate retrieval keywords), and writing (synthesizing the retrieved information to generate coherent section-level content).\n\n- **High-Quality Dataset Construction** — We gather and process lots of expert-written survey papers to construct a high-quality training dataset. Meanwhile, we collect a large number of research papers to build a retrieval database.\n\n- **Multi-Aspect Reward Design** — We carefully design a reward system with three aspects (structure, content, and citations) to evaluate the quality of the surveys, which is used as the reward function in the RL training stage.\n\n- **Multi-Step RL Training Strategy** — We propose a *Context Manager* to ensure retention of essential information while facilitating efficient reasoning, and we construct *Parallel Environment* to maintain efficient RL training cycles.\n\n##### Demo and Quick Start\n\nSee [here](.\u002Fdemo\u002Fminicpm4\u002FSurveyGeneration\u002FREADME.md)\n\n##### Performance Evaluation\n\n| Method                                      | Relevance | Coverage | Depth | Novelty | Avg.  | Fact Score |\n|---------------------------------------------|-----------|----------|-------|---------|-------|------------|\n| Naive RAG (driven by G2FT)                  | 3.25      | 2.95     | 3.35  | 2.60    | 3.04  | 43.68      |\n| AutoSurvey (driven by G2FT)                 | 3.10      | 3.25     | 3.15  | **3.15**| 3.16  | 46.56      |\n| Webthinker (driven by WTR1-7B)              | 3.30      | 3.00     | 2.75  | 2.50    | 2.89  | --         |\n| Webthinker (driven by QwQ-32B)              | 3.40      | 3.30     | 3.30  | 2.50    | 3.13  | --         |\n| OpenAI Deep Research (driven by GPT-4o)     | 3.50      |**3.95**  | 3.55  | 3.00    | **3.50**  | --         |\n| MiniCPM-4-Survey                            | 3.45      | 3.70     | **3.85** | 3.00    | **3.50**  | **68.73**  |\n| &nbsp;&nbsp;&nbsp;*w\u002Fo* RL                  | **3.55**  | 3.35     | 3.30  | 2.25    | 3.11  | 50.24      |\n\n*Performance comparison of the survey generation systems. \"G2FT\" stands for Gemini-2.0-Flash-Thinking, and \"WTR1-7B\" denotes Webthinker-R1-7B. 
FactScore evaluation was omitted for Webthinker, as it does not include citation functionality, and for OpenAI Deep Research, which does not provide citations when exporting the results.*\n\n#### MiniCPM4-MCP: Tool Use with Model Context Protocol\n\n**MiniCPM4-MCP** is an open-source on-device LLM agent model jointly developed by [THUNLP](https:\u002F\u002Fnlp.csai.tsinghua.edu.cn), Renmin University of China and [ModelBest](https:\u002F\u002Fmodelbest.cn\u002Fen), built on [MiniCPM-4](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-8B) with 8 billion parameters. It is capable of solving a wide range of real-world tasks by interacting with various tool and data resources through MCP. As of now, MiniCPM4-MCP supports the following:\n\n- Utilization of tools across 16 MCP servers: These servers span various categories, including office, lifestyle, communication, information, and work management.\n\n- Single-tool-calling capability: It can perform single- or multi-step tool calls using a single tool that complies with the MCP.\n\n- Cross-tool-calling capability: It can perform single- or multi-step tool calls using different tools that complies with the MCP.\n\n##### Demo\n\nDemo is available in this [link](.\u002Fdemo\u002Fminicpm4\u002FMCP\u002FREADME_en.md).\n\n##### Performance Evaluation\n\n| MCP Server                  |          | gpt-4o             |              |          | qwen3             |              |      |      minicpm4         |              |\n|-----------------------|----------------|--------------|--------------|---------------|--------------|--------------|----------------|--------------|--------------|\n|                       | func           | param        | value        | func          | param        | value        | func           | param        | value        |\n| Airbnb                | 89.3           | 67.9         | 53.6         | 92.8          | 60.7         | 50.0         | 96.4           | 67.9         | 50.0         |\n| Amap-Maps             | 79.8           | 77.5         | 50.0         | 74.4          | 72.0         | 41.0         | 89.3           | 85.7         | 39.9         |\n| Arxiv-MCP-Server      | 85.7           | 85.7         | 85.7         | 81.8          | 54.5         | 50.0         | 57.1           | 57.1         | 52.4         |\n| Calculator            | 100.0          | 100.0        | 20.0         | 80.0          | 80.0         | 13.3         | 100.0          | 100.0        | 6.67         |\n| Computor-Control-MCP  | 90.0           | 90.0         | 90.0         | 90.0          | 90.0         | 90.0         | 90.0           | 90.0         | 86.7         |\n| Desktop-Commander     | 100.0          | 100.0        | 100.0        | 100.0         | 100.0        | 100.0        | 100.0          | 100.0        | 100.0        |\n| Filesystem            | 63.5           | 63.5         | 31.3         | 69.7          | 69.7         | 26.0         | 83.3           | 83.3         | 42.7         |\n|Github | 92.0 | 80.0 | 58.0 | 80.5 | 50.0 | 27.7 | 62.8 | 25.7 | 17.1 |\n| Gaode                 | 71.1           | 55.6         | 17.8         | 68.8          | 46.6         | 24.4         | 68.9           | 46.7         | 15.6         |\n| MCP-Code-Executor     | 85.0           | 80.0         | 70.0         | 80.0          | 80.0         | 70.0         | 90.0           | 90.0         | 65.0         |\n| MCP-Docx              | 95.8           | 86.7         | 67.1         | 94.9          | 81.6         | 60.1         | 95.1           | 86.6      
   | 76.1         |\n| PPT                   | 72.6           | 49.8         | 40.9         | 85.9          | 50.7         | 37.5         | 91.2           | 72.1         | 56.7         |\n| PPTx                  | 64.2           | 53.7         | 13.4         | 91.0          | 68.6         | 20.9         | 91.0           | 58.2         | 26.9         |\n| Simple-Time-Server    | 90.0           | 70.0         | 70.0         | 90.0          | 90.0         | 90.0         | 90.0           | 60.0         | 60.0         |\n| Slack                 | 100.0          | 90.0         | 70.0         | 100.0         | 100.0        | 65.0         | 100.0          | 100.0        | 100.0        |\n| Whisper               | 90.0           | 90.0         | 90.0         | 90.0          | 90.0         | 90.0         | 90.0           | 90.0         | 30.0         |\n| **Average**              | **80.2**       | **70.2**     | **49.1**     | **83.5**      | **67.7**     | **43.8**     | **88.3**       | **76.1**     | **51.2**     |\n\n#### MiniCPM Intel AIPC Client: A New Edge Large Model Powerhouse  \n\nDeveloped in collaboration between Mianbi Intelligence and Intel, the MiniCPM Intel AIPC Client is an edge large model client specially designed for devices equipped with Intel Core Ultra series processors. It delivers a low-latency, high-efficiency, and privacy-preserving local large model experience for developers, researchers, and AI enthusiasts. Its core features include:  \n\n##### Key Features  \n- Deep Intel Hardware Adaptation  \nFully compatible with Intel Core Ultra series processors, enabling deep integration with hardware to unleash peak performance. Users can run large models smoothly on local devices without relying on cloud services.  \n\n- Extreme Optimization Based on OpenVINO  \nDeeply optimized with the OpenVINO inference framework, it significantly boosts inference efficiency, reaching up to **80 tokens per second**. This ensures rapid model response for both quick queries and complex task processing.  \n\n- Privacy and Security Assurance  \nAdopting local deployment, all data processing is completed on the device, eliminating privacy risks from cloud uploads. This provides users with peace of mind, especially for scenarios with high data privacy requirements.  \n\n- Catering to Diverse User Groups  \nWhether for developers chasing cutting-edge technologies, researchers focused on academic studies, or enthusiasts eager to explore AI applications, the MiniCPM Intel AIPC Client enables easy access to the power of local large models, opening the door to personalized AI exploration.  \n\n##### System Requirements  \n- Recommended processor: Intel Core Ultra 7 or higher (mobile version)  \n- Recommended RAM: 32GB or above\n\n##### Download\n\n[download](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM\u002Freleases\u002Ftag\u002F2.4.2)\n\n\u003C\u002Fdetails>\n\n\n## LICENSE\n\n#### Model LICENSE\n\n* This repository and MiniCPM models are released under the [Apache-2.0](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM\u002Fblob\u002Fmain\u002FLICENSE) License. \n\n#### Statement\n\n* As a language model, MiniCPM generates content by learning from a vast amount of text. \n* However, it does not possess the ability to comprehend or express personal opinions or value judgments. \n* Any content generated by MiniCPM does not represent the viewpoints or positions of the model developers. 
\n* Therefore, when using content generated by MiniCPM, users should take full responsibility for evaluating and verifying it on their own.\n\n## Institutions\n\nThis project is developed by the following institutions:\n\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_8eaab2faaba2.png\" width=\"28px\"> [Modelbest Inc.](https:\u002F\u002Fmodelbest.cn\u002F)\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_13da7b2eabfc.png\" width=\"28px\"> [THUNLP](https:\u002F\u002Fnlp.csai.tsinghua.edu.cn\u002F)\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_d9d8c88f8a4e.png\" width=\"28px\"> [Gaoling School of Artificial Intelligence of RUC](https:\u002F\u002Flinyankai.github.io\u002F)\n\n## Citation\n\n* Please cite our paper: [MiniCPM4](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.07900) if you find our work valuable.\n\n```\n@article{minicpm4,\n  title={Minicpm4: Ultra-efficient llms on end devices},\n  author={MiniCPM, Team},\n  journal={arXiv preprint arXiv:2506.07900},\n  year={2025}\n}\n```\n","\u003Cdiv align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_ba52a6d2c048.png\" width=\"500em\" >\u003C\u002Fimg> \n\u003C\u002Fdiv>\n\n\u003Ch4 align=\"center\">\n    \u003Cp>\n        \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM\u002Fblob\u002Fmain\u002FREADME-cn.md\">中文\u003C\u002Fa> | \u003Cb>English\u003C\u002Fb>\n    \u003Cp>\n\u003C\u002Fh4>\n\n\u003Cp align=\"center\">\n\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F2506.07900\" target=\"_blank\">MiniCPM论文\u003C\u002Fa> |\n\u003Ca href=\"https:\u002F\u002Fmodelbest.feishu.cn\u002Fwiki\u002FD2tFw8Pcsi5CIzkaHNacLK64npg\" target=\"_blank\">MiniCPM维基（中文）\u003C\u002Fa> |\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM-V\u002F\" target=\"_blank\">MiniCPM-V仓库\u003C\u002Fa> |\n加入我们的\u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002F3cGQn9b3YM\" target=\"_blank\">Discord\u003C\u002Fa>和\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM\u002Fblob\u002Fmain\u002Fassets\u002Fwechat.jpg\" target=\"_blank\">微信\u003C\u002Fa>群组 |\n\u003Ca href=\"https:\u002F\u002Fmp.weixin.qq.com\u002Fs\u002FKIhH2nCURBXuFXAtYRpuXg?poc_token=HBIsUWijxino8oJ5s6HcjcfXFRi0Xj2LJlxPYD9c\">加入我们\u003C\u002Fa>\n\u003C\u002Fp>\n\n> [!NOTE]\n> ### 🏆 2026稀疏算子加速与竞赛（SOAR）现已启动！\n>\n> **MiniCPM-SALA架构仅仅是个开始。要充分发挥其潜力，还需要系统层面的深度协同以及跨层编译优化。**\n>\n> OpenBMB联合**SGLang**和**NVIDIA**，诚邀全球极客挑战在专用**NVIDIA 6000D**环境下实现**9B规模、1M token推理**的极限。\n>\n> * 💰 **奖金池：** 超过10万美元（最高奖金：**89,000美元**）\n> * 🚀 **目标：** 通过跨层编译优化单批次及多批次性能。\n>\n> 👉 **[了解更多并报名](https:\u002F\u002Fsoar.openbmb.cn\u002F)**\n\n## 变更日志🔥\n- [2026.02.11] **[MiniCPM-SALA](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-SALA)** 发布！这是首个有效整合稀疏注意力与线性注意力的大规模混合模型，可支持百万token上下文建模。🔥🔥🔥\n- [2025.09.29] **[InfLLM-V2论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.24663)发布！** 我们仅需5B长文本token即可训练出稀疏注意力模型。🔥🔥🔥\n- [2025.09.05] **[MiniCPM4.1系列](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fopenbmb\u002Fminicpm-4-6841ab29d180257e940baa9b)** 发布！该系列是具备可训练稀疏注意力的混合推理模型，既可用于深度推理模式，也可用于非推理模式。🔥🔥🔥\n- [2025.06.06] 发布了[**MiniCPM4**](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fopenbmb\u002Fminicpm-4-6841ab29d180257e940baa9b)! 
该模型在保持同等规模下最佳性能的同时，实现了极致的效率提升！在典型终端芯片上，生成速度可提升5倍以上！\n- [2024.09.05] 我们发布了[**MiniCPM3-4B**](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM3-4B)! 该模型性能超越Phi-3.5-mini-instruct和GPT-3.5-Turbo-0125，并可与Llama3.1-8B-Instruct、Qwen2-7B-Instruct以及GLM-4-9B-Chat等7B至9B参数量级的模型相媲良。\n- [2024.07.05] 发布了[**MiniCPM-S-1B**](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-S-1B-sft)! 该模型在FFN层平均稀疏度达到87.89%，使FFN FLOPs减少84%，同时保持下游任务性能。\n- [2024.04.11] 发布了[**MiniCPM-2B-128k**](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-2B-128k)、[**MiniCPM-MoE-8x2B**](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-MoE-8x2B)以及[**MiniCPM-1B**](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-1B-sft-bf16)! 点击[这里](https:\u002F\u002Fopenbmb.vercel.app\u002F)阅读我们的技术博客。\n- [2024.02.01] 发布了[**MiniCPM-2B**](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-2B-sft-bf16)! 该模型在公开基准测试中表现与Mistral-7B相当（尤其在中文、数学和代码能力方面更胜一筹），整体性能优于Llama2-13B、MPT-30B和Falcon-40B等模型。\n\n## 快速链接\n\n- [变更日志🔥](#changelog)\n- [快速链接](#quick-links)\n- [模型下载](#model-downloads)\n- [MiniCPM-SALA](#minicpm-sala)\n- [MiniCPM4及MiniCPM4.1系列](#minicpm4-and-minicpm41-series)\n    - [亮点](#highlights)\n    - [简介](#introduction)\n  - [评估结果](#evaluation-results)\n    - [效率评估](#efficiency-evaluation)\n    - [综合评估](#comprehensive-evaluation)\n    - [长文本评估](#long-text-evaluation)\n  - [推理](#inference)\n    - [混合推理模式](#hybird-reasoning-mode)\n    - [HuggingFace](#huggingface)\n    - [vLLM](#vllm)\n      - [推测解码](#speculative-decoding)\n        - [1. 下载MiniCPM4.1草稿模型](#1-download-minicpm41-draft-model)\n        - [2. 安装兼容EAGLE3的vLLM](#2-install-eagle3-compatible-vllm)\n        - [3. 启动带有推测解码的vLLM服务器](#3-launch-vllm-server-with-speculative-decoding)\n        - [4. 客户端使用示例](#4-client-usage-example)\n        - [vLLM配置参数](#vllm-configuration-parameters)\n      - [标准推理（无推测解码）](#standard-inference-without-speculative-decoding)\n    - [SGLang](#sglang)\n      - [推测解码](#speculative-decoding-1)\n        - [1. 下载MiniCPM4.1草稿模型](#1-download-minicpm41-draft-model-1)\n        - [2. 安装兼容EAGLE3的SGLang](#2-install-eagle3-compatible-sglang)\n        - [3. 启动带有推测解码的SGLang服务器](#3-launch-sglang-server-with-speculative-decoding)\n        - [4. 
客户端使用](#4-client-usage)\n        - [配置参数](#configuration-parameters)\n      - [标准推理（无推测解码）](#standard-inference-without-speculative-decoding-1)\n    - [CPM.cu](#cpmcu)\n    - [llama.cpp和Ollama](#llamacpp-and-ollama)\n      - [llama.cpp](#llamacpp)\n      - [Ollama](#ollama)\n  - [BitCPM4：量化](#bitcpm4-quantization)\n    - [BitCPM4评估](#bitcpm4-evaluation)\n    - [BitCPM4推理](#bitcpm4-inference)\n  - [MiniCPM4应用](#minicpm4-application)\n    - [MiniCPM4-Survey：可信调查生成](#minicpm4-survey-trustworthy-survey-generation)\n      - [演示与快速入门](#demo-and-quick-start)\n      - [性能评估](#performance-evaluation)\n    - [MiniCPM4-MCP：基于模型上下文协议的工具使用](#minicpm4-mcp-tool-use-with-model-context-protocol)\n      - [演示](#demo)\n      - [性能评估](#performance-evaluation-1)\n    - [MiniCPM英特尔AIPC客户端：全新的边缘大模型利器](#minicpm-intel-aipc-client-a-new-edge-large-model-powerhouse)\n      - [核心特性](#key-features)\n      - [系统要求](#system-requirements)\n      - [下载](#download)\n- [LICENSE](#license)\n    - [模型LICENSE](#model-license)\n    - [声明](#statement)\n- [机构](#institutions)\n- [引用](#citation)\n\n## 模型下载\n\n  | HuggingFace | ModelScope |\n  |-------------|------------|\n  | [MiniCPM-SALA](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-SALA) | [MiniCPM-SALA](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM-SALA) |\n  | [MiniCPM4.1-8B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B) | [MiniCPM4.1-8B](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4.1-8B) |\n  | [MiniCPM4.1-8B-GPTQ](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B-GPTQ) | [MiniCPM4.1-8B-GPTQ](https:\u002F\u002Fwww.modelscope.cn\u002Fopenbmb\u002FMiniCPM4.1-8B-GPTQ) | \n  | [MiniCPM4.1-8B-AutoAWQ](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B-AutoAWQ) | [MiniCPM4.1-8B-AutoAWQ](https:\u002F\u002Fwww.modelscope.cn\u002Fopenbmb\u002FMiniCPM4.1-8B-AutoAWQ) | \n  | [MiniCPM4.1-8B-Marlin](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B-Marlin) | [MiniCPM4.1-8B-Marlin](https:\u002F\u002Fwww.modelscope.cn\u002Fopenbmb\u002FMiniCPM4.1-8B-Marlin) | \n  | [MiniCPM4.1-8B-GGUF](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B-GGUF) | [MiniCPM4.1-8B-GGUF](https:\u002F\u002Fwww.modelscope.cn\u002Fopenbmb\u002FMiniCPM4.1-8B-GGUF) | \n  | [MiniCPM4.1-8B-MLX](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B-MLX) | [MiniCPM4.1-8B-MLX](https:\u002F\u002Fwww.modelscope.cn\u002Fopenbmb\u002FMiniCPM4.1-8B-MLX) | \n  | [MiniCPM4.1-8B-Eagle3](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B-Eagle3) | [MiniCPM4.1-8B-Eagle3](https:\u002F\u002Fwww.modelscope.cn\u002Fopenbmb\u002FMiniCPM4.1-8B-Eagle3) | \n  | [MiniCPM4-8B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-8B)    | [MiniCPM4-8B](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-8B) |\n  | [MiniCPM4-0.5B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-0.5B) | [MiniCPM4-0.5B](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-0.5B) |\n  | [BitCPM4-1B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FBitCPM4-1B)        | [BitCPM4-1B](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FBitCPM4-1B) |\n  | [BitCPM4-0.5B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FBitCPM4-0.5B)    | [BitCPM4-0.5B](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FBitCPM4-0.5B) |\n  | 
[MiniCPM4-Survey](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-Survey) | [MiniCPM4-Survey](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-Survey) |\n  | [MiniCPM4-MCP](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-MCP)  | [MiniCPM4-MCP](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-MCP) |\n\n\n\u003Cdetails>\n\u003Csummary>📋 点击查看所有 MiniCPM 系列模型\u003C\u002Fsummary>\n\n  | HuggingFace | ModelScope |\n  |-------------|------------|\n  | [MiniCPM4-8B-Eagle-FRSpec](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-8B-Eagle-FRSpec) | [MiniCPM4-8B-Eagle-FRSpec](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-8B-Eagle-FRSpec) |\n  | [MiniCPM4-8B-Eagle-FRSpec-QAT](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-8B-Eagle-FRSpec-QAT) | [MiniCPM4-8B-Eagle-FRSpec-QAT](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-8B-Eagle-FRSpec-QAT) |\n  | [MiniCPM4-8B-Eagle-vLLM](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-8B-Eagle-vLLM) | [MiniCPM4-8B-Eagle-vLLM](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-8B-Eagle-vLLM) |\n  | [MiniCPM4-8B-marlin-Eagle-vLLM](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-8B-marlin-Eagle-vLLM) | [MiniCPM4-8B-marlin-Eagle-vLLM](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-8B-marlin-Eagle-vLLM) |\n  | [MiniCPM4-0.5B-QAT-Int4-unquantized](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-0.5B-QAT-Int4-unquantized) | [MiniCPM4-0.5B-QAT-Int4-unquantized](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-0.5B-QAT-Int4-unquantized) |\n  | [MiniCPM4-0.5B-QAT-Int4-GPTQ-format](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-0.5B-QAT-Int4-GPTQ-format) | [MiniCPM4-0.5B-QAT-Int4-GPTQ-format](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM4-0.5B-QAT-Int4-GPTQ-format) |\n  | [MiniCPM3-4B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM3-4B) | [MiniCPM3-4B](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM3-4B) |\n  | [MiniCPM-2B-sft](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-2B-sft-bf16) | [MiniCPM-2B-sft](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FminiCPM-bf16)|\n  | [MiniCPM-2B-dpo](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-2B-dpo-bf16) | [MiniCPM-2B-dpo](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM-2B-dpo-bf16\u002Fsummary) |\n  | [MiniCPM-2B-128k](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-2B-128k) | [MiniCPM-2B-128k](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002Fopenbmb\u002FMiniCPM-2B-128k\u002Fsummary) |\n  | [MiniCPM-MoE-8x2B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-MoE-8x2B) | [MiniCPM-MoE-8x2B](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM-MoE-8x2B) |\n  | [MiniCPM-1B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-1B-sft-bf16) | [MiniCPM-1B](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM-1B-sft-bf16) |\n  | [MiniCPM-S-1B](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-S-1B-sft) | [MiniCPM-S-1B](https:\u002F\u002Fmodelscope.cn\u002Fmodels\u002FOpenBMB\u002FMiniCPM-S-1B-sft) |\n\u003C\u002Fdetails>\n\n## MiniCPM-SALA\n#### 亮点\n\nMiniCPM-SALA（稀疏注意力与线性注意力）是首个有效融合稀疏注意力和线性注意力的大规模混合模型，可实现百万标记的上下文建模。\n\n✅ 创新混合架构：将25%的稀疏注意力（InfLLM-v2）与75%的线性注意力（Lightning 
Attention）相结合，前者用于高保真度的长上下文建模，后者则确保全局效率。\n\n✅ 打破效率瓶颈：突破“计算墙”和“内存墙”，推理速度达到密集基线的3.5倍，同时KV缓存开销显著降低。\n\n✅ 百万标记上下文：借助HyPE（混合位置编码），模型可扩展至100万+标记，并保持强大的长度泛化能力。\n\n✅ HALO适配：通过层优化（HALO）实现混合注意力机制，这是一种新颖的蒸馏方法，能够有效地将密集注意力的能力迁移到混合架构中，从而避免纯线性模型常见的性能严重下降问题。\n\n#### 简介\n\nMiniCPM-SALA是一种高效的混合模型，其中25%的层采用[InfLLM-V2](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.24663)，其余75%则使用Lightning Attention。这种架构使得在消费级GPU（如NVIDIA RTX 5090）上即可进行百万标记的推理。\n\n- **SALA混合注意力机制**\n  - 集成25%的InfLLM-V2和75%的Lightning Attention，既能利用稀疏注意力对局部细节的精细关注，又能发挥线性注意力在全局上下文中的高效优势。\n\n- **从Transformer到混合模型的持续训练**\n  - 通过在预训练权重基础上进行架构转换，避免了冷启动训练的低效问题，从而将总训练预算降低至从头训练同等规模模型所需预算的约25%。\n\n- **[HyPE](https:\u002F\u002Farxiv.org\u002Fabs\u002F2601.22156)（混合位置编码）**\n  - 在短上下文和长上下文之间实现性能平衡，能够保持与Qwen3-8B等现代全注意力模型相当的知识、数学和编程等通用能力，并在多个长上下文基准测试中展现出显著优势。\n\n- **长序列的高效推理**\n  - 在A6000D上，当序列长度为256K标记时，推理速度可达Qwen3-8B的3.5倍；在NVIDIA A6000D和5090 GPU上，均可支持最高100万标记的上下文推理，而Qwen3-8B在此长度下会因内存不足（OOM）而失败。\n\n### 评估结果\n\n#### 效率评估\n\n我们在NVIDIA A6000D和RTX 5090 GPU上将MiniCPM-SALA（9B）与Qwen3-8B进行了对比，以评估其推理速度和内存效率。结果显示，MiniCPM-SALA不仅在首令牌生成时间（TTFT）上实现了高达2.5倍的加速，还克服了全注意力架构的内存瓶颈。相比之下，Qwen3-8B在较长序列下容易出现内存溢出错误，而MiniCPM-SALA则能在单个消费级RTX 5090上成功处理100万标记的上下文，真正实现了超长上下文推理在边缘设备上的普及。\n\n![inference_speed_a6000d](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_a0fe509948d6.png)\n\n![inference_speed_5090](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_f873b9c606f6.png)\n\n#### 长上下文评估\n\n在大多数涉及的长上下文基准测试中，MiniCPM-SALA的表现均优于其他同规模的开源大语言模型。特别是在RULER和NoLiMa测试中，它在所有上下文长度（最高128K）上均取得了最高分，并以38.97的总体平均分位居榜首，表明其在处理长上下文信息方面具有卓越性能。\n\n![long_text_evaluation](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_e0ff6f36c237.png)\n\n#### 超长上下文评估\n\n评估显示，MiniCPM-SALA具备出色的长度外推能力，在仅用520K标记进行训练的情况下，仍能在2048K标记的上下文中取得81.6分的成绩。值得注意的是，该模型并未使用YaRN等辅助技术，这很可能得益于其稀疏注意力层中无位置编码（NoPE）的设计。\n\n![ultra_long_text_evaluation](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_8ea1f8948235.png)\n\n#### 标准评估\n\nMiniCPM-SALA在标准基准测试中获得了76.53的平均分，表现优于Qwen3-8B和Falcon-H1R-7B等同类模型。该架构在知识、代码和数学等领域均表现出稳健的性能。\n\n![benchmark](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_6329cd82b749.png)\n\n### 推理\n\n为了获得最佳性能，我们建议将温度设置为0.9。\n\n#### HuggingFace\n\n我们的模型与Hugging Face Transformers完全兼容。您可以通过以下方式使用我们的模型进行推理：\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\nmodel_path = \"openbmb\u002FMiniCPM-SALA\"\ntokenizer = AutoTokenizer.from_pretrained(model_path)\nmodel = AutoModelForCausalLM.from_pretrained(model_path, trust_remote_code=True, device_map=\"auto\")\nmodel.eval()\n\nprompts = [\"我的名字是\", \"中国的首都是\"]\nwith torch.no_grad():\n    inputs = tokenizer(prompts, return_tensors=\"pt\").to(model.device)\n    outputs = model.generate(**inputs)\noutput_texts = tokenizer.batch_decode(outputs)\nprint(output_texts)\n```\n\n#### SGLang\n\n##### 系统要求\n\n- CUDA 12.x或更高版本\n- `gcc` \u002F `g++` 编译器\n- `uv` 包管理器（脚本会自动检查）\n\n##### 安装步骤\n\n```bash\n# 克隆仓库\ngit clone -b minicpm_sala https:\u002F\u002Fgithub.com\u002FOpenBMB\u002Fsglang.git\ncd sglang\n\n# 一键安装（创建虚拟环境并编译所有依赖）\nbash install_minicpm_sala.sh\n\n# 或者指定PyPI镜像\nbash install_minicpm_sala.sh https:\u002F\u002Fmirrors.tuna.tsinghua.edu.cn\u002Fpypi\u002Fweb\u002Fsimple\n```\n\n安装脚本将执行以下步骤：\n\n1. 创建名为`sglang_minicpm_sala_env`的虚拟环境（Python 3.12）\n2. 将依赖项克隆至`3rdparty\u002F`目录（infllmv2），并初始化子模块（sparse_kernel）\n3. 安装MiniCPM-SALA（当前仓库）\n4. 编译并安装`infllmv2_cuda_impl`\n5. 编译并安装`sparse_kernel`\n6. 
安装`tilelang`和`flash-linear-attention`\n\n##### 使用方法\n\n```bash\n# 激活环境\nsource sglang_minicpm_sala_env\u002Fbin\u002Factivate\n\n# 启动推理服务器（请将 MODEL_PATH 替换为实际路径）\nMODEL_PATH=\u002Fpath\u002Fto\u002Fyour\u002FMiniCPM-SALA\n\npython3 -m sglang.launch_server \\\n    --model ${MODEL_PATH} \\\n    --trust-remote-code \\\n    --disable-radix-cache \\\n    --attention-backend minicpm_flashinfer \\\n    --chunked-prefill-size 8192 \\\n    --max-running-requests 32 \\\n    --skip-server-warmup \\\n    --port 31111 \\\n    --dense-as-sparse\n```\n\n| 参数 | 描述 |\n|-----------|-------------|\n| `--trust-remote-code` | 允许模型中使用自定义代码 |\n| `--disable-radix-cache` | 禁用 RadixAttention 前缀缓存 |\n| `--attention-backend minicpm_flashinfer` | 使用 MiniCPM FlashInfer 后端 |\n| `--chunked-prefill-size 8192` | 分块预填充大小 |\n| `--max-running-requests 32` | 最大并发请求数 |\n| `--skip-server-warmup` | 跳过服务器预热 |\n| `--port 31111` | 服务器端口 |\n| `--dense-as-sparse` | 使用密集转稀疏模式 |\n\n##### 手动安装\n\n如果脚本无法正常工作，请按照以下步骤操作：\n\n```bash\n# 0. 确保已安装 uv\npip install uv\n\n# 1. 创建虚拟环境\nuv venv --python 3.12 sglang_minicpm_sala_env\nsource sglang_minicpm_sala_env\u002Fbin\u002Factivate\n\n# 2. 安装 SGLang\nuv pip install --upgrade pip setuptools wheel\nuv pip install -e .\u002Fpython[all]\n\n# 3. 编译 CUDA 扩展\n# （确保依赖项已克隆到 3rdparty\u002F 目录下）\ncd 3rdparty\u002Finfllmv2_cuda_impl && python setup.py install && cd ..\u002F..\ncd 3rdparty\u002Fsparse_kernel && python setup.py install && cd ..\u002F..\n\n# 4. 安装额外依赖\nuv pip install tilelang flash-linear-attention\n```\n\n##### 问答\n\n**问：CUDA 扩展编译失败？**\n\n- 请确保已安装 CUDA 12 或更高版本（运行 `nvcc --version` 检查）。\n- 确保系统中已安装 `gcc` 和 `g++`。\n- 如果 `CXX` 被设置为 `clang++ -pthread`，请手动执行 `export CXX=g++`。\n\n## MiniCPM4 和 MiniCPM4.1 系列\n\n\u003Cdiv align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=VouXjLHKDUY\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_6ae0f6f9cd42.jpg\", width=70%>\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n#### 亮点\nMiniCPM 4.1-8B 是首个具有可训练稀疏注意力机制的开源推理大模型：\n\n✅ 强大的推理能力：在 15 项任务中超越同类规模模型！\n\n✅ 高效生成：推理速度提升 3 倍\n\n✅ 高效架构：可训练稀疏注意力机制、基于频率排序的推测式解码\n\n#### 简介\nMiniCPM4 和 MiniCPM4.1 系列是高度高效的大型语言模型（LLMs），专为终端设备设计。其高效性得益于在四个关键维度上的系统性创新：模型架构、训练数据、训练算法和推理系统。\n\n- 🏗️ **高效模型架构：**\n  - InfLLM-V2 — 可训练稀疏注意力机制：采用可训练稀疏注意力机制架构，在处理 128K 长文本时，每个 token 只需与不到 5% 的其他 token 计算相关性，从而显著降低长文本计算开销（[InfLLM-V2 训练内核](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002Finfllmv2_cuda_impl)）\n\n- 🧠 **高效学习算法：**\n  - Model Wind Tunnel 2.0 — 高效可预测缩放：引入下游任务性能的缩放预测方法，实现更精准的模型训练配置搜索\n  - BitCPM — 极致三值量化：将模型参数位宽压缩至 3 种取值，实现 90% 的极端模型位宽缩减\n  - 高效训练工程优化：采用 FP8 低精度计算技术，并结合多 token 预测训练策略\n\n- 📚 **高质量训练数据：**\n  - UltraClean — 高质量预训练数据筛选与生成：基于高效数据验证构建迭代式数据清洗策略，开源高质量中英文预训练数据集 [UltraFinweb](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fopenbmb\u002FUltra-FineWeb)\n  - UltraChat v2 — 高质量监督微调数据生成：构建大规模高质量监督微调数据集，涵盖知识密集型、推理密集型、指令遵循、长文本理解以及工具调用等多个维度\n\n- ⚡ **高效推理与部署系统：**\n  - CPM.cu — 轻量高效 CUDA 推理框架：集成稀疏注意力、模型量化和推测采样，实现高效预填充和解码（[推理内核与框架](https:\u002F\u002Fgithub.com\u002Fopenbmb\u002Fcpm.cu)）\n  - ArkInfer — 跨平台部署系统：支持在多种后端环境中高效部署，提供灵活的跨平台适配能力\n\n### 评估结果\n\n#### 效率评估\n在 Jetson AGX Orin 和 RTX 4090 这两种典型终端芯片上，MiniCPM4 和 MiniCPM4.1 在长文本处理任务中表现出显著优于同类规模模型的速度。随着文本长度增加，MiniCPM4 和 MiniCPM4.1 的效率优势愈发明显。在 Jetson AGX Orin 平台上，与 Qwen3-8B 相比，MiniCPM4 和 MiniCPM4.1 的解码速度提升了约 7 倍。\n\n![benchmark](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_f81787316507.png)\n\nMiniCPM4.1 在推理任务中实现了 3 
倍的解码速度提升。\n\n![benchmark](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_2f353ff93814.png)\n\n#### 综合评估\nMiniCPM4 推出了 8B 和 0.5B 参数规模的终端版本，两者均在各自类别中达到最佳性能。\n\n![benchmark](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_2d05de343736.png)\n\nMiniCPM4.1 推出了 8B 参数规模的终端版本，在深度推理模式下达到了同类最佳性能。\n\n![benchmark](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_200804218fab.png)\n\n#### 长文本评估\nMiniCPM4 经过 32K 长文本的预训练，并通过 YaRN 技术实现长度扩展。在 128K 长文本“大海捞针”任务中，MiniCPM4 表现出色。MiniCPM4.1 则经过 64K 长文本的预训练，并同样通过 YaRN 技术实现长度扩展。在 128K 长文本“大海捞针”任务中，MiniCPM4.1 也展现了卓越的表现。\n\n![long-niah](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_12a1fce44bc4.png)\n\n### 推理\nMiniCPM 4.1 可与以下框架配合使用：Huggingface Transformers、SGLang、vLLM 和 CPM.cu。若追求极致的推理速度，我们强烈推荐使用 CPM.cu。\n\nMiniCPM4\u002FMiniCPM4.1 同时支持密集注意力推理和稀疏注意力推理模式，而 vLLM 和 SGLang 目前仅支持密集推理模式。如需使用稀疏推理模式，请选用 Huggingface Transformers 和 CPM.cu。\n\n- 密集注意力推理：vLLM、SGLang、Huggingface Transformers\n- 稀疏注意力推理：Huggingface Transformers、CPM.cu\n\n#### 混合推理模式\n\nMiniCPM4.1 支持混合推理模式，该模式既可用于深度推理，也可用于非推理模式。要启用混合推理模式，用户可在 `tokenizer.apply_chat_template` 中设置 `enable_thinking=True` 来启用混合推理模式，或设置 `enable_thinking=False` 来启用非推理模式。同样地，用户也可以在查询末尾直接添加 `\u002Fno_think` 来启用非推理模式。若未添加任何特殊标记或在查询末尾添加 `\u002Fthink`，模型将自动启用推理模式。\n\n```python\n# 启用推理模式\nprompt_text = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True,\n    enable_thinking=True\n)\n# 启用非推理模式\nprompt_text = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True,\n    enable_thinking=False\n)\n```\n\n#### HuggingFace\n\n- **密集注意力推理**\n```python\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nimport torch\ntorch.manual_seed(0)\n\npath = 'openbmb\u002FMiniCPM4.1-8B'\ndevice = \"cuda\"\ntokenizer = AutoTokenizer.from_pretrained(path)\nmodel = AutoModelForCausalLM.from_pretrained(path, torch_dtype=torch.bfloat16, device_map=device, trust_remote_code=True)\n\n# 用户可以直接使用聊天接口\n# responds, history = model.chat(tokenizer, \"写一篇关于人工智能的文章。\", temperature=0.7, top_p=0.7)\n# print(responds)\n\n# 用户也可以使用 generate 接口\nmessages = [\n    {\"role\": \"user\", \"content\": \"写一篇关于人工智能的文章。\"},\n]\nprompt_text = tokenizer.apply_chat_template(\n    messages,\n    tokenize=False,\n    add_generation_prompt=True,\n)\nmodel_inputs = tokenizer([prompt_text], return_tensors=\"pt\").to(device)\n\nmodel_outputs = model.generate(\n    **model_inputs,\n    max_new_tokens=32768,\n    top_p=0.95,\n    temperature=0.6\n)\noutput_token_ids = [\n    model_outputs[i][len(model_inputs[i]):] for i in range(len(model_inputs['input_ids']))\n]\n\nresponses = tokenizer.batch_decode(output_token_ids, skip_special_tokens=True)[0]\nprint(responses)\n```\n\n- **稀疏注意力推理**\n本模型支持 InfLLM v2，这是一种专为高效长序列推理设计的稀疏注意力机制。它需要 [infllmv2_cuda_impl](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002Finfllmv2_cuda_impl) 库。\n\n您可以通过运行以下命令来安装：\n\n```bash\ngit clone -b feature_infer https:\u002F\u002Fgithub.com\u002FOpenBMB\u002Finfllmv2_cuda_impl.git\ncd infllmv2_cuda_impl\ngit submodule update --init --recursive\npip install -e . 
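\n# （可选）快速检查 CUDA 扩展是否编译安装成功；下面的模块名 infllmv2_cuda_impl 为假设值，实际名称请以该仓库文档为准\n# python -c 'import infllmv2_cuda_impl'\n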
# 或 python setup.py install \n```\n\n要启用 InfLLM v2，您需要在 `config.json` 中添加 `sparse_config` 字段：\n\n```json\n{\n    ...,\n    \"sparse_config\": {\n        \"kernel_size\": 32,\n        \"kernel_stride\": 16,\n        \"init_blocks\": 1,\n        \"block_size\": 64,\n        \"window_size\": 2048,\n        \"topk\": 64,\n        \"use_nope\": false,\n        \"dense_len\": 8192\n    }\n}\n```\n\n这些参数控制着 InfLLM v2 的行为：\n\n* `kernel_size`（默认值：32）：语义核的大小。\n* `kernel_stride`（默认值：16）：相邻核之间的步长。\n* `init_blocks`（默认值：1）：每个查询 token 首先关注的块数。这可确保对序列开头的关注。\n* `block_size`（默认值：64）：键值块的块大小。\n* `window_size`（默认值：2048）：局部滑动窗口的大小。\n* `topk`（默认值：64）：指定每个 token 只会与最相关的前 k 个键值块计算注意力。\n* `use_nope`（默认值：false）：是否在块选择中使用 NOPE 技术以提升性能。\n* `dense_len`（默认值：8192）：由于稀疏注意力对于短序列的收益有限，模型可以对较短文本使用标准（密集）注意力。当序列的 token 长度低于 `dense_len` 时，模型将使用密集注意力；超过此长度则切换到稀疏注意力。将其设置为 `-1` 则无论序列长度如何均始终使用稀疏注意力。\n\n- **长上下文扩展**\nMiniCPM4.1 原生支持高达 65,536（64k）token 的上下文长度。对于总长度（包括输入和输出）显著超出此限制的对话，我们建议使用 RoPE 缩放技术来有效处理长文本。我们已通过修改 LongRoPE 因子，验证了模型在高达 131,072 token 上下文长度下的表现。\n\n您可以通过修改模型文件来应用 LongRoPE 因子的调整。具体而言，在 `config.json` 文件中，调整 `rope_scaling` 字段。\n\n```json\n{\n    ...,\n    \"rope_scaling\": {\n        \"rope_type\": \"longrope\", \n        \"long_factor\": [0.9982316082870437, 1.033048153422584, 1.0749920956484724, 1.1255096879436193, 1.1863348602111476, 1.259543828902579, 1.3476188888731149, 1.4535223827776373, 1.5807816745852985, 1.7335856049489526, 1.9168922912975785, 2.1365471404135326, 2.3994084200118646, 2.713475511863602, 3.0880118452194134, 3.533650295140154, 4.062463396503134, 4.687974098908333, 5.425075306704039, 6.289818967956352, 7.29902962722721, 8.6357018163639, 10.210822723989212, 12.053807765671676, 14.193944598909404, 16.65780676784363, 19.463620727694074, 22.628311203524586, 26.150106147261315, 30.02526691405111, 34.23183327975347, 38.73811934094828, 43.502489489729555, 48.47627117965394, 53.61139491762471, 58.857366522037935, 64.16798299215064, 69.51359464319125, 74.86555458220285, 80.21497790341579, 85.55322183307433, 90.89611806932027, 96.26245306514224, 101.68269304046481, 107.18619510219668, 112.82253283014026, 118.63764063163615, 119.88866203644656, 120.9462882391725, 121.837565139014, 122.58663780572562, 123.2147719894291, 123.74049454862576, 124.17980424685767, 124.54641761955492, 124.85202548028222, 125.10654406389756, 125.31835105170659, 125.49450117164764, 125.64091910903052, 125.76256945356558, 125.86360463815589, 125.94749252260765, 126.01712561287873],\n        \"short_factor\": [0.9982316082870437, 1.033048153422584, 1.0749920956484724, 1.1255096879436193, 1.1863348602111476, 1.259543828902579, 1.3476188888731149, 1.4535223827776373, 1.5807816745852985, 1.7335856049489526, 1.9168922912975785, 2.1365471404135326, 2.3994084200118646, 2.713475511863602, 3.0880118452194134, 3.533650295140154, 4.062463396503134, 4.687974098908333, 5.425075306704039, 6.289818967956352, 7.29902962722721, 8.6357018163639, 10.21082...,\n        \"original_max_position_embeddings\": 65536\n    }\n}\n```\n\n#### vLLM\n\n##### 推测解码\n\n使用 vLLM 进行推测解码加速推理时，请按照以下步骤操作：\n\n###### 1. 下载 MiniCPM4.1 草稿模型\n\n首先，下载 MiniCPM4.1 草稿模型：\n\n```bash\ncd \u002Fyour_path\ngit clone https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B-Eagle3\n```\n\n###### 2. 安装兼容 EAGLE3 的 vLLM\n\nEAGLE3 的 vLLM PR 已提交。目前请使用我们的仓库进行安装：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FLDLINGLINGLING\u002Fvllm.git\ncd vllm \npip install -e .\n```\n\n###### 3. 
启动支持推测解码的 vLLM 服务器\n\n启动已启用推测解码功能的 vLLM 推理服务器。请确保将推测配置中的模型路径更新为指向您下载的 MiniCPM4_1-8B-Eagle3-bf16 文件夹：\n\n```bash\nVLLM_USE_V1=1 \\\nvllm serve openbmb\u002FMiniCPM4.1-8B \\\n--seed 42 \\\n--trust-remote-code \\\n--speculative-config '{\n  \"model\": \"your\u002Fpath\u002FMiniCPM4_1-8B-Eagle3-bf16\",\n  \"num_speculative_tokens\": 3,\n  \"method\": \"eagle3\",\n  \"draft_tensor_parallel_size\": 1\n}'\n```\n\n###### 4. 客户端使用示例\n\n无论是标准推理还是推测解码，客户端的使用方式都相同：\n\n```python\nimport openai\n\nclient = openai.Client(base_url=\"http:\u002F\u002Flocalhost:8000\u002Fv1\", api_key=\"EMPTY\")\n\nresponse = client.chat.completions.create(\n    model=\"openbmb\u002FMiniCPM4.1-8B\",\n    messages=[\n        {\"role\": \"user\", \"content\": \"写一篇关于人工智能的文章。\"},\n    ],\n    temperature=0.6,\n    max_tokens=32768,\n    extra_body=dict(add_special_tokens=True),  # 确保为聊天模板添加特殊标记\n)\n\nprint(response.choices[0].message.content)\n```\n\n###### vLLM 配置参数\n\n- `VLLM_USE_V1=1`：启用 vLLM v1 API\n- `--speculative-config`：推测解码的 JSON 配置\n  - `model`：用于推测的草稿模型路径\n  - `num_speculative_tokens`：推测令牌数量（默认值为 3）\n  - `method`：推测解码方法（eagle3）\n  - `draft_tensor_parallel_size`：草稿模型的张量并行规模（默认值为 1）\n- `--seed`：随机种子，用于保证结果可重复性\n- `--trust-remote-code`：允许执行自定义模型的远程代码\n\n##### 标准推理（无推测解码）\n\n目前，您需要安装最新版本的 vLLM。\n\n```bash\npip install -U vllm \\\n    --pre \\\n    --extra-index-url https:\u002F\u002Fwheels.vllm.ai\u002Fnightly\n```\n\n然后您可以使用 vLLM 对 MiniCPM4.1-8B 进行推理：\n\n```python\nfrom transformers import AutoTokenizer\nfrom vllm import LLM, SamplingParams\n\nmodel_name = \"openbmb\u002FMiniCPM4.1-8B\"\nprompt = [{\"role\": \"user\", \"content\": \"写一篇关于人工智能的文章。\"}]\n\ntokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)\ninput_text = tokenizer.apply_chat_template(prompt, tokenize=False, add_generation_prompt=True)\n\nllm = LLM(\n    model=model_name,\n    trust_remote_code=True,\n    max_num_batched_tokens=65536,\n    dtype=\"bfloat16\",\n    gpu_memory_utilization=0.8,\n)\nsampling_params = SamplingParams(top_p=0.95, temperature=0.6, max_tokens=32768)\n\noutputs = llm.generate(prompts=input_text, sampling_params=sampling_params)\n\nprint(outputs[0].outputs[0].text)\n```\n\n此外，您还可以通过运行以下命令来启动推理服务器：\n\n> **注意**：在 vLLM 的聊天 API 中，默认情况下 `add_special_tokens` 是 `False`。这意味着重要的特殊标记（例如序列开始标记 BOS）不会自动添加。为了确保输入提示正确地格式化以供模型使用，您应显式设置 `extra_body={\"add_special_tokens\": True}`。\n\n```bash\nvllm serve openbmb\u002FMiniCPM4.1-8B --trust-remote-code\n```\n\n然后您可以使用以下代码调用聊天接口：\n\n```python\nimport openai\n\nclient = openai.Client(base_url=\"http:\u002F\u002Flocalhost:8000\u002Fv1\", api_key=\"EMPTY\")\n\nresponse = client.chat.completions.create(\n    model=\"openbmb\u002FMiniCPM4.1-8B\",\n    messages=[\n        {\"role\": \"user\", \"content\": \"写一篇关于人工智能的文章。\"},\n    ],\n    temperature=0.6,\n    max_tokens=32768,\n    extra_body=dict(add_special_tokens=True),  # 确保为聊天模板添加特殊标记\n)\n\nprint(response.choices[0].message.content)\n```\n\n#### SGLang\n\n##### 推测解码\n\n要实现推测解码加速推理，请按照以下步骤操作：\n\n###### 1. 下载 MiniCPM4.1 草稿模型\n\n首先，下载 MiniCPM4.1 草稿模型：\n\n```bash\ncd \u002Fyour_path\ngit clone https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B-Eagle3\n```\n\n###### 2. 安装与 EAGLE3 兼容的 SGLang\n\nEAGLE3 适配的 PR 已经提交。目前，请使用我们的仓库进行安装：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FLDLINGLINGLING\u002Fsglang.git\ncd sglang\npip install -e .\n```\n\n###### 3. 
启动支持推测解码的 SGLang 服务器\n\n启动启用推测解码功能的 SGLang 服务器：\n\n```bash\npython -m sglang.launch_server \\\n  --model-path \"openbmb\u002FMiniCPM4.1-8B\" \\\n  --host \"127.0.0.1\" \\\n  --port 30002 \\\n  --mem-fraction-static 0.9 \\\n  --speculative-algorithm EAGLE3 \\\n  --speculative-draft-model-path \"your\u002Fpath\u002FMiniCPM4_1-8B-Eagle3-bf16\" \\\n  --speculative-num-steps 3 \\\n  --speculative-eagle-topk 1 \\\n  --speculative-num-draft-tokens 32 \\\n  --temperature 0.7\n```\n\n###### 4. 客户端使用方法\n\n无论是标准解码还是推测解码，客户端的使用方式都相同：\n\n```python\nimport openai\n\nclient = openai.Client(base_url=f\"http:\u002F\u002Flocalhost:30002\u002Fv1\", api_key=\"None\")\n\nresponse = client.chat.completions.create(\n    model=\"openbmb\u002FMiniCPM4.1-8B\",\n    messages=[\n        {\"role\": \"user\", \"content\": \"写一篇关于人工智能的文章。\"},\n    ],\n    temperature=0.6,\n    max_tokens=32768,\n)\n\nprint(response.choices[0].message.content)\n```\n\n注意：请确保在客户端代码中将端口号更新为与服务器端口一致（在推测解码示例中为 30002）。\n\n###### 配置参数\n\n- `--speculative-algorithm EAGLE3`：启用 EAGLE3 推测解码\n- `--speculative-draft-model-path`：推测用草稿模型的路径\n- `--speculative-num-steps`：推测步骤数（默认：3）\n- `--speculative-eagle-topk`：EAGLE 的 Top-k 参数（默认：1）\n- `--speculative-num-draft-tokens`：草稿令牌数量（默认：32）\n- `--mem-fraction-static`：静态内存分配比例（默认：0.9）\n\n##### 标准推理（无推测解码）\n\n目前，您需要安装我们分叉的 SGLang 版本。\n\n```bash\ngit clone -b openbmb https:\u002F\u002Fgithub.com\u002FOpenBMB\u002Fsglang.git\ncd sglang\n\npip install --upgrade pip\npip install -e \"python[all]\"\n```\n\n您可以通过运行以下命令来启动推理服务器：\n\n```bash\npython -m sglang.launch_server --model openbmb\u002FMiniCPM4.1-8B --trust-remote-code --port 30000 --chat-template chatml\n```\n\n然后，您可以使用以下命令调用聊天接口：\n\n```python\nimport openai\n\nclient = openai.Client(base_url=f\"http:\u002F\u002Flocalhost:30000\u002Fv1\", api_key=\"None\")\n\nresponse = client.chat.completions.create(\n    model=\"openbmb\u002FMiniCPM4.1-8B\",\n    messages=[\n        {\"role\": \"user\", \"content\": \"写一篇关于人工智能的文章。\"},\n    ],\n    temperature=0.6,\n    max_tokens=32768,\n)\n\nprint(response.choices[0].message.content)\n```\n\n\n#### CPM.cu\n\n我们**推荐**使用 [CPM.cu](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FCPM.cu) 来进行 MiniCPM4 和 MiniCPM4.1 的推理。CPM.cu 是由 OpenBMB 开发的 CUDA 推理框架，集成了高效的稀疏采样、推测采样和量化技术，能够充分发挥 MiniCPM4 和 MiniCPM4.1 的效率优势。\n\n您可以通过运行以下命令来安装 CPM.cu：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FCPM.cu.git --recursive\ncd CPM.cu\npython3 setup.py install\n```\n\n您可以运行以下命令来测试模型的速度：\n\n```bash\npython3 tests\u002Flong_prompt_gen.py # 生成 prompt.txt\npython3 tests\u002Ftest_generate.py --prompt-file prompt.txt\n```\n\n您还可以运行以下命令来使用 EAGLE3 推测解码算法进行推理：\n\n```bash\npython3 -m cpmcu.cli \\\n    --model-path $BASE_MODEL_PATH \\\n    --draft-model-path $EAGLE3_DRAFT_MODEL_PATH \\\n    --prompt-text \"给我讲讲清华大学吧\" \\\n    --use-eagle3 true\n```\n\n有关 CPM.cu 的更多详细信息，请参阅 [CPM.cu](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FCPM.cu) 仓库。\n\n\n#### llama.cpp 和 Ollama\n\n我们还支持使用 [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggml-org\u002Fllama.cpp) 和 [Ollama](https:\u002F\u002Follama.com\u002F) 进行推理。\n\n##### llama.cpp\n\n您可以从 [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4.1-8B-GGUF) 下载 MiniCPM4.1-8B 模型的 GGUF 格式，并使用 llama.cpp 进行高效的 CPU 或 GPU 推理。\n```\n\n\n# 案例 1：主命令行\n.\u002Fbuild\u002Fbin\u002Fllama-cli -m MiniCPM4.1-8B-Q4_K_M.gguf -p \"写一篇关于人工智能的文章。\" -n 1500\n\n# 案例 2：服务器\n## 启动服务器\n.\u002Fbuild\u002Fbin\u002Fllama-server -m MiniCPM4.1-8B-Q4_K_M.gguf --host 127.0.0.1 --port 8080 -c 4096 -fa on &\n\n## 发送请求\ncurl 
-X POST http:\u002F\u002F127.0.0.1:8080\u002Fv1\u002Fchat\u002Fcompletions \\\n  -H \"Content-Type: application\u002Fjson\" \\\n  -d '{\n    \"model\": \"gpt-3.5-turbo\",\n    \"messages\": [{\"role\": \"user\", \"content\": \"写一篇关于人工智能的文章。\"}],\n    \"max_tokens\": 1500\n  }'\n```\n\n##### Ollama\n\n请参考 [模型中心](https:\u002F\u002Follama.com\u002Fopenbmb\u002Fminicpm4.1) 下载模型。安装 ollama 包后，您可以使用以下命令运行 MiniCPM4.1：\n```bash\nollama run openbmb\u002Fminicpm4.1\n```\n\n### BitCPM4：量化\n\nBitCPM4 是通过量化感知训练（QAT）从 MiniCPM 系列模型衍生出的三值量化模型，显著提升了训练效率和模型参数效率。\n- 训练方法的改进\n  - 在小型模型上利用模型风洞（Model Wind Tunnel）搜索超参数。\n  - 采用两阶段训练法：先以高精度训练，再进行 QAT，充分利用已训练好的高精度模型，从而大幅减少 QAT 阶段所需的计算资源。\n- 高参数效率\n  - 仅以 1.58 位的比特宽度，便能达到与同规模全精度模型相当的性能，展现出极高的参数效率。\n\n#### BitCPM4 评估\n\nBitCPM4 的性能与其他同等规模的全精度模型相当。\n\n![bitcpm-benchmark](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_18dd13583f1a.png)\n\n#### BitCPM4 推理\n\nBitCPM4 的参数以伪量化格式存储，支持直接在 Hugging Face 框架内进行推理。\n\n### MiniCPM4 应用\n\u003Cdetails>\n\u003Csummary>点击查看 MiniCPM4 应用详情\u003C\u002Fsummary>\n\n#### MiniCPM4-Survey：可信综述生成\n\n**MiniCPM4-Survey** 是由 [THUNLP](https:\u002F\u002Fnlp.csai.tsinghua.edu.cn)、中国人民大学和 [ModelBest](https:\u002F\u002Fmodelbest.cn\u002Fen) 联合开发的开源 LLM 代理模型。它基于 MiniCPM4-8B 构建，能够接收用户的查询作为输入，并自主生成可信的长篇综述报告。\n\n其主要特点包括：\n\n- **计划-检索-撰写综述生成框架** — 我们提出了一种多智能体生成框架，该框架通过三个核心阶段运行：计划（定义综述的整体结构）、检索（生成合适的检索关键词）和撰写（综合检索到的信息以生成连贯的章节级内容）。\n\n- **高质量数据集构建** — 我们收集并处理大量专家撰写的综述论文，以构建高质量的训练数据集。同时，我们还收集了大量研究论文来建立检索数据库。\n\n- **多维度奖励设计** — 我们精心设计了一个包含结构、内容和引用三个方面的奖励体系，用于评估综述的质量，并将其作为强化学习训练阶段的奖励函数。\n\n- **多步强化学习训练策略** — 我们提出了一种*上下文管理器*，以确保关键信息的保留，同时促进高效的推理；此外，我们还构建了*并行环境*，以维持高效的强化学习训练循环。\n\n##### 演示与快速入门\n\n请参阅[此处](.\u002Fdemo\u002Fminicpm4\u002FSurveyGeneration\u002FREADME.md)。\n\n##### 性能评估\n\n| 方法 | 相关性 | 覆盖面 | 深度 | 新颖性 | 平均分 | 事实准确性 |\n|------|--------|--------|------|--------|--------|------------|\n| 简单RAG（由G2FT驱动） | 3.25 | 2.95 | 3.35 | 2.60 | 3.04 | 43.68 |\n| AutoSurvey（由G2FT驱动） | 3.10 | 3.25 | 3.15 | **3.15** | 3.16 | 46.56 |\n| Webthinker（由WTR1-7B驱动） | 3.30 | 3.00 | 2.75 | 2.50 | 2.89 | -- |\n| Webthinker（由QwQ-32B驱动） | 3.40 | 3.30 | 3.30 | 2.50 | 3.13 | -- |\n| OpenAI深度研究（由GPT-4o驱动） | 3.50 | **3.95** | 3.55 | 3.00 | **3.50** | -- |\n| MiniCPM-4-Survey | 3.45 | 3.70 | **3.85** | 3.00 | **3.50** | **68.73** |\n| &nbsp;&nbsp;&nbsp;*无*强化学习 | **3.55** | 3.35 | 3.30 | 2.25 | 3.11 | 50.24 |\n\n*综述生成系统的性能对比。\"G2FT\"代表Gemini-2.0-Flash-Thinking，而\"WTR1-7B\"则表示Webthinker-R1-7B。由于Webthinker不包含引用功能，且OpenAI深度研究在导出结果时也不提供引用，因此未对这两者进行事实准确性评估。*\n\n#### MiniCPM4-MCP：基于模型上下文协议的工具使用\n\n**MiniCPM4-MCP** 是由 [THUNLP](https:\u002F\u002Fnlp.csai.tsinghua.edu.cn)、中国人民大学以及 [ModelBest](https:\u002F\u002Fmodelbest.cn\u002Fen) 联合开发的一款开源端侧 LLM 代理模型，其基础为拥有 80 亿参数的 [MiniCPM-4](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM4-8B)。它能够通过 MCP 与各类工具和数据资源交互，从而解决广泛的现实世界任务。截至目前，MiniCPM4-MCP 支持以下功能：\n\n- 在 16 个 MCP 服务器上使用工具：这些服务器涵盖办公、生活、通信、信息和工作管理等多个类别。\n\n- 单一工具调用能力：它可以使用符合 MCP 标准的单一工具执行单步或多步工具调用。\n\n- 跨工具调用能力：它可以使用不同的符合 MCP 标准的工具执行单步或多步工具调用。\n\n##### 演示\n\n演示可在该[链接](.\u002Fdemo\u002Fminicpm4\u002FMCP\u002FREADME_en.md)中查看。\n\n##### 性能评估\n\n| MCP服务器 | gpt-4o | | | qwen3 | | | minicpm4 | | 
|\n|-----------------------|----------------|--------------|--------------|---------------|--------------|--------------|----------------|--------------|--------------|\n|                       | func           | param        | value        | func          | param        | value        | func           | param        | value        |\n| Airbnb                | 89.3           | 67.9         | 53.6         | 92.8          | 60.7         | 50.0         | 96.4           | 67.9         | 50.0         |\n| Amap-Maps             | 79.8           | 77.5         | 50.0         | 74.4          | 72.0         | 41.0         | 89.3           | 85.7         | 39.9         |\n| Arxiv-MCP-Server      | 85.7           | 85.7         | 85.7         | 81.8          | 54.5         | 50.0         | 57.1           | 57.1         | 52.4         |\n| Calculator            | 100.0          | 100.0        | 20.0         | 80.0          | 80.0         | 13.3         | 100.0          | 100.0        | 6.67         |\n| Computor-Control-MCP  | 90.0           | 90.0         | 90.0         | 90.0          | 90.0         | 90.0         | 90.0           | 90.0         | 86.7         |\n| Desktop-Commander     | 100.0          | 100.0        | 100.0        | 100.0         | 100.0        | 100.0        | 100.0          | 100.0        | 100.0        |\n| Filesystem            | 63.5           | 63.5         | 31.3         | 69.7          | 69.7         | 26.0         | 83.3           | 83.3         | 42.7         |\n|Github | 92.0 | 80.0 | 58.0 | 80.5 | 50.0 | 27.7 | 62.8 | 25.7 | 17.1 |\n| Gaode                 | 71.1           | 55.6         | 17.8         | 68.8          | 46.6         | 24.4         | 68.9           | 46.7         | 15.6         |\n| MCP-Code-Executor     | 85.0           | 80.0         | 70.0         | 80.0          | 80.0         | 70.0         | 90.0           | 90.0         | 65.0         |\n| MCP-Docx              | 95.8           | 86.7         | 67.1         | 94.9          | 81.6         | 60.1         | 95.1           | 86.6         | 76.1         |\n| PPT                   | 72.6           | 49.8         | 40.9         | 85.9          | 50.7         | 37.5         | 91.2           | 72.1         | 56.7         |\n| PPTx                  | 64.2           | 53.7         | 13.4         | 91.0          | 68.6         | 20.9         | 91.0           | 58.2         | 26.9         |\n| Simple-Time-Server    | 90.0           | 70.0         | 70.0         | 90.0          | 90.0         | 90.0         | 90.0           | 60.0         | 60.0         |\n| Slack                 | 100.0          | 90.0         | 70.0         | 100.0         | 100.0        | 65.0         | 100.0          | 100.0        | 100.0        |\n| Whisper               | 90.0           | 90.0         | 90.0         | 90.0          | 90.0         | 90.0         | 90.0           | 90.0         | 30.0         |\n| **平均值**              | **80.2**       | **70.2**     | **49.1**     | **83.5**      | **67.7**     | **43.8**     | **88.3**       | **76.1**     | **51.2**     |\n\n#### MiniCPM 英特尔 AIPC 客户端：新一代边缘大模型强者\n\n由面壁智能与英特尔合作开发的 MiniCPM Intel AIPC 客户端，是一款专为搭载英特尔酷睿 Ultra 系列处理器的设备设计的边缘大模型客户端。它为开发者、研究人员和 AI 爱好者提供低延迟、高效率且保护隐私的本地大模型体验。其核心特性包括：\n\n##### 核心特性  \n- 深度英特尔硬件适配  \n完全兼容英特尔酷睿 Ultra 系列处理器，实现与硬件的深度集成，释放峰值性能。用户无需依赖云服务，即可在本地设备上流畅运行大模型。  \n\n- 基于 OpenVINO 的极致优化  \n通过 OpenVINO 推理框架进行深度优化，显著提升推理效率，最高可达 **每秒 80 个 token**。这确保了模型在快速查询和复杂任务处理时都能迅速响应。  \n\n- 隐私与安全保障  \n采用本地部署方式，所有数据处理均在设备端完成，彻底消除因上传云端带来的隐私风险。尤其对于对数据隐私要求较高的场景，用户可更加安心使用。  \n\n- 
满足多样化用户群体需求  \n无论是追求前沿技术的开发者、专注于学术研究的科研人员，还是热衷于探索 AI 应用的爱好者，MiniCPM Intel AIPC 客户端都能帮助他们轻松接入本地大模型的强大能力，开启个性化的 AI 探索之旅。  \n\n##### 系统要求  \n- 推荐处理器：英特尔酷睿 Ultra 7 或更高（移动版）  \n- 推荐内存：32GB 或以上  \n\n##### 下载\n\n[下载](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM\u002Freleases\u002Ftag\u002F2.4.2)\n\n\u003C\u002Fdetails>\n\n\n\n\n## 许可协议\n\n#### 模型许可协议\n\n* 本仓库及 MiniCPM 模型均遵循 [Apache-2.0](https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM\u002Fblob\u002Fmain\u002FLICENSE) 许可协议发布。 \n\n#### 声明\n\n* 作为一款语言模型，MiniCPM 通过学习海量文本生成内容。  \n* 然而，它并不具备理解或表达个人观点或价值判断的能力。  \n* MiniCPM 生成的任何内容均不代表模型开发者的观点或立场。  \n* 因此，在使用 MiniCPM 生成的内容时，用户应自行承担评估和验证的责任。\n\n## 合作机构\n\n本项目由以下机构共同开发：\n\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_8eaab2faaba2.png\" width=\"28px\"> [Modelbest Inc.](https:\u002F\u002Fmodelbest.cn\u002F)\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_13da7b2eabfc.png\" width=\"28px\"> [THUNLP](https:\u002F\u002Fnlp.csai.tsinghua.edu.cn\u002F)\n- \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_readme_d9d8c88f8a4e.png\" width=\"28px\"> [中国人民大学高瓴人工智能学院](https:\u002F\u002Flinyankai.github.io\u002F)\n\n## 引用\n\n* 如果您认为我们的工作有价值，请引用我们的论文：[MiniCPM4](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.07900)。\n\n```\n@article{minicpm4,\n  title={Minicpm4: 超高效终端设备端大模型},\n  author={MiniCPM 团队},\n  journal={arXiv 预印本 arXiv:2506.07900},\n  year={2025}\n}\n```","# MiniCPM 快速上手指南\n\nMiniCPM 是由 OpenBMB 推出的高效端侧大模型系列，最新发布的 **MiniCPM4.1** 和 **MiniCPM-SALA** 在推理速度、长文本处理及混合推理能力上表现卓越。本指南将帮助您快速部署并使用 MiniCPM4.1-8B 模型。\n\n## 1. 环境准备\n\n### 系统要求\n*   **操作系统**: Linux (推荐 Ubuntu 20.04+), macOS, Windows (WSL2)\n*   **GPU**: NVIDIA GPU (显存建议 ≥ 16GB 用于 FP16 推理，≥ 8GB 用于量化版本)，支持 CUDA 11.8+\n*   **内存**: 系统内存 ≥ 16GB\n*   **Python**: 3.9 - 3.11\n\n### 前置依赖\n确保已安装以下基础工具：\n*   Git\n*   CUDA Toolkit (如使用 GPU 加速)\n*   pip 或 conda (包管理工具)\n\n## 2. 安装步骤\n\n推荐使用国内镜像源（如 ModelScope 或清华源）以加速下载。\n\n### 方案 A：使用 HuggingFace\u002FModelScope + Transformers (通用)\n\n1.  **创建虚拟环境**\n    ```bash\n    conda create -n minicpm python=3.10\n    conda activate minicpm\n    ```\n\n2.  **安装核心依赖**\n    ```bash\n    pip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n    pip install transformers accelerate sentencepiece protobuf\n    ```\n\n3.  **下载模型 (推荐国内开发者使用 ModelScope)**\n    首先安装 ModelScope 客户端：\n    ```bash\n    pip install modelscope\n    ```\n    \n    使用 Python 脚本下载 `MiniCPM4.1-8B`：\n    ```python\n    from modelscope import snapshot_download\n    model_dir = snapshot_download('OpenBMB\u002FMiniCPM4.1-8B', cache_dir='.\u002Fmodels')\n    print(f\"Model downloaded to: {model_dir}\")\n    ```\n\n### 方案 B：使用 vLLM 高性能推理 (生产环境推荐)\n\n若需高吞吐推理，建议安装适配 EAGLE3  speculative decoding 的 vLLM 版本。\n\n1.  **安装兼容版本的 vLLM**\n    ```bash\n    pip install vllm\n    # 注意：如需完整支持 MiniCPM4.1 的投机采样特性，请参考官方文档安装特定 commit 版本的 vLLM\n    ```\n\n2.  **下载模型**\n    同上，使用 ModelScope 或 HuggingFace 下载模型权重。\n\n## 3. 
基本使用\n\n### 方式一：使用 Transformers 原生加载 (最简单)\n\n以下代码演示如何加载模型并进行简单的对话生成。\n\n```python\nimport torch\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\n\n# 设置模型路径 (如果使用 ModelScope 下载，填写本地路径)\nmodel_path = \".\u002Fmodels\u002FOpenBMB\u002FMiniCPM4.1-8B\" \n# 或者直接使用 HuggingFace ID: \"openbmb\u002FMiniCPM4.1-8B\"\n\n# 加载分词器和模型\ntokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_path, \n    torch_dtype=torch.bfloat16, \n    device_map=\"auto\", \n    trust_remote_code=True\n)\n\n# 准备输入提示\nprompt = \"你好，请介绍一下你自己。\"\nmessages = [\n    {\"role\": \"user\", \"content\": prompt}\n]\n\n# 应用聊天模板\ntext = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)\ninputs = tokenizer([text], return_tensors=\"pt\").to(model.device)\n\n# 生成回复\noutputs = model.generate(\n    **inputs, \n    max_new_tokens=512, \n    temperature=0.7, \n    top_p=0.9, \n    do_sample=True\n)\n\n# 解码输出\nresponse = tokenizer.decode(outputs[0][len(inputs.input_ids[0]):], skip_special_tokens=True)\nprint(response)\n```\n\n### 方式二：使用 vLLM 启动服务 (高性能)\n\n通过命令行启动 API 服务，支持高并发请求。\n\n```bash\npython -m vllm.entrypoints.api_server \\\n    --model .\u002Fmodels\u002FOpenBMB\u002FMiniCPM4.1-8B \\\n    --trust-remote-code \\\n    --dtype bfloat16 \\\n    --port 8000\n```\n\n**客户端调用示例：**\n```bash\ncurl http:\u002F\u002Flocalhost:8000\u002Fv1\u002Fcompletions \\\n    -H \"Content-Type: application\u002Fjson\" \\\n    -d '{\n        \"model\": \".\u002Fmodels\u002FOpenBMB\u002FMiniCPM4.1-8B\",\n        \"prompt\": \"你好，MiniCPM！\",\n        \"max_tokens\": 128\n    }'\n```\n\n### 方式三：端侧轻量级运行 (llama.cpp \u002F Ollama)\n\n对于资源受限的设备，可使用量化版本 (GGUF)。\n\n1.  **下载 GGUF 模型**: 从 HuggingFace 或 ModelScope 下载 `MiniCPM4.1-8B-GGUF`。\n2.  **使用 llama.cpp 运行**:\n    ```bash\n    .\u002Fmain -m .\u002Fminicpm4.1-8b-q4_k_m.gguf -p \"你好，请写一首关于春天的诗。\" -n 256\n    ```\n3.  
**使用 Ollama 运行**:\n    导入 Modelfile 后执行：\n    ```bash\n    ollama run minicpm4.1\n    ```\n\n---\n*注：更多高级功能（如混合推理模式、百万字长上下文处理、投机采样配置）请参考官方 Wiki 或 GitHub 仓库详细文档。*","某初创团队开发了一款面向户外作业人员的离线智能助手 App，需在无网络环境下通过普通安卓手机实时处理复杂的设备故障排查指令。\n\n### 没有 MiniCPM 时\n- **响应严重滞后**：在骁龙 7 系列中端芯片上运行通用大模型，生成一步推理步骤需耗时 3-5 秒，用户等待焦虑感极强。\n- **电量消耗过快**：高算力负载导致手机发热严重，电池在半小时内耗尽，无法满足全天野外作业需求。\n- **必须依赖云端**：因本地无法流畅运行大参数模型，不得不保留云端接口，但在信号弱的山区经常请求超时或失败。\n- **复杂逻辑易中断**：面对多步故障树分析，长上下文推理常因显存溢出（OOM）而崩溃，无法完成完整诊断。\n\n### 使用 MiniCPM 后\n- **推理速度飞跃**：利用 MiniCPM4.1 的端侧优化特性，在同一设备上生成速度提升 3 倍以上，实现“话音落、答案出”的流畅体验。\n- **能效显著优化**：稀疏注意力机制大幅降低计算功耗，连续使用两小时手机仅微温，电量剩余仍超 60%。\n- **纯离线稳定运行**：模型完美适配端侧硬件，彻底切断对云端的依赖，即使在深山无信号区域也能提供专家级指导。\n- **长程推理不崩溃**：支持百万级 token 上下文建模，能完整记忆并处理长达数十页的设备维修手册，精准定位隐蔽故障。\n\nMiniCPM 让高性能大模型真正从云端走下神坛，成为普通终端设备上随时待命的智能专家。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenBMB_MiniCPM_962727b6.png","OpenBMB","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FOpenBMB_02e4bd39.png","OpenBMB (Open Lab for Big Model Base) aims to build foundation models and systems towards AGI.",null,"openbmb@gmail.com","https:\u002F\u002Fwww.openbmb.cn","https:\u002F\u002Fgithub.com\u002FOpenBMB",[84,88,92],{"name":85,"color":86,"percentage":87},"Jupyter Notebook","#DA5B0B",56.1,{"name":89,"color":90,"percentage":91},"Python","#3572A5",40,{"name":93,"color":94,"percentage":95},"Shell","#89e051",3.9,8790,564,"2026-04-11T07:45:12","Apache-2.0",4,"Linux, macOS, Windows","推理可选：支持 NVIDIA GPU（配合 vLLM\u002FSGLang\u002FCPM.cu，需 CUDA）、Apple Silicon（配合 MLX）、Intel AIPC；特定优化竞赛提及 NVIDIA 6000D；量化版本（GGUF\u002FGPTQ\u002FAWQ）可降低显存需求以适配端侧芯片","未说明（取决于模型规模及是否量化，端侧版本如 0.5B\u002F1B 旨在低内存运行）",{"notes":105,"python":106,"dependencies":107},"该工具提供多种推理后端以适应不同硬件：NVIDIA GPU 推荐使用 vLLM 或 SGLang（支持投机解码加速）；Mac 用户可使用 MLX 版本；端侧或低资源环境推荐使用 llama.cpp、Ollama 或 GGUF\u002FGPTQ\u002FAWQ 量化版本；另有专为 Intel AIPC 优化的客户端版本。","未说明",[108,109,110,111,112,113,114,115],"vLLM","SGLang","llama.cpp","Ollama","MLX","HuggingFace Transformers","CPM.cu","EAGLE3",[15],"2026-03-27T02:49:30.150509","2026-04-11T21:38:30.924652",[120,125,130,135,140,145],{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},29825,"如何在 Ollama 或 llama.cpp 中运行 MiniCPM 模型？遇到 'unknown model architecture' 错误怎么办？","官方 llama.cpp 版本已经支持 MiniCPM3-4B。如果遇到 'unknown model architecture: minicpm3' 错误，请确保您使用的是最新的 llama.cpp 官方版本（C++ 版本），而不是 python 版本的 llama-cpp-python（该版本可能暂不支持）。您可以从 HuggingFace 下载已转换好的 GGUF 版本模型：https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM3-4B-GGUF","https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM\u002Fissues\u002F204",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},29826,"如何在 LM Studio 中加载和运行 MiniCPM 的 GGUF 模型？","LM Studio 0.2.16 及以上版本通常可以正常运行 MiniCPM 的 GGUF 版本模型。即使在界面上显示 'Unsupported Architecture'（不支持的架构），这通常不影响实际使用。如果仍无法运行，建议尝试更新到最新版的 llama.cpp 核心，或者等待 LM Studio 的后续更新，因为该架构可能需要特定的后端支持。","https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM\u002Fissues\u002F57",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},29827,"如何使用 fastllm 部署 MiniCPM 模型？遇到 Segmentation fault 或输出不正常如何解决？","社区已提交 PR (ztxz16\u002Ffastllm#423) 进行适配。如果在纯 CPU 模式下输出不正常，请确保使用了修复 bug 后的最新版本。如果在低内存模式下运行出现 'Segmentation fault' 错误，这是已知问题，开发团队正在定位并修复。建议暂时避免在低内存模式下运行，或使用标准的 float16 模式进行测试。转换脚本可参考 fastllm\u002Ftools\u002Fscripts 下的相关实现。","https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM\u002Fissues\u002F60",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},29828,"微调 MiniCPM 时遇到 CUDA Out of Memory (显存不足) 错误如何解决？","在使用 RTX 4090 (24G) 等显卡微调时，如果发生显存溢出，建议调整 DeepSpeed 配置。有用户反馈使用 
`--deepspeed configs\u002Fds_config_zero3.json` (ZeRO-3 策略) 可以成功运行，而某些 offload 配置可能导致问题。请检查您的 ds_config 文件，确保启用了适当的参数卸载（Offload）或分片策略，并确认 `per_device_train_batch_size` 设置较小（如 1）。","https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM\u002Fissues\u002F78",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},29829,"将模型转换为 MLC-LLM 格式时遇到 TVM InternalError 或 Vulkan\u002FOpenCL 设备检测错误怎么办？","此类错误可能与 MLC-LLM 的特定版本或环境配置有关。如果您安装了 OpenCL 库但系统仍检测到 Vulkan 驱动导致报错，即使卸载了 Vulkan 驱动问题依旧，这通常是底层 TVM 编译或设备探测逻辑的问题。建议参考 MLC-LLM 的相关 Issue (如 #1606) 寻找解决方案，或尝试重新构建启用特定后端（如仅 OpenCL）的 TVM 版本。目前该问题较难通过简单配置解决，可能需要等待上游修复。","https:\u002F\u002Fgithub.com\u002FOpenBMB\u002FMiniCPM\u002Fissues\u002F80",{"id":146,"question_zh":147,"answer_zh":148,"source_url":139},29830,"微调后的 MiniCPM 模型效果很差，输出不理想，有什么改进建议？","微调效果差可能与数据集质量、超参数设置或基座模型版本有关。建议检查训练数据格式是否正确（如 ChatML 格式），并尝试调整学习率（如 1e-3 可能过大，可尝试更小值）和 warmup 步骤。此外，确保使用的基座模型是最新的稳定版本。社区中有用户通过交流具体的 ds_config 配置和数据处理方式来改进效果，建议参考成功案例的配置进行对比调试。",[150],{"id":151,"version":152,"summary_zh":153,"released_at":154},206412,"2.4.2","主要功能：\n- 支持与模型进行文本及图片对话\n- 支持调用Intel集成显卡加速\n\n支持模型：\n- MiniCPM 4.0 8B & 0.5B\n- MiniCPM 3.0 4B\n- MiniCPM-V 2.6 8B（多模态）\n- MiniCPM-V 2.0 2.8B（多模态）\n- MiniCPM-2B-128K\n- MiniCPM-1B-SFT-BF16\n\n配置要求：\n- 建议使用英特尔酷睿 Ultra7 及以上移动端处理器\n- 建议运行内存 32GB 及以上","2025-07-01T06:19:46"]