[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-li-plus--chatglm.cpp":3,"tool-li-plus--chatglm.cpp":65},[4,17,27,35,48,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",151918,2,"2026-04-12T11:33:05",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":10,"last_commit_at":33,"category_tags":34,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 
都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":10,"last_commit_at":41,"category_tags":42,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,"2026-04-10T11:13:16",[26,43,44,45,14,46,15,13,47],"数据工具","视频","插件","其他","音频",{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":54,"last_commit_at":55,"category_tags":56,"status":16},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 
社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[15,43,46],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":54,"last_commit_at":63,"category_tags":64,"status":16},6590,"gpt4all","nomic-ai\u002Fgpt4all","GPT4All 是一款让普通电脑也能轻松运行大型语言模型（LLM）的开源工具。它的核心目标是打破算力壁垒，让用户无需依赖昂贵的显卡（GPU）或云端 API，即可在普通的笔记本电脑和台式机上私密、离线地部署和使用大模型。\n\n对于担心数据隐私、希望完全掌控本地数据的企业用户、研究人员以及技术爱好者来说，GPT4All 提供了理想的解决方案。它解决了传统大模型必须联网调用或需要高端硬件才能运行的痛点，让日常设备也能成为强大的 AI 助手。无论是希望构建本地知识库的开发者，还是单纯想体验私有化 AI 聊天的普通用户，都能从中受益。\n\n技术上，GPT4All 基于高效的 `llama.cpp` 后端，支持多种主流模型架构（包括最新的 DeepSeek R1 蒸馏模型），并采用 GGUF 格式优化推理速度。它不仅提供界面友好的桌面客户端，支持 Windows、macOS 和 Linux 等多平台一键安装，还为开发者提供了便捷的 Python 库，可轻松集成到 LangChain 等生态中。通过简单的下载和配置，用户即可立即开始探索本地大模型的无限可能。",77307,"2026-04-11T06:52:37",[15,13],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":71,"readme_en":72,"readme_zh":73,"quickstart_zh":74,"use_case_zh":75,"hero_image_url":76,"owner_login":77,"owner_name":78,"owner_avatar_url":79,"owner_bio":80,"owner_company":81,"owner_location":82,"owner_email":83,"owner_twitter":84,"owner_website":85,"owner_url":86,"languages":87,"stars":108,"forks":109,"last_commit_at":110,"license":111,"difficulty_score":112,"env_os":113,"env_gpu":114,"env_ram":115,"env_deps":116,"category_tags":127,"github_topics":128,"view_count":10,"oss_zip_url":84,"oss_zip_packed_at":84,"status":16,"created_at":136,"updated_at":137,"faqs":138,"releases":178},6875,"li-plus\u002Fchatglm.cpp","chatglm.cpp","C++ implementation of ChatGLM-6B & ChatGLM2-6B & ChatGLM3 & GLM4(V)","chatglm.cpp 是一个基于 C++ 开发的高效推理框架，专为在个人电脑（如 MacBook、Windows 或 Linux 主机）上本地运行清华智谱 AI 系列的 ChatGLM 及 GLM-4 大模型而设计。它解决了大型语言模型通常依赖昂贵 GPU 服务器、难以在普通硬件上流畅运行的痛点，让用户无需联网即可在本地实现隐私安全、低延迟的实时对话体验。\n\n这款工具非常适合希望在本地部署大模型的开发者、研究人员，以及关注数据隐私的普通用户。其核心亮点在于纯 C++ 实现并依托 ggml 库，能够像 llama.cpp 一样高效工作。通过支持 int4\u002Fint8 量化技术，它大幅降低了内存占用并加速了 CPU 推理过程，同时优化了键值缓存与并行计算能力。此外，chatglm.cpp 还兼容 P-Tuning v2 和 LoRA 微调模型，支持打字机效果的流式输出，并提供 Python 绑定、Web 演示及 API 服务等多种扩展方式。无论是 
x86\u002FARM 架构的 CPU，还是 NVIDIA 与 Apple Silicon 的 GPU，都能获得良好的性能支持，让大模型应用真正变得轻量","chatglm.cpp 是一个基于 C++ 开发的高效推理框架，专为在个人电脑（如 MacBook、Windows 或 Linux 主机）上本地运行清华智谱 AI 系列的 ChatGLM 及 GLM-4 大模型而设计。它解决了大型语言模型通常依赖昂贵 GPU 服务器、难以在普通硬件上流畅运行的痛点，让用户无需联网即可在本地实现隐私安全、低延迟的实时对话体验。\n\n这款工具非常适合希望在本地部署大模型的开发者、研究人员，以及关注数据隐私的普通用户。其核心亮点在于纯 C++ 实现并依托 ggml 库，能够像 llama.cpp 一样高效工作。通过支持 int4\u002Fint8 量化技术，它大幅降低了内存占用并加速了 CPU 推理过程，同时优化了键值缓存与并行计算能力。此外，chatglm.cpp 还兼容 P-Tuning v2 和 LoRA 微调模型，支持打字机效果的流式输出，并提供 Python 绑定、Web 演示及 API 服务等多种扩展方式。无论是 x86\u002FARM 架构的 CPU，还是 NVIDIA 与 Apple Silicon 的 GPU，都能获得良好的性能支持，让大模型应用真正变得轻量且触手可及。","# ChatGLM.cpp\n\n[![CMake](https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Factions\u002Fworkflows\u002Fcmake.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Factions\u002Fworkflows\u002Fcmake.yml)\n[![Python package](https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Factions\u002Fworkflows\u002Fpython-package.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Factions\u002Fworkflows\u002Fpython-package.yml)\n[![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fchatglm-cpp)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fchatglm-cpp\u002F)\n![Python](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fchatglm-cpp)\n[![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-blue)](LICENSE)\n\nC++ implementation of [ChatGLM-6B](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM-6B), [ChatGLM2-6B](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM2-6B), [ChatGLM3](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM3) and [GLM-4](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FGLM-4)(V) for real-time chatting on your MacBook.\n\n![demo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fli-plus_chatglm.cpp_readme_bc25daa9a7a3.gif)\n\n## Features\n\nHighlights:\n* Pure C++ implementation based on 
[ggml](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fggml), working in the same way as [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp).\n* Accelerated memory-efficient CPU inference with int4\u002Fint8 quantization, optimized KV cache and parallel computing.\n* P-Tuning v2 and LoRA finetuned models support.\n* Streaming generation with typewriter effect.\n* Python binding, web demo, api servers and more possibilities.\n\nSupport Matrix:\n* Hardwares: x86\u002Farm CPU, NVIDIA GPU, Apple Silicon GPU\n* Platforms: Linux, MacOS, Windows\n* Models: [ChatGLM-6B](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM-6B), [ChatGLM2-6B](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM2-6B), [ChatGLM3](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM3), [GLM-4](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FGLM-4)(V), [CodeGeeX2](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCodeGeeX2)\n\n## Getting Started\n\n**Preparation**\n\nClone the ChatGLM.cpp repository into your local machine:\n```sh\ngit clone --recursive https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp.git && cd chatglm.cpp\n```\n\nIf you forgot the `--recursive` flag when cloning the repository, run the following command in the `chatglm.cpp` folder:\n```sh\ngit submodule update --init --recursive\n```\n\n**Quantize Model**\n\nInstall necessary packages for loading and quantizing Hugging Face models:\n```sh\npython3 -m pip install -U pip\npython3 -m pip install torch tabulate tqdm transformers accelerate sentencepiece\n```\n\nUse `convert.py` to transform ChatGLM-6B into quantized GGML format. For example, to convert the fp16 original model to q4_0 (quantized int4) GGML model, run:\n```sh\npython3 chatglm_cpp\u002Fconvert.py -i THUDM\u002Fchatglm-6b -t q4_0 -o models\u002Fchatglm-ggml.bin\n```\n\nThe original model (`-i \u003Cmodel_name_or_path>`) can be a Hugging Face model name or a local path to your pre-downloaded model. 
Currently supported models are:\n* ChatGLM-6B: `THUDM\u002Fchatglm-6b`, `THUDM\u002Fchatglm-6b-int8`, `THUDM\u002Fchatglm-6b-int4`\n* ChatGLM2-6B: `THUDM\u002Fchatglm2-6b`, `THUDM\u002Fchatglm2-6b-int4`, `THUDM\u002Fchatglm2-6b-32k`, `THUDM\u002Fchatglm2-6b-32k-int4`\n* ChatGLM3-6B: `THUDM\u002Fchatglm3-6b`, `THUDM\u002Fchatglm3-6b-32k`, `THUDM\u002Fchatglm3-6b-128k`, `THUDM\u002Fchatglm3-6b-base`\n* ChatGLM4(V)-9B: `THUDM\u002Fglm-4-9b-chat`, `THUDM\u002Fglm-4-9b-chat-1m`, `THUDM\u002Fglm-4-9b`, `THUDM\u002Fglm-4v-9b`\n* CodeGeeX2: `THUDM\u002Fcodegeex2-6b`, `THUDM\u002Fcodegeex2-6b-int4`\n\nYou are free to try any of the below quantization types by specifying `-t \u003Ctype>`:\n| type   | precision | symmetric |\n| ------ | --------- | --------- |\n| `q4_0` | int4      | true      |\n| `q4_1` | int4      | false     |\n| `q5_0` | int5      | true      |\n| `q5_1` | int5      | false     |\n| `q8_0` | int8      | true      |\n| `f16`  | half      |           |\n| `f32`  | float     |           |\n\nFor LoRA models, add `-l \u003Clora_model_name_or_path>` flag to merge your LoRA weights into the base model. For example, run `python3 chatglm_cpp\u002Fconvert.py -i THUDM\u002Fchatglm3-6b -t q4_0 -o models\u002Fchatglm3-ggml-lora.bin -l shibing624\u002Fchatglm3-6b-csc-chinese-lora` to merge public LoRA weights from Hugging Face.\n\nFor P-Tuning v2 models using the [official finetuning script](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM3\u002Ftree\u002Fmain\u002Ffinetune_demo), additional weights are automatically detected by `convert.py`. 
If `past_key_values` is on the output weight list, the P-Tuning checkpoint is successfully converted.\n\n**Build & Run**\n\nCompile the project using CMake:\n```sh\ncmake -B build\ncmake --build build -j --config Release\n```\n\nNow you may chat with the quantized ChatGLM-6B model by running:\n```sh\n.\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm-ggml.bin -p 你好\n# 你好👋！我是人工智能助手 ChatGLM-6B，很高兴见到你，欢迎问我任何问题。\n```\n\nTo run the model in interactive mode, add the `-i` flag. For example:\n```sh\n.\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm-ggml.bin -i\n```\nIn interactive mode, your chat history will serve as the context for the next-round conversation.\n\nRun `.\u002Fbuild\u002Fbin\u002Fmain -h` to explore more options!\n\n**Try Other Models**\n\n\u003Cdetails open>\n\u003Csummary>ChatGLM2-6B\u003C\u002Fsummary>\n\n```sh\npython3 chatglm_cpp\u002Fconvert.py -i THUDM\u002Fchatglm2-6b -t q4_0 -o models\u002Fchatglm2-ggml.bin\n.\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm2-ggml.bin -p 你好 --top_p 0.8 --temp 0.8\n# 你好👋！我是人工智能助手 ChatGLM2-6B，很高兴见到你，欢迎问我任何问题。\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails open>\n\u003Csummary>ChatGLM3-6B\u003C\u002Fsummary>\n\nChatGLM3-6B further supports function call and code interpreter in addition to chat mode.\n\nChat mode:\n```sh\npython3 chatglm_cpp\u002Fconvert.py -i THUDM\u002Fchatglm3-6b -t q4_0 -o models\u002Fchatglm3-ggml.bin\n.\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm3-ggml.bin -p 你好 --top_p 0.8 --temp 0.8\n# 你好👋！我是人工智能助手 ChatGLM3-6B，很高兴见到你，欢迎问我任何问题。\n```\n\nSetting system prompt:\n```sh\n.\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm3-ggml.bin -p 你好 -s \"You are ChatGLM3, a large language model trained by Zhipu.AI. Follow the user's instructions carefully. 
Respond using markdown.\"\n# 你好👋！我是 ChatGLM3，有什么问题可以帮您解答吗？\n```\n\nFunction call:\n~~~\n$ .\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm3-ggml.bin --top_p 0.8 --temp 0.8 --sp examples\u002Fsystem\u002Ffunction_call.txt -i\nSystem   > Answer the following questions as best as you can. You have access to the following tools: ...\nPrompt   > 生成一个随机数\nChatGLM3 > random_number_generator\n```python\ntool_call(seed=42, range=(0, 100))\n```\nTool Call   > Please manually call function `random_number_generator` with args `tool_call(seed=42, range=(0, 100))` and provide the results below.\nObservation > 23\nChatGLM3 > 根据您的要求，我使用随机数生成器API生成了一个随机数。根据API返回结果，生成的随机数为23。\n~~~\n\nCode interpreter:\n~~~\n$ .\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm3-ggml.bin --top_p 0.8 --temp 0.8 --sp examples\u002Fsystem\u002Fcode_interpreter.txt -i\nSystem   > 你是一位智能AI助手，你叫ChatGLM，你连接着一台电脑，但请注意不能联网。在使用Python解决任务时，你可以运行代码并得到结果，如果运行结果有错误，你需要尽可能对代码进行改进。你可以处理用户上传到电脑上的文件，文件默认存储路径是\u002Fmnt\u002Fdata\u002F。\nPrompt   > 列出100以内的所有质数\nChatGLM3 > 好的，我会为您列出100以内的所有质数。\n```python\ndef is_prime(n):\n   \"\"\"Check if a number is prime.\"\"\"\n   if n \u003C= 1:\n       return False\n   if n \u003C= 3:\n       return True\n   if n % 2 == 0 or n % 3 == 0:\n       return False\n   i = 5\n   while i * i \u003C= n:\n       if n % i == 0 or n % (i + 2) == 0:\n           return False\n       i += 6\n   return True\n\nprimes_upto_100 = [i for i in range(2, 101) if is_prime(i)]\nprimes_upto_100\n```\n\nCode Interpreter > Please manually run the code and provide the results below.\nObservation      > [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]\nChatGLM3 > 100以内的所有质数为：\n\n$$\n2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97 \n$$\n~~~\n\n\u003C\u002Fdetails>\n\n\u003Cdetails open>\n\u003Csummary>ChatGLM4-9B\u003C\u002Fsummary>\n\nChat mode:\n```sh\npython3 chatglm_cpp\u002Fconvert.py -i 
THUDM\u002Fglm-4-9b-chat -t q4_0 -o models\u002Fchatglm4-ggml.bin\n.\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm4-ggml.bin -p 你好 --top_p 0.8 --temp 0.8\n# 你好👋！有什么可以帮助你的吗？\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails open>\n\u003Csummary>ChatGLM4V-9B\u003C\u002Fsummary>\n\n[![03-Confusing-Pictures](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fli-plus_chatglm.cpp_readme_f52c51021b38.jpg)](https:\u002F\u002Fwww.barnorama.com\u002Fwp-content\u002Fuploads\u002F2016\u002F12\u002F03-Confusing-Pictures.jpg)\n\nYou may use `-vt \u003Cvision_type>` to set quantization type for the vision encoder. It is recommended to run GLM4V on GPU since vision encoding runs too slow on CPU even with 4-bit quantization.\n```sh\npython3 chatglm_cpp\u002Fconvert.py -i THUDM\u002Fglm-4v-9b -t q4_0 -vt q4_0 -o models\u002Fchatglm4v-ggml.bin\n.\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm4v-ggml.bin --image https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fli-plus_chatglm.cpp_readme_f52c51021b38.jpg -p \"这张图片有什么不寻常的地方\" --temp 0\n# 这张图片中不寻常的地方在于，男子正在一辆黄色出租车后面熨衣服。通常情况下，熨衣是在家中或洗衣店进行的，而不是在车辆上。此外，出租车在行驶中，男子却能够稳定地熨衣，这增加了场景的荒诞感。\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>CodeGeeX2\u003C\u002Fsummary>\n\n```sh\n$ python3 chatglm_cpp\u002Fconvert.py -i THUDM\u002Fcodegeex2-6b -t q4_0 -o models\u002Fcodegeex2-ggml.bin\n$ .\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fcodegeex2-ggml.bin --temp 0 --mode generate -p \"\\\n# language: Python\n# write a bubble sort function\n\"\n\n\ndef bubble_sort(lst):\n    for i in range(len(lst) - 1):\n        for j in range(len(lst) - 1 - i):\n            if lst[j] > lst[j + 1]:\n                lst[j], lst[j + 1] = lst[j + 1], lst[j]\n    return lst\n\n\nprint(bubble_sort([5, 4, 3, 2, 1]))\n```\n\u003C\u002Fdetails>\n\n## Using BLAS\n\nBLAS library can be integrated to further accelerate matrix multiplication. However, in some cases, using BLAS may cause performance degradation. 
Decide whether to enable BLAS based on benchmark results for your setup.\n\n**Accelerate Framework**\n\nAccelerate Framework is automatically enabled on macOS. To disable it, add the CMake flag `-DGGML_NO_ACCELERATE=ON`.\n\n**OpenBLAS**\n\nOpenBLAS provides acceleration on CPU. Add the CMake flag `-DGGML_OPENBLAS=ON` to enable it.\n```sh\ncmake -B build -DGGML_OPENBLAS=ON && cmake --build build -j\n```\n\n**CUDA**\n\nCUDA accelerates model inference on NVIDIA GPUs. Add the CMake flag `-DGGML_CUDA=ON` to enable it.\n```sh\ncmake -B build -DGGML_CUDA=ON && cmake --build build -j\n```\n\nBy default, all kernels are compiled for every supported CUDA architecture, which takes some time. To run on a specific type of device, you may specify `CMAKE_CUDA_ARCHITECTURES` to speed up the nvcc compilation. For example:\n```sh\ncmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=\"80\"       # for A100\ncmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=\"70;75\"    # compatible with both V100 and T4\n```\n\nTo find out the CUDA architecture of your GPU device, see [Your GPU Compute Capability](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-gpus).\n\n**Metal**\n\nMPS (Metal Performance Shaders) allows computation to run on Apple Silicon GPUs. Add the CMake flag `-DGGML_METAL=ON` to enable it.\n```sh\ncmake -B build -DGGML_METAL=ON && cmake --build build -j\n```\n\n## Python Binding\n\nThe Python binding provides high-level `chat` and `stream_chat` interfaces similar to the original Hugging Face ChatGLM(2)-6B.\n\n**Installation**\n\nInstall from PyPI (recommended). This triggers compilation on your platform.\n```sh\npip install -U chatglm-cpp\n```\n\nTo enable CUDA on NVIDIA GPU:\n```sh\nCMAKE_ARGS=\"-DGGML_CUDA=ON\" pip install -U chatglm-cpp\n```\n\nTo enable Metal on Apple silicon devices:\n```sh\nCMAKE_ARGS=\"-DGGML_METAL=ON\" pip install -U chatglm-cpp\n```\n\nYou may also install from source. 
Add the corresponding `CMAKE_ARGS` for acceleration.\n```sh\n# install from the latest source hosted on GitHub\npip install git+https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp.git@main\n# or install from your local source after git cloning the repo\npip install .\n```\n\nPre-built wheels for CPU backend on Linux \u002F MacOS \u002F Windows are published on [release](https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Freleases). For CUDA \u002F Metal backends, please compile from source code or source distribution.\n\n**Using Pre-converted GGML Models**\n\nHere is a simple demo that uses `chatglm_cpp.Pipeline` to load the GGML model and chat with it. First enter the examples folder (`cd examples`) and launch a Python interactive shell:\n```python\n>>> import chatglm_cpp\n>>> \n>>> pipeline = chatglm_cpp.Pipeline(\"..\u002Fmodels\u002Fchatglm-ggml.bin\")\n>>> pipeline.chat([chatglm_cpp.ChatMessage(role=\"user\", content=\"你好\")])\nChatMessage(role=\"assistant\", content=\"你好👋！我是人工智能助手 ChatGLM-6B，很高兴见到你，欢迎问我任何问题。\", tool_calls=[])\n```\n\nTo chat in stream, run the below Python example:\n```sh\npython3 cli_demo.py -m ..\u002Fmodels\u002Fchatglm-ggml.bin -i\n```\n\nLaunch a web demo to chat in your browser:\n```sh\npython3 web_demo.py -m ..\u002Fmodels\u002Fchatglm-ggml.bin\n```\n\n![web_demo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fli-plus_chatglm.cpp_readme_3117cb47aac9.jpg)\n\nFor other models:\n\n\u003Cdetails open>\n\u003Csummary>ChatGLM2-6B\u003C\u002Fsummary>\n\n```sh\npython3 cli_demo.py -m ..\u002Fmodels\u002Fchatglm2-ggml.bin -p 你好 --temp 0.8 --top_p 0.8  # CLI demo\npython3 web_demo.py -m ..\u002Fmodels\u002Fchatglm2-ggml.bin --temp 0.8 --top_p 0.8  # web demo\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails open>\n\u003Csummary>ChatGLM3-6B\u003C\u002Fsummary>\n\n**CLI Demo**\n\nChat mode:\n```sh\npython3 cli_demo.py -m ..\u002Fmodels\u002Fchatglm3-ggml.bin -p 你好 --temp 0.8 --top_p 0.8\n```\n\nFunction call:\n```sh\npython3 
cli_demo.py -m ..\u002Fmodels\u002Fchatglm3-ggml.bin --temp 0.8 --top_p 0.8 --sp system\u002Ffunction_call.txt -i\n```\n\nCode interpreter:\n```sh\npython3 cli_demo.py -m ..\u002Fmodels\u002Fchatglm3-ggml.bin --temp 0.8 --top_p 0.8 --sp system\u002Fcode_interpreter.txt -i\n```\n\n**Web Demo**\n\nInstall Python dependencies and the IPython kernel for code interpreter.\n```sh\npip install streamlit jupyter_client ipython ipykernel\nipython kernel install --name chatglm3-demo --user\n```\n\nLaunch the web demo:\n```sh\nstreamlit run chatglm3_demo.py\n```\n\n| Function Call               | Code Interpreter               |\n|-----------------------------|--------------------------------|\n| ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fli-plus_chatglm.cpp_readme_4ae6260fb01d.png) | ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fli-plus_chatglm.cpp_readme_1f8bd8ef7727.png) |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails open>\n\u003Csummary>ChatGLM4-9B\u003C\u002Fsummary>\n\nChat mode:\n```sh\npython3 cli_demo.py -m ..\u002Fmodels\u002Fchatglm4-ggml.bin -p 你好 --temp 0.8 --top_p 0.8\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails open>\n\u003Csummary>ChatGLM4V-9B\u003C\u002Fsummary>\n\nChat mode:\n```sh\npython3 cli_demo.py -m ..\u002Fmodels\u002Fchatglm4v-ggml.bin --image 03-Confusing-Pictures.jpg -p \"这张图片有什么不寻常之处\" --temp 0\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>CodeGeeX2\u003C\u002Fsummary>\n\n```sh\n# CLI demo\npython3 cli_demo.py -m ..\u002Fmodels\u002Fcodegeex2-ggml.bin --temp 0 --mode generate -p \"\\\n# language: Python\n# write a bubble sort function\n\"\n# web demo\npython3 web_demo.py -m ..\u002Fmodels\u002Fcodegeex2-ggml.bin --temp 0 --max_length 512 --mode generate --plain\n```\n\u003C\u002Fdetails>\n\n**Converting Hugging Face LLMs at Runtime**\n\nSometimes it might be inconvenient to convert and save the intermediate GGML models beforehand. 
Here is an option to directly load from the original Hugging Face model, quantize it into GGML models in a minute, and start serving. All you need is to replace the GGML model path with the Hugging Face model name or path.\n```python\n>>> import chatglm_cpp\n>>> \n>>> pipeline = chatglm_cpp.Pipeline(\"THUDM\u002Fchatglm-6b\", dtype=\"q4_0\")\nLoading checkpoint shards: 100%|██████████████████████████████████| 8\u002F8 [00:10\u003C00:00,  1.27s\u002Fit]\nProcessing model states: 100%|████████████████████████████████| 339\u002F339 [00:23\u003C00:00, 14.73it\u002Fs]\n...\n>>> pipeline.chat([chatglm_cpp.ChatMessage(role=\"user\", content=\"你好\")])\nChatMessage(role=\"assistant\", content=\"你好👋！我是人工智能助手 ChatGLM-6B，很高兴见到你，欢迎问我任何问题。\", tool_calls=[])\n```\n\nLikewise, replace the GGML model path with Hugging Face model in any example script, and it just works. For example:\n```sh\npython3 cli_demo.py -m THUDM\u002Fchatglm-6b -p 你好 -i\n```\n\n## API Server\n\nWe support various kinds of API servers to integrate with popular frontends. 
Extra dependencies can be installed by:\n```sh\npip install 'chatglm-cpp[api]'\n```\nRemember to add the corresponding `CMAKE_ARGS` to enable acceleration.\n\n**LangChain API**\n\nStart the api server for LangChain:\n```sh\nMODEL=.\u002Fmodels\u002Fchatglm2-ggml.bin uvicorn chatglm_cpp.langchain_api:app --host 127.0.0.1 --port 8000\n```\n\nTest the api endpoint with `curl`:\n```sh\ncurl http:\u002F\u002F127.0.0.1:8000 -H 'Content-Type: application\u002Fjson' -d '{\"prompt\": \"你好\"}'\n```\n\nRun with LangChain:\n```python\n>>> from langchain.llms import ChatGLM\n>>> \n>>> llm = ChatGLM(endpoint_url=\"http:\u002F\u002F127.0.0.1:8000\")\n>>> llm.predict(\"你好\")\n'你好👋！我是人工智能助手 ChatGLM2-6B，很高兴见到你，欢迎问我任何问题。'\n```\n\nFor more options, please refer to [examples\u002Flangchain_client.py](examples\u002Flangchain_client.py) and [LangChain ChatGLM Integration](https:\u002F\u002Fpython.langchain.com\u002Fdocs\u002Fintegrations\u002Fllms\u002Fchatglm).\n\n**OpenAI API**\n\nStart an API server compatible with [OpenAI chat completions protocol](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fapi-reference\u002Fchat):\n```sh\nMODEL=.\u002Fmodels\u002Fchatglm3-ggml.bin uvicorn chatglm_cpp.openai_api:app --host 127.0.0.1 --port 8000\n```\n\nTest your endpoint with `curl`:\n```sh\ncurl http:\u002F\u002F127.0.0.1:8000\u002Fv1\u002Fchat\u002Fcompletions -H 'Content-Type: application\u002Fjson' \\\n    -d '{\"messages\": [{\"role\": \"user\", \"content\": \"你好\"}]}'\n```\n\nUse the OpenAI client to chat with your model:\n```python\n>>> from openai import OpenAI\n>>> \n>>> client = OpenAI(base_url=\"http:\u002F\u002F127.0.0.1:8000\u002Fv1\")\n>>> response = client.chat.completions.create(model=\"default-model\", messages=[{\"role\": \"user\", \"content\": \"你好\"}])\n>>> response.choices[0].message.content\n'你好👋！我是人工智能助手 ChatGLM3-6B，很高兴见到你，欢迎问我任何问题。'\n```\n\nFor stream response, check out the example client script:\n```sh\npython3 examples\u002Fopenai_client.py --base_url 
http:\u002F\u002F127.0.0.1:8000\u002Fv1 --stream --prompt 你好\n```\n\nTool calling is also supported:\n```sh\npython3 examples\u002Fopenai_client.py --base_url http:\u002F\u002F127.0.0.1:8000\u002Fv1 --tool_call --prompt 上海天气怎么样\n```\n\nRequest GLM4V with image inputs:\n```sh\n# request with local image file\npython3 examples\u002Fopenai_client.py --base_url http:\u002F\u002F127.0.0.1:8000\u002Fv1 --prompt \"描述这张图片\" \\\n    --image https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fli-plus_chatglm.cpp_readme_f52c51021b38.jpg --temp 0\n# request with image url\npython3 examples\u002Fopenai_client.py --base_url http:\u002F\u002F127.0.0.1:8000\u002Fv1 --prompt \"描述这张图片\" \\\n    --image https:\u002F\u002Fwww.barnorama.com\u002Fwp-content\u002Fuploads\u002F2016\u002F12\u002F03-Confusing-Pictures.jpg --temp 0\n```\n\nWith this API server as the backend, ChatGLM.cpp models can be integrated into any frontend that uses an OpenAI-style API, including [mckaywrigley\u002Fchatbot-ui](https:\u002F\u002Fgithub.com\u002Fmckaywrigley\u002Fchatbot-ui), [fuergaosi233\u002Fwechat-chatgpt](https:\u002F\u002Fgithub.com\u002Ffuergaosi233\u002Fwechat-chatgpt), [Yidadaa\u002FChatGPT-Next-Web](https:\u002F\u002Fgithub.com\u002FYidadaa\u002FChatGPT-Next-Web), and more.\n\n## Using Docker\n\n**Option 1: Building Locally**\n\nBuild the Docker image locally and start a container to run inference on CPU:\n```sh\ndocker build . 
--network=host -t chatglm.cpp\n# cpp demo\ndocker run -it --rm -v $PWD\u002Fmodels:\u002Fchatglm.cpp\u002Fmodels chatglm.cpp .\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm-ggml.bin -p \"你好\"\n# python demo\ndocker run -it --rm -v $PWD\u002Fmodels:\u002Fchatglm.cpp\u002Fmodels chatglm.cpp python3 examples\u002Fcli_demo.py -m models\u002Fchatglm-ggml.bin -p \"你好\"\n# langchain api server\ndocker run -it --rm -v $PWD\u002Fmodels:\u002Fchatglm.cpp\u002Fmodels -p 8000:8000 -e MODEL=models\u002Fchatglm-ggml.bin chatglm.cpp \\\n    uvicorn chatglm_cpp.langchain_api:app --host 0.0.0.0 --port 8000\n# openai api server\ndocker run -it --rm -v $PWD\u002Fmodels:\u002Fchatglm.cpp\u002Fmodels -p 8000:8000 -e MODEL=models\u002Fchatglm-ggml.bin chatglm.cpp \\\n    uvicorn chatglm_cpp.openai_api:app --host 0.0.0.0 --port 8000\n```\n\nFor CUDA support, make sure [nvidia-docker](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fnvidia-docker) is installed. Then run:\n```sh\ndocker build . --network=host -t chatglm.cpp-cuda \\\n    --build-arg BASE_IMAGE=nvidia\u002Fcuda:12.2.0-devel-ubuntu20.04 \\\n    --build-arg CMAKE_ARGS=\"-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80\"\ndocker run -it --rm --gpus all -v $PWD\u002Fmodels:\u002Fchatglm.cpp\u002Fmodels chatglm.cpp-cuda \\\n    .\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm-ggml.bin -p \"你好\"\n```\n\n**Option 2: Using Pre-built Image**\n\nThe pre-built image for CPU inference is published on both [Docker Hub](https:\u002F\u002Fhub.docker.com\u002Frepository\u002Fdocker\u002Fliplusx\u002Fchatglm.cpp) and [GitHub Container Registry (GHCR)](https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Fpkgs\u002Fcontainer\u002Fchatglm.cpp).\n\nTo pull from Docker Hub and run demo:\n```sh\ndocker run -it --rm -v $PWD\u002Fmodels:\u002Fchatglm.cpp\u002Fmodels liplusx\u002Fchatglm.cpp:main \\\n    .\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm-ggml.bin -p \"你好\"\n```\n\nTo pull from GHCR and run demo:\n```sh\ndocker 
run -it --rm -v $PWD\u002Fmodels:\u002Fchatglm.cpp\u002Fmodels ghcr.io\u002Fli-plus\u002Fchatglm.cpp:main \\\n    .\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm-ggml.bin -p \"你好\"\n```\n\nPython demo and API servers are also supported in pre-built image. Use it in the same way as **Option 1**.\n\n## Performance\n\nEnvironment:\n* CPU backend performance is measured on a Linux server with Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz using 16 threads.\n* CUDA backend is measured on a V100-SXM2-32GB GPU using 1 thread.\n* MPS backend is measured on an Apple M2 Ultra device using 1 thread.\n\nChatGLM-6B:\n\n|                                | Q4_0  | Q4_1  | Q5_0  | Q5_1  | Q8_0  | F16   |\n|--------------------------------|-------|-------|-------|-------|-------|-------|\n| ms\u002Ftoken (CPU @ Platinum 8260) | 74    | 77    | 86    | 89    | 114   | 189   |\n| ms\u002Ftoken (CUDA @ V100 SXM2)    | 8.1   | 8.7   | 9.4   | 9.5   | 12.0  | 19.1  |\n| ms\u002Ftoken (MPS @ M2 Ultra)      | 11.5  | 12.3  | N\u002FA   | N\u002FA   | 16.1  | 24.4  |\n| file size                      | 3.3G  | 3.7G  | 4.0G  | 4.4G  | 6.2G  | 12G   |\n| mem usage                      | 4.0G  | 4.4G  | 4.7G  | 5.1G  | 6.9G  | 13G   |\n\nChatGLM2-6B \u002F ChatGLM3-6B \u002F CodeGeeX2:\n\n|                                | Q4_0  | Q4_1  | Q5_0  | Q5_1  | Q8_0  | F16   |\n|--------------------------------|-------|-------|-------|-------|-------|-------|\n| ms\u002Ftoken (CPU @ Platinum 8260) | 64    | 71    | 79    | 83    | 106   | 189   |\n| ms\u002Ftoken (CUDA @ V100 SXM2)    | 7.9   | 8.3   | 9.2   | 9.2   | 11.7  | 18.5  |\n| ms\u002Ftoken (MPS @ M2 Ultra)      | 10.0  | 10.8  | N\u002FA   | N\u002FA   | 14.5  | 22.2  |\n| file size                      | 3.3G  | 3.7G  | 4.0G  | 4.4G  | 6.2G  | 12G   |\n| mem usage                      | 3.4G  | 3.8G  | 4.1G  | 4.5G  | 6.2G  | 12G   |\n\nChatGLM4-9B:\n\n|                                | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 | F16  
|\n|--------------------------------|------|------|------|------|------|------|\n| ms\u002Ftoken (CPU @ Platinum 8260) | 105  | 105  | 122  | 134  | 158  | 279  |\n| ms\u002Ftoken (CUDA @ V100 SXM2)    | 12.1 | 12.5 | 13.8 | 13.9 | 17.7 | 27.7 |\n| ms\u002Ftoken (MPS @ M2 Ultra)      | 14.4 | 15.3 | 19.6 | 20.1 | 20.7 | 32.4 |\n| file size                      | 5.0G | 5.5G | 6.1G | 6.6G | 9.4G | 18G  |\n\n## Model Quality\n\nWe measure model quality by evaluating the perplexity over the WikiText-2 test dataset, following the strided sliding window strategy in https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fperplexity. Lower perplexity usually indicates a better model.\n\nDownload and unzip the dataset from [link](https:\u002F\u002Fs3.amazonaws.com\u002Fresearch.metamind.io\u002Fwikitext\u002Fwikitext-2-raw-v1.zip). Measure the perplexity with a stride of 512 and max input length of 2048:\n```sh\n.\u002Fbuild\u002Fbin\u002Fperplexity -m models\u002Fchatglm3-base-ggml.bin -f wikitext-2-raw\u002Fwiki.test.raw -s 512 -l 2048\n```\n\n|                         | Q4_0  | Q4_1  | Q5_0  | Q5_1  | Q8_0  | F16   |\n|-------------------------|-------|-------|-------|-------|-------|-------|\n| [ChatGLM3-6B-Base][1]   | 6.215 | 6.188 | 6.006 | 6.022 | 5.971 | 5.972 |\n| [ChatGLM4-9B-Base][2]   | 6.834 | 6.780 | 6.645 | 6.624 | 6.576 | 6.577 |\n\n[1]: https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002Fchatglm3-6b-base\n[2]: https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002Fglm-4-9b\n\n## Development\n\n**Unit Test & Benchmark**\n\nTo perform unit tests, add this CMake flag `-DCHATGLM_ENABLE_TESTING=ON` to enable testing. Recompile and run the unit test (including benchmark).\n```sh\nmkdir -p build && cd build\ncmake .. -DCHATGLM_ENABLE_TESTING=ON && make -j\n.\u002Fbin\u002Fchatglm_test\n```\n\nFor benchmark only:\n```sh\n.\u002Fbin\u002Fchatglm_test --gtest_filter='Benchmark.*'\n```\n\n**Lint**\n\nTo format the code, run `make lint` inside the `build` folder. 
You should have `clang-format`, `black` and `isort` pre-installed.\n\n**Performance**\n\nTo detect the performance bottleneck, add the CMake flag `-DGGML_PERF=ON`:\n```sh\ncmake .. -DGGML_PERF=ON && make -j\n```\nThis will print timing for each graph operation when running the model.\n\n## Acknowledgements\n\n* This project is greatly inspired by [@ggerganov](https:\u002F\u002Fgithub.com\u002Fggerganov)'s [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp) and is based on his NN library [ggml](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fggml).\n* Thank [@THUDM](https:\u002F\u002Fgithub.com\u002FTHUDM) for the amazing [ChatGLM-6B](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM-6B), [ChatGLM2-6B](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM2-6B), [ChatGLM3](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM3) and [GLM-4](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FGLM-4) and for releasing the model sources and checkpoints.\n","# ChatGLM.cpp\n\n[![CMake](https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Factions\u002Fworkflows\u002Fcmake.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Factions\u002Fworkflows\u002Fcmake.yml)\n[![Python package](https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Factions\u002Fworkflows\u002Fpython-package.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Factions\u002Fworkflows\u002Fpython-package.yml)\n[![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fchatglm-cpp)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fchatglm-cpp\u002F)\n![Python](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fchatglm-cpp)\n[![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-blue)](LICENSE)\n\n基于 
[ChatGLM-6B](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM-6B)、[ChatGLM2-6B](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM2-6B)、[ChatGLM3](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM3) 和 [GLM-4](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FGLM-4)(V) 的 C++ 实现，可在您的 MacBook 上进行实时对话。\n\n![demo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fli-plus_chatglm.cpp_readme_bc25daa9a7a3.gif)\n\n## 特性\n\n亮点：\n* 基于 [ggml](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fggml) 的纯 C++ 实现，工作方式与 [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp) 相同。\n* 通过 int4\u002Fint8 量化、优化的 KV 缓存和并行计算，实现加速且内存高效的 CPU 推理。\n* 支持 P-Tuning v2 和 LoRA 微调模型。\n* 具有打字机效果的流式生成。\n* 提供 Python 绑定、Web 演示、API 服务器等多种可能性。\n\n支持矩阵：\n* 硬件：x86\u002Farm CPU、NVIDIA GPU、Apple Silicon GPU\n* 平台：Linux、MacOS、Windows\n* 模型：[ChatGLM-6B](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM-6B)、[ChatGLM2-6B](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM2-6B)、[ChatGLM3](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM3)、[GLM-4](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FGLM-4)(V)、[CodeGeeX2](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCodeGeeX2)\n\n## 快速入门\n\n**准备工作**\n\n将 ChatGLM.cpp 仓库克隆到本地：\n```sh\ngit clone --recursive https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp.git && cd chatglm.cpp\n```\n\n如果您在克隆仓库时忘记使用 `--recursive` 标志，请在 `chatglm.cpp` 文件夹中运行以下命令：\n```sh\ngit submodule update --init --recursive\n```\n\n**模型量化**\n\n安装加载和量化 Hugging Face 模型所需的必要包：\n```sh\npython3 -m pip install -U pip\npython3 -m pip install torch tabulate tqdm transformers accelerate sentencepiece\n```\n\n使用 `convert.py` 将 ChatGLM-6B 转换为量化后的 GGML 格式。例如，要将 fp16 原始模型转换为 q4_0（量化为 int4）GGML 模型，运行：\n```sh\npython3 chatglm_cpp\u002Fconvert.py -i THUDM\u002Fchatglm-6b -t q4_0 -o models\u002Fchatglm-ggml.bin\n```\n\n原始模型 (`-i \u003Cmodel_name_or_path>`) 可以是 Hugging Face 模型名称，也可以是你预先下载的模型的本地路径。目前支持的模型包括：\n* 
ChatGLM-6B：`THUDM\u002Fchatglm-6b`、`THUDM\u002Fchatglm-6b-int8`、`THUDM\u002Fchatglm-6b-int4`\n* ChatGLM2-6B：`THUDM\u002Fchatglm2-6b`、`THUDM\u002Fchatglm2-6b-int4`、`THUDM\u002Fchatglm2-6b-32k`、`THUDM\u002Fchatglm2-6b-32k-int4`\n* ChatGLM3-6B：`THUDM\u002Fchatglm3-6b`、`THUDM\u002Fchatglm3-6b-32k`、`THUDM\u002Fchatglm3-6b-128k`、`THUDM\u002Fchatglm3-6b-base`\n* ChatGLM4(V)-9B：`THUDM\u002Fglm-4-9b-chat`、`THUDM\u002Fglm-4-9b-chat-1m`、`THUDM\u002Fglm-4-9b`、`THUDM\u002Fglm-4v-9b`\n* CodeGeeX2：`THUDM\u002Fcodegeex2-6b`、`THUDM\u002Fcodegeex2-6b-int4`\n\n您可以自由尝试以下任何一种量化类型，只需指定 `-t \u003Ctype>`：\n| 类型   | 精度 | 对称 |\n| ------ | --------- | --------- |\n| `q4_0` | int4      | true      |\n| `q4_1` | int4      | false     |\n| `q5_0` | int5      | true      |\n| `q5_1` | int5      | false     |\n| `q8_0` | int8      | true      |\n| `f16`  | 半精度    |           |\n| `f32`  | 浮点数    |           |\n\n对于 LoRA 模型，添加 `-l \u003Clora_model_name_or_path>` 标志，即可将 LoRA 权重合并到基础模型中。例如，运行 `python3 chatglm_cpp\u002Fconvert.py -i THUDM\u002Fchatglm3-6b -t q4_0 -o models\u002Fchatglm3-ggml-lora.bin -l shibing624\u002Fchatglm3-6b-csc-chinese-lora`，即可合并来自 Hugging Face 的公开 LoRA 权重。\n\n对于使用 [官方微调脚本](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM3\u002Ftree\u002Fmain\u002Ffinetune_demo) 的 P-Tuning v2 模型，`convert.py` 会自动检测额外的权重。如果输出权重列表中包含 `past_key_values`，则表示 P-Tuning 检查点已成功转换。\n\n**构建与运行**\n\n使用 CMake 编译项目：\n```sh\ncmake -B build\ncmake --build build -j --config Release\n```\n\n现在您可以通过运行以下命令与量化后的 ChatGLM-6B 模型对话：\n```sh\n.\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm-ggml.bin -p 你好\n# 你好👋！我是人工智能助手 ChatGLM-6B，很高兴见到你，欢迎问我任何问题。\n```\n\n若要在交互模式下运行模型，请添加 `-i` 标志。例如：\n```sh\n.\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm-ggml.bin -i\n```\n在交互模式下，您的聊天记录将作为下一轮对话的上下文。\n\n运行 `.\u002Fbuild\u002Fbin\u002Fmain -h` 以了解更多选项！\n\n**尝试其他模型**\n\n\u003Cdetails open>\n\u003Csummary>ChatGLM2-6B\u003C\u002Fsummary>\n\n```sh\npython3 chatglm_cpp\u002Fconvert.py -i THUDM\u002Fchatglm2-6b -t q4_0 -o 
models\u002Fchatglm2-ggml.bin\n.\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm2-ggml.bin -p 你好 --top_p 0.8 --temp 0.8\n# 你好👋！我是人工智能助手 ChatGLM2-6B，很高兴见到你，欢迎问我任何问题。\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails open>\n\u003Csummary>ChatGLM3-6B\u003C\u002Fsummary>\n\nChatGLM3-6B 在对话模式的基础上进一步支持函数调用和代码解释器功能。\n\n对话模式：\n```sh\npython3 chatglm_cpp\u002Fconvert.py -i THUDM\u002Fchatglm3-6b -t q4_0 -o models\u002Fchatglm3-ggml.bin\n.\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm3-ggml.bin -p 你好 --top_p 0.8 --temp 0.8\n# 你好👋！我是人工智能助手 ChatGLM3-6B，很高兴见到你，欢迎问我任何问题。\n```\n\n设置系统提示词：\n```sh\n.\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm3-ggml.bin -p 你好 -s \"你是一个由智谱 AI 训练的大语言模型 ChatGLM3。请仔细遵循用户的指示，并以 Markdown 格式作答。\"\n\n# 你好👋！我是 ChatGLM3，有什么问题可以帮您解答吗？\n```\n\n函数调用：\n~~~\n$ .\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm3-ggml.bin --top_p 0.8 --temp 0.8 --sp examples\u002Fsystem\u002Ffunction_call.txt -i\n系统   > 请尽你所能回答以下问题。你可以使用以下工具：...\n提示   > 生成一个随机数\nChatGLM3 > random_number_generator\n```python\ntool_call(seed=42, range=(0, 100))\n```\n工具调用   > 请手动调用函数 `random_number_generator`，参数为 `tool_call(seed=42, range=(0, 100))`，并将结果提供在下方。\n观测值 > 23\nChatGLM3 > 根据您的要求，我使用随机数生成器API生成了一个随机数。根据API返回结果，生成的随机数为23。\n~~~\n\n代码解释器：\n~~~\n$ .\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm3-ggml.bin --top_p 0.8 --temp 0.8 --sp examples\u002Fsystem\u002Fcode_interpreter.txt -i\n系统   > 你是一位智能AI助手，你叫ChatGLM，你连接着一台电脑，但请注意不能联网。在使用Python解决任务时，你可以运行代码并得到结果，如果运行结果有错误，你需要尽可能对代码进行改进。你可以处理用户上传到电脑上的文件，文件默认存储路径是\u002Fmnt\u002Fdata\u002F。\n提示   > 列出100以内的所有质数\nChatGLM3 > 好的，我会为您列出100以内的所有质数。\n```python\ndef is_prime(n):\n   \"\"\"检查一个数是否为质数。\"\"\"\n   if n \u003C= 1:\n       return False\n   if n \u003C= 3:\n       return True\n   if n % 2 == 0 or n % 3 == 0:\n       return False\n   i = 5\n   while i * i \u003C= n:\n       if n % i == 0 or n % (i + 2) == 0:\n           return False\n       i += 6\n   return True\n\nprimes_upto_100 = [i for i in range(2, 101) if 
is_prime(i)]\nprimes_upto_100\n```\n\n代码解释器 > 请手动运行代码，并将结果提供在下方。\n观测值      > [2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97]\nChatGLM3 > 100以内的所有质数为：\n\n$$\n2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53, 59, 61, 67, 71, 73, 79, 83, 89, 97 \n$$\n~~~\n\n\u003C\u002Fdetails>\n\n\u003Cdetails open>\n\u003Csummary>ChatGLM4-9B\u003C\u002Fsummary>\n\n聊天模式：\n```sh\npython3 chatglm_cpp\u002Fconvert.py -i THUDM\u002Fglm-4-9b-chat -t q4_0 -o models\u002Fchatglm4-ggml.bin\n.\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm4-ggml.bin -p 你好 --top_p 0.8 --temp 0.8\n# 你好👋！有什么可以帮助你的吗？\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails open>\n\u003Csummary>ChatGLM4V-9B\u003C\u002Fsummary>\n\n[![03-Confusing-Pictures](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fli-plus_chatglm.cpp_readme_f52c51021b38.jpg)](https:\u002F\u002Fwww.barnorama.com\u002Fwp-content\u002Fuploads\u002F2016\u002F12\u002F03-Confusing-Pictures.jpg)\n\n你可以使用 `-vt \u003Cvision_type>` 来设置视觉编码器的量化类型。建议在GPU上运行GLM4V，因为即使采用4位量化，视觉编码在CPU上也会非常缓慢。\n```sh\npython3 chatglm_cpp\u002Fconvert.py -i THUDM\u002Fglm-4v-9b -t q4_0 -vt q4_0 -o models\u002Fchatglm4v-ggml.bin\n.\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm4v-ggml.bin --image https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fli-plus_chatglm.cpp_readme_f52c51021b38.jpg -p \"这张图片有什么不寻常的地方\" --temp 0\n# 这张图片中不寻常的地方在于，男子正在一辆黄色出租车后面熨衣服。通常情况下，熨衣是在家中或洗衣店进行的，而不是在车辆上。此外，出租车在行驶中，男子却能够稳定地熨衣，这增加了场景的荒诞感。\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>CodeGeeX2\u003C\u002Fsummary>\n\n```sh\n$ python3 chatglm_cpp\u002Fconvert.py -i THUDM\u002Fcodegeex2-6b -t q4_0 -o models\u002Fcodegeex2-ggml.bin\n$ .\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fcodegeex2-ggml.bin --temp 0 --mode generate -p \"\\\n# language: Python\n# write a bubble sort function\n\"\n\n\ndef bubble_sort(lst):\n    for i in range(len(lst) - 1):\n        for j in range(len(lst) - 1 - i):\n            if lst[j] > lst[j + 
1]:\n                lst[j], lst[j + 1] = lst[j + 1], lst[j]\n    return lst\n\n\nprint(bubble_sort([5, 4, 3, 2, 1]))\n```\n\u003C\u002Fdetails>\n\n## 使用BLAS\n\nBLAS库可以集成进来进一步加速矩阵乘法。然而，在某些情况下，使用BLAS可能会导致性能下降。是否启用BLAS应取决于基准测试的结果。\n\n**Accelerate框架**\n\nAccelerate框架在macOS上会自动启用。若要禁用它，需添加CMake标志 `-DGGML_NO_ACCELERATE=ON`。\n\n**OpenBLAS**\n\nOpenBLAS可以在CPU上提供加速。添加CMake标志 `-DGGML_OPENBLAS=ON` 即可启用。\n```sh\ncmake -B build -DGGML_OPENBLAS=ON && cmake --build build -j\n```\n\n**CUDA**\n\nCUDA可以在NVIDIA GPU上加速模型推理。添加CMake标志 `-DGGML_CUDA=ON` 即可启用。\n```sh\ncmake -B build -DGGML_CUDA=ON && cmake --build build -j\n```\n\n默认情况下，所有内核都会针对所有可能的CUDA架构进行编译，这需要一些时间。若要在特定类型的设备上运行，可以指定 `CMAKE_CUDA_ARCHITECTURES` 来加快nvcc编译速度。例如：\n```sh\ncmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=\"80\"       # 用于A100\ncmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=\"70;75\"    # 兼容V100和T4\n```\n\n要了解你的GPU设备的CUDA架构，可以查看[NVIDIA GPU计算能力](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-gpus)。\n\n**Metal**\n\nMPS（Metal性能着色器）允许计算在强大的Apple Silicon GPU上运行。添加CMake标志 `-DGGML_METAL=ON` 即可启用。\n```sh\ncmake -B build -DGGML_METAL=ON && cmake --build build -j\n```\n\n## Python绑定\n\nPython绑定提供了与原始Hugging Face ChatGLM(2)-6B类似的高级`chat`和`stream_chat`接口。\n\n**安装**\n\n推荐从PyPI安装：这将在你的平台上触发编译。\n```sh\npip install -U chatglm-cpp\n```\n\n若要在NVIDIA GPU上启用CUDA：\n```sh\nCMAKE_ARGS=\"-DGGML_CUDA=ON\" pip install -U chatglm-cpp\n```\n\n若要在Apple硅芯片设备上启用Metal：\n```sh\nCMAKE_ARGS=\"-DGGML_METAL=ON\" pip install -U chatglm-cpp\n```\n\n你也可以从源码安装。为获得加速效果，需添加相应的`CMAKE_ARGS`。\n```sh\n# 从GitHub上托管的最新源码安装\npip install git+https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp.git@main\n\n# 或者在 Git 克隆仓库后，从本地源安装\npip install .\n```\n\n适用于 Linux \u002F MacOS \u002F Windows 的 CPU 后端预编译轮子已在 [release](https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Freleases) 中发布。对于 CUDA \u002F Metal 后端，请从源代码或源码分发包进行编译。\n\n**使用预先转换的 GGML 模型**\n\n以下是一个简单的示例，使用 `chatglm_cpp.Pipeline` 加载 GGML 模型并与之对话。首先进入 examples 文件夹（`cd 
examples`），然后启动 Python 交互式 shell：\n```python\n>>> import chatglm_cpp\n>>> \n>>> pipeline = chatglm_cpp.Pipeline(\"..\u002Fmodels\u002Fchatglm-ggml.bin\")\n>>> pipeline.chat([chatglm_cpp.ChatMessage(role=\"user\", content=\"你好\")])\nChatMessage(role=\"assistant\", content=\"你好👋！我是人工智能助手 ChatGLM-6B，很高兴见到你，欢迎问我任何问题。\", tool_calls=[])\n```\n\n若需以流式方式对话，请运行以下 Python 示例：\n```sh\npython3 cli_demo.py -m ..\u002Fmodels\u002Fchatglm-ggml.bin -i\n```\n\n在浏览器中启动 Web 演示以进行对话：\n```sh\npython3 web_demo.py -m ..\u002Fmodels\u002Fchatglm-ggml.bin\n```\n\n![web_demo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fli-plus_chatglm.cpp_readme_3117cb47aac9.jpg)\n\n其他模型：\n\n\u003Cdetails open>\n\u003Csummary>ChatGLM2-6B\u003C\u002Fsummary>\n\n```sh\npython3 cli_demo.py -m ..\u002Fmodels\u002Fchatglm2-ggml.bin -p 你好 --temp 0.8 --top_p 0.8  # CLI 演示\npython3 web_demo.py -m ..\u002Fmodels\u002Fchatglm2-ggml.bin --temp 0.8 --top_p 0.8  # Web 演示\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails open>\n\u003Csummary>ChatGLM3-6B\u003C\u002Fsummary>\n\n**CLI 演示**\n\n聊天模式：\n```sh\npython3 cli_demo.py -m ..\u002Fmodels\u002Fchatglm3-ggml.bin -p 你好 --temp 0.8 --top_p 0.8\n```\n\n函数调用：\n```sh\npython3 cli_demo.py -m ..\u002Fmodels\u002Fchatglm3-ggml.bin --temp 0.8 --top_p 0.8 --sp system\u002Ffunction_call.txt -i\n```\n\n代码解释器：\n```sh\npython3 cli_demo.py -m ..\u002Fmodels\u002Fchatglm3-ggml.bin --temp 0.8 --top_p 0.8 --sp system\u002Fcode_interpreter.txt -i\n```\n\n**Web 演示**\n\n安装 Python 依赖项及用于代码解释器的 IPython 内核。\n```sh\npip install streamlit jupyter_client ipython ipykernel\nipython kernel install --name chatglm3-demo --user\n```\n\n启动 Web 演示：\n```sh\nstreamlit run chatglm3_demo.py\n```\n\n| 函数调用               | 代码解释器               |\n|-----------------------------|--------------------------------|\n| ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fli-plus_chatglm.cpp_readme_4ae6260fb01d.png) | 
![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fli-plus_chatglm.cpp_readme_1f8bd8ef7727.png) |\n\n\u003C\u002Fdetails>\n\n\u003Cdetails open>\n\u003Csummary>ChatGLM4-9B\u003C\u002Fsummary>\n\n聊天模式：\n```sh\npython3 cli_demo.py -m ..\u002Fmodels\u002Fchatglm4-ggml.bin -p 你好 --temp 0.8 --top_p 0.8\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails open>\n\u003Csummary>ChatGLM4V-9B\u003C\u002Fsummary>\n\n聊天模式：\n```sh\npython3 cli_demo.py -m ..\u002Fmodels\u002Fchatglm4v-ggml.bin --image 03-Confusing-Pictures.jpg -p \"这张图片有什么不寻常之处\" --temp 0\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>CodeGeeX2\u003C\u002Fsummary>\n\n```sh\n# CLI 演示\npython3 cli_demo.py -m ..\u002Fmodels\u002Fcodegeex2-ggml.bin --temp 0 --mode generate -p \"\\\n# language: Python\n# write a bubble sort function\n\"\n# Web 演示\npython3 web_demo.py -m ..\u002Fmodels\u002Fcodegeex2-ggml.bin --temp 0 --max_length 512 --mode generate --plain\n```\n\u003C\u002Fdetails>\n\n**在运行时转换 Hugging Face LLMs**\n\n有时提前转换并保存中间 GGML 模型可能不太方便。这里提供一种直接从原始 Hugging Face 模型加载、在一分钟内将其量化为 GGML 模型并开始服务的方法。您只需将 GGML 模型路径替换为 Hugging Face 模型名称或路径即可。\n```python\n>>> import chatglm_cpp\n>>> \n>>> pipeline = chatglm_cpp.Pipeline(\"THUDM\u002Fchatglm-6b\", dtype=\"q4_0\")\nLoading checkpoint shards: 100%|██████████████████████████████████| 8\u002F8 [00:10\u003C00:00,  1.27s\u002Fit]\nProcessing model states: 100%|████████████████████████████████| 339\u002F339 [00:23\u003C00:00, 14.73it\u002Fs]\n...\n>>> pipeline.chat([chatglm_cpp.ChatMessage(role=\"user\", content=\"你好\")])\nChatMessage(role=\"assistant\", content=\"你好👋！我是人工智能助手 ChatGLM-6B，很高兴见到你，欢迎问我任何问题。\", tool_calls=[])\n```\n\n同样地，在任何示例脚本中将 GGML 模型路径替换为 Hugging Face 模型，即可正常工作。例如：\n```sh\npython3 cli_demo.py -m THUDM\u002Fchatglm-6b -p 你好 -i\n```\n\n## API 服务器\n\n我们支持多种类型的 API 服务器，以便与流行的前端集成。可通过以下命令安装额外依赖：\n```sh\npip install 'chatglm-cpp[api]'\n```\n\n请记得添加相应的 `CMAKE_ARGS` 以启用加速。\n\n**LangChain API**\n\n启动 LangChain 的 API 
服务器：\n```sh\nMODEL=.\u002Fmodels\u002Fchatglm2-ggml.bin uvicorn chatglm_cpp.langchain_api:app --host 127.0.0.1 --port 8000\n```\n\n使用 `curl` 测试 API 端点：\n```sh\ncurl http:\u002F\u002F127.0.0.1:8000 -H 'Content-Type: application\u002Fjson' -d '{\"prompt\": \"你好\"}'\n```\n\n结合 LangChain 使用：\n```python\n>>> from langchain.llms import ChatGLM\n>>> \n>>> llm = ChatGLM(endpoint_url=\"http:\u002F\u002F127.0.0.1:8000\")\n>>> llm.predict(\"你好\")\n'你好👋！我是人工智能助手 ChatGLM2-6B，很高兴见到你，欢迎问我任何问题。'\n```\n\n更多选项请参阅 [examples\u002Flangchain_client.py](examples\u002Flangchain_client.py) 和 [LangChain ChatGLM 集成](https:\u002F\u002Fpython.langchain.com\u002Fdocs\u002Fintegrations\u002Fllms\u002Fchatglm)。\n\n**OpenAI API**\n\n启动兼容 [OpenAI 聊天完成协议](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fapi-reference\u002Fchat)的 API 服务器：\n```sh\nMODEL=.\u002Fmodels\u002Fchatglm3-ggml.bin uvicorn chatglm_cpp.openai_api:app --host 127.0.0.1 --port 8000\n```\n\n使用 `curl` 测试您的端点：\n```sh\ncurl http:\u002F\u002F127.0.0.1:8000\u002Fv1\u002Fchat\u002Fcompletions -H 'Content-Type: application\u002Fjson' \\\n    -d '{\"messages\": [{\"role\": \"user\", \"content\": \"你好\"}]}'\n```\n\n使用 OpenAI 客户端与您的模型对话：\n```python\n>>> from openai import OpenAI\n>>> \n>>> client = OpenAI(base_url=\"http:\u002F\u002F127.0.0.1:8000\u002Fv1\")\n>>> response = client.chat.completions.create(model=\"default-model\", messages=[{\"role\": \"user\", \"content\": \"你好\"}])\n>>> response.choices[0].message.content\n'你好👋！我是人工智能助手 ChatGLM3-6B，很高兴见到你，欢迎问我任何问题。'\n```\n\n如需流式响应，请查看示例客户端脚本：\n```sh\npython3 examples\u002Fopenai_client.py --base_url http:\u002F\u002F127.0.0.1:8000\u002Fv1 --stream --prompt 你好\n```\n\n工具调用也同样支持：\n```sh\npython3 examples\u002Fopenai_client.py --base_url http:\u002F\u002F127.0.0.1:8000\u002Fv1 --tool_call --prompt 上海天气怎么样\n```\n\n请求 GLM4V 并附带图像输入：\n```sh\n# 使用本地图像文件请求\npython3 examples\u002Fopenai_client.py --base_url http:\u002F\u002F127.0.0.1:8000\u002Fv1 --prompt \"描述这张图片\" \\\n    --image 
https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fli-plus_chatglm.cpp_readme_f52c51021b38.jpg --temp 0\n# 使用图像 URL 请求\npython3 examples\u002Fopenai_client.py --base_url http:\u002F\u002F127.0.0.1:8000\u002Fv1 --prompt \"描述这张图片\" \\\n    --image https:\u002F\u002Fwww.barnorama.com\u002Fwp-content\u002Fuploads\u002F2016\u002F12\u002F03-Confusing-Pictures.jpg --temp 0\n```\n\n借助此 API 服务器作为后端，ChatGLM.cpp 模型可以无缝集成到任何使用 OpenAI 风格 API 的前端中，包括 [mckaywrigley\u002Fchatbot-ui](https:\u002F\u002Fgithub.com\u002Fmckaywrigley\u002Fchatbot-ui)、[fuergaosi233\u002Fwechat-chatgpt](https:\u002F\u002Fgithub.com\u002Ffuergaosi233\u002Fwechat-chatgpt)、[Yidadaa\u002FChatGPT-Next-Web](https:\u002F\u002Fgithub.com\u002FYidadaa\u002FChatGPT-Next-Web) 等。\n\n## 使用 Docker\n\n**选项 1：本地构建**\n\n在本地构建 Docker 镜像，并启动容器以在 CPU 上运行推理：\n```sh\ndocker build . --network=host -t chatglm.cpp\n\n# C++ 示例\ndocker run -it --rm -v $PWD\u002Fmodels:\u002Fchatglm.cpp\u002Fmodels chatglm.cpp .\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm-ggml.bin -p \"你好\"\n# Python 示例\ndocker run -it --rm -v $PWD\u002Fmodels:\u002Fchatglm.cpp\u002Fmodels chatglm.cpp python3 examples\u002Fcli_demo.py -m models\u002Fchatglm-ggml.bin -p \"你好\"\n# LangChain API 服务器\ndocker run -it --rm -v $PWD\u002Fmodels:\u002Fchatglm.cpp\u002Fmodels -p 8000:8000 -e MODEL=models\u002Fchatglm-ggml.bin chatglm.cpp \\\n    uvicorn chatglm_cpp.langchain_api:app --host 0.0.0.0 --port 8000\n# OpenAI API 服务器\ndocker run -it --rm -v $PWD\u002Fmodels:\u002Fchatglm.cpp\u002Fmodels -p 8000:8000 -e MODEL=models\u002Fchatglm-ggml.bin chatglm.cpp \\\n    uvicorn chatglm_cpp.openai_api:app --host 0.0.0.0 --port 8000\n```\n\n若需支持 CUDA，请确保已安装 [nvidia-docker](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fnvidia-docker)。然后运行以下命令：\n```sh\ndocker build . 
--network=host -t chatglm.cpp-cuda \\\n    --build-arg BASE_IMAGE=nvidia\u002Fcuda:12.2.0-devel-ubuntu20.04 \\\n    --build-arg CMAKE_ARGS=\"-DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES=80\"\ndocker run -it --rm --gpus all -v $PWD\u002Fmodels:\u002Fchatglm.cpp\u002Fmodels chatglm.cpp-cuda \\\n    .\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm-ggml.bin -p \"你好\"\n```\n\n**选项 2：使用预构建镜像**\n\n用于 CPU 推理的预构建镜像已发布在 [Docker Hub](https:\u002F\u002Fhub.docker.com\u002Frepository\u002Fdocker\u002Fliplusx\u002Fchatglm.cpp) 和 [GitHub Container Registry (GHCR)](https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Fpkgs\u002Fcontainer\u002Fchatglm.cpp) 上。\n\n从 Docker Hub 拉取并运行示例：\n```sh\ndocker run -it --rm -v $PWD\u002Fmodels:\u002Fchatglm.cpp\u002Fmodels liplusx\u002Fchatglm.cpp:main \\\n    .\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm-ggml.bin -p \"你好\"\n```\n\n从 GHCR 拉取并运行示例：\n```sh\ndocker run -it --rm -v $PWD\u002Fmodels:\u002Fchatglm.cpp\u002Fmodels ghcr.io\u002Fli-plus\u002Fchatglm.cpp:main \\\n    .\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm-ggml.bin -p \"你好\"\n```\n\n预构建镜像也支持 Python 示例和 API 服务器，使用方法与 **选项 1** 相同。\n\n## 性能\n\n环境：\n* CPU 后端性能是在一台配备 Intel(R) Xeon(R) Platinum 8260 CPU @ 2.40GHz 的 Linux 服务器上，使用 16 个线程进行测量。\n* CUDA 后端性能是在一块 V100-SXM2-32GB GPU 上，使用 1 个线程进行测量。\n* MPS 后端性能是在 Apple M2 Ultra 设备上，使用 1 个线程进行测量。\n\nChatGLM-6B：\n\n|                                | Q4_0  | Q4_1  | Q5_0  | Q5_1  | Q8_0  | F16   |\n|--------------------------------|-------|-------|-------|-------|-------|-------|\n| ms\u002F标记（CPU @ Platinum 8260） | 74    | 77    | 86    | 89    | 114   | 189   |\n| ms\u002F标记（CUDA @ V100 SXM2）    | 8.1   | 8.7   | 9.4   | 9.5   | 12.0  | 19.1  |\n| ms\u002F标记（MPS @ M2 Ultra）      | 11.5  | 12.3  | N\u002FA   | N\u002FA   | 16.1  | 24.4  |\n| 文件大小                      | 3.3G  | 3.7G  | 4.0G  | 4.4G  | 6.2G  | 12G   |\n| 内存占用                      | 4.0G  | 4.4G  | 4.7G  | 5.1G  | 6.9G  | 13G   |\n\nChatGLM2-6B \u002F 
ChatGLM3-6B \u002F CodeGeeX2：\n\n|                                | Q4_0  | Q4_1  | Q5_0  | Q5_1  | Q8_0  | F16   |\n|--------------------------------|-------|-------|-------|-------|-------|-------|\n| ms\u002F标记（CPU @ Platinum 8260） | 64    | 71    | 79    | 83    | 106   | 189   |\n| ms\u002F标记（CUDA @ V100 SXM2）    | 7.9   | 8.3   | 9.2   | 9.2   | 11.7  | 18.5  |\n| ms\u002F标记（MPS @ M2 Ultra）      | 10.0  | 10.8  | N\u002FA   | N\u002FA   | 14.5  | 22.2  |\n| 文件大小                      | 3.3G  | 3.7G  | 4.0G  | 4.4G  | 6.2G  | 12G   |\n| 内存占用                      | 3.4G  | 3.8G  | 4.1G  | 4.5G  | 6.2G  | 12G   |\n\nChatGLM4-9B：\n\n|                                | Q4_0 | Q4_1 | Q5_0 | Q5_1 | Q8_0 | F16  |\n|--------------------------------|------|------|------|------|------|------|\n| ms\u002F标记（CPU @ Platinum 8260） | 105  | 105  | 122  | 134  | 158  | 279  |\n| ms\u002F标记（CUDA @ V100 SXM2）    | 12.1 | 12.5 | 13.8 | 13.9 | 17.7 | 27.7 |\n| ms\u002F标记（MPS @ M2 Ultra）      | 14.4 | 15.3 | 19.6 | 20.1 | 20.7 | 32.4 |\n| 文件大小                      | 5.0G | 5.5G | 6.1G | 6.6G | 9.4G | 18G  |\n\n## 模型质量\n\n我们通过在 WikiText-2 测试数据集上评估困惑度来衡量模型质量，采用 https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fperplexity 中的滑动窗口策略。较低的困惑度通常表示模型表现更好。\n\n从 [链接](https:\u002F\u002Fs3.amazonaws.com\u002Fresearch.metamind.io\u002Fwikitext\u002Fwikitext-2-raw-v1.zip) 下载并解压数据集。使用步长为 512、最大输入长度为 2048 的方式计算困惑度：\n```sh\n.\u002Fbuild\u002Fbin\u002Fperplexity -m models\u002Fchatglm3-base-ggml.bin -f wikitext-2-raw\u002Fwiki.test.raw -s 512 -l 2048\n```\n\n|                         | Q4_0  | Q4_1  | Q5_0  | Q5_1  | Q8_0  | F16   |\n|-------------------------|-------|-------|-------|-------|-------|-------|\n| [ChatGLM3-6B-Base][1]   | 6.215 | 6.188 | 6.006 | 6.022 | 5.971 | 5.972 |\n| [ChatGLM4-9B-Base][2]   | 6.834 | 6.780 | 6.645 | 6.624 | 6.576 | 6.577 |\n\n[1]: https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002Fchatglm3-6b-base\n[2]: 
https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002Fglm-4-9b\n\n## 开发\n\n**单元测试与基准测试**\n\n要执行单元测试，需添加 CMake 标志 `-DCHATGLM_ENABLE_TESTING=ON` 以启用测试功能。重新编译并运行单元测试（包括基准测试）：\n```sh\nmkdir -p build && cd build\ncmake .. -DCHATGLM_ENABLE_TESTING=ON && make -j\n.\u002Fbin\u002Fchatglm_test\n```\n\n仅进行基准测试：\n```sh\n.\u002Fbin\u002Fchatglm_test --gtest_filter='Benchmark.*'\n```\n\n**代码格式化**\n\n要在 `build` 文件夹内格式化代码，运行 `make lint`。请确保已预先安装 `clang-format`、`black` 和 `isort`。\n\n**性能分析**\n\n要检测性能瓶颈，可添加 CMake 标志 `-DGGML_PERF=ON`：\n```sh\ncmake .. -DGGML_PERF=ON && make -j\n```\n这将在运行模型时打印每次图操作的耗时信息。\n\n## 致谢\n\n* 本项目深受 [@ggerganov](https:\u002F\u002Fgithub.com\u002Fggerganov) 的 [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp) 启发，并基于其 NN 库 [ggml](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fggml) 构建。\n* 感谢 [@THUDM](https:\u002F\u002Fgithub.com\u002FTHUDM) 提供的优秀 [ChatGLM-6B](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM-6B)、[ChatGLM2-6B](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM2-6B)、[ChatGLM3](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FChatGLM3) 和 [GLM-4](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FGLM-4)，以及他们发布的模型源代码和检查点。","# ChatGLM.cpp 快速上手指南\n\nChatGLM.cpp 是基于 C++ 和 ggml 实现的 ChatGLM 系列模型推理框架，支持在 MacBook、Linux 和 Windows 上进行高效的本地实时对话。它支持 int4\u002Fint8 量化、流式输出以及 Python 绑定。\n\n## 环境准备\n\n**系统要求**\n*   **操作系统**: Linux, macOS, Windows\n*   **硬件**: x86\u002FARM CPU, NVIDIA GPU (需 CUDA), Apple Silicon GPU (需 Metal)\n*   **软件依赖**:\n    *   Git\n    *   CMake (3.14+)\n    *   C++ 编译器 (GCC, Clang, 或 MSVC)\n    *   Python 3.8+ (用于模型转换和 Python 绑定)\n\n**前置依赖安装**\n在克隆仓库前，请确保已安装基础构建工具。若使用 GPU 加速，请提前安装对应的 CUDA Toolkit 或 Xcode (macOS)。\n\n## 安装步骤\n\n### 1. 克隆项目\n递归克隆仓库以获取子模块：\n```sh\ngit clone --recursive https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp.git && cd chatglm.cpp\n```\n*注：如果忘记加 `--recursive` 参数，请在目录内运行 `git submodule update --init --recursive`。*\n\n### 2. 
模型量化转换\nChatGLM.cpp 需要将 Hugging Face 格式的模型转换为 GGML 格式。首先安装必要的 Python 包（推荐使用国内镜像源加速）：\n\n```sh\npython3 -m pip install -U pip\npython3 -m pip install torch tabulate tqdm transformers accelerate sentencepiece -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n执行转换脚本（以 ChatGLM3-6B 为例，转换为 int4 量化版本）：\n```sh\npython3 chatglm_cpp\u002Fconvert.py -i THUDM\u002Fchatglm3-6b -t q4_0 -o models\u002Fchatglm3-ggml.bin\n```\n*   `-i`: 模型名称（如 `THUDM\u002Fchatglm3-6b`）或本地路径。\n*   `-t`: 量化类型，推荐 `q4_0` (int4) 以平衡速度与显存。\n*   `-o`: 输出文件路径。\n\n> **支持的模型**: ChatGLM-6B, ChatGLM2-6B, ChatGLM3, GLM-4(V), CodeGeeX2 等。\n\n### 3. 编译项目\n使用 CMake 构建项目。根据硬件选择是否开启加速：\n\n*   **CPU 默认编译**:\n    ```sh\n    cmake -B build\n    cmake --build build -j --config Release\n    ```\n\n*   **NVIDIA GPU 加速 (CUDA)**:\n    ```sh\n    cmake -B build -DGGML_CUDA=ON\n    cmake --build build -j --config Release\n    ```\n\n*   **Apple Silicon GPU 加速 (Metal)**:\n    ```sh\n    cmake -B build -DGGML_METAL=ON\n    cmake --build build -j --config Release\n    ```\n\n## 基本使用\n\n### 命令行交互\n编译完成后，可直接运行生成的可执行文件进行对话。\n\n**单次问答示例**：\n```sh\n.\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm3-ggml.bin -p \"你好\"\n```\n\n**交互式对话（多轮上下文）**：\n添加 `-i` 参数进入交互模式，历史对话将作为上下文。\n```sh\n.\u002Fbuild\u002Fbin\u002Fmain -m models\u002Fchatglm3-ggml.bin -i\n```\n*提示：输入 `exit` 或按下 Ctrl+C 可退出交互模式。运行 `.\u002Fbuild\u002Fbin\u002Fmain -h` 查看更多参数选项。*\n\n### Python 快速调用\n如果你更习惯使用 Python，可以直接安装预编译包或从源码安装。\n\n**安装**：\n```sh\npip install -U chatglm-cpp -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n*如需开启 GPU 加速，请在安装前设置环境变量，例如：`CMAKE_ARGS=\"-DGGML_CUDA=ON\" pip install -U chatglm-cpp`*\n\n**代码示例**：\n```python\nimport chatglm_cpp\n\n# 加载量化后的模型\npipeline = chatglm_cpp.Pipeline(\"models\u002Fchatglm3-ggml.bin\")\n\n# 进行对话\nresponse = pipeline.chat([chatglm_cpp.ChatMessage(role=\"user\", content=\"你好\")])\nprint(response.content)\n```","一位拥有 MacBook Pro 的独立开发者，希望在本地离线环境中快速部署一个支持中文对话的智能客服原型，用于测试私有数据下的回答效果。\n\n### 没有 
chatglm.cpp 时\n- **硬件门槛高**：运行原版 ChatGLM-6B 通常需要配备大显存 NVIDIA 显卡，普通笔记本无法承载，必须租用昂贵的云端 GPU 服务器。\n- **环境配置繁琐**：需要安装庞大的 PyTorch 框架及各类依赖库，版本冲突频繁，在 macOS 上配置 CUDA 替代方案更是耗时耗力。\n- **推理速度慢**：在未量化的浮点精度下，首字生成延迟高，多轮对话时显存占用迅速飙升，导致交互体验卡顿甚至崩溃。\n- **隐私与成本顾虑**：数据需上传至云端或依赖外部 API，存在隐私泄露风险，且按 Token 计费的模式让高频测试成本难以控制。\n\n### 使用 chatglm.cpp 后\n- **原生苹果芯片加速**：直接利用 MacBook 的 Apple Silicon GPU 进行推理，无需额外显卡，纯 C++ 实现让内存占用降低 70% 以上。\n- **极简部署流程**：通过一行命令即可将 Hugging Face 模型转换为 int4 量化格式，配合 Python 绑定快速启动，几分钟内完成环境搭建。\n- **流畅实时交互**：得益于优化的 KV 缓存和并行计算，即使在 CPU 模式下也能实现“打字机”效果的流式输出，响应速度满足实时对话需求。\n- **完全离线安全**：所有计算均在本地完成，敏感业务数据不出设备，且一次性转换后可无限次免费调用，彻底消除云端费用。\n\nchatglm.cpp 让高性能中文大模型真正跑在了开发者的笔记本电脑上，实现了低成本、高隐私且极速的本地化 AI 应用落地。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fli-plus_chatglm.cpp_3117cb47.jpg","li-plus","Jiahao Li","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fli-plus_d3fd1509.jpg","LLM Infra @ ByteDance Seed | B.Eng in Computer Science @ Tsinghua University","@bytedance","Beijing, China","liplus17@163.com",null,"https:\u002F\u002Fliplus.me","https:\u002F\u002Fgithub.com\u002Fli-plus",[88,92,96,100,104],{"name":89,"color":90,"percentage":91},"C++","#f34b7d",71.1,{"name":93,"color":94,"percentage":95},"Python","#3572A5",26.7,{"name":97,"color":98,"percentage":99},"CMake","#DA3434",1.4,{"name":101,"color":102,"percentage":103},"Dockerfile","#384d54",0.5,{"name":105,"color":106,"percentage":107},"Shell","#89e051",0.3,2960,328,"2026-04-10T08:33:30","MIT",4,"Linux, macOS, Windows","非必需。支持 NVIDIA GPU (需开启 CUDA)、Apple Silicon GPU (需开启 Metal) 或仅使用 CPU 推理。未明确指定具体显存大小要求，但建议视觉模型 (GLM-4V) 在 GPU 上运行以避免 CPU 过慢。","未说明 (取决于模型大小及量化等级，int4\u002Fint8 量化旨在降低内存需求)",{"notes":117,"python":118,"dependencies":119},"1. 该项目基于 C++ (ggml) 实现，支持纯 CPU 运行，也支持通过 CMake 编译开启 CUDA (NVIDIA) 或 Metal (Apple Silicon) 加速。\n2. 运行前需使用 convert.py 将 Hugging Face 模型转换为量化的 GGML 格式 (支持 int4\u002Fint8 等)。\n3. 视觉模型 (GLM-4V) 强烈建议使用 GPU 运行，因为即使在 4-bit 量化下，CPU 编码速度也非常慢。\n4. 
可通过 pip 直接安装 Python 绑定 (chatglm-cpp)，安装时可通过环境变量开启 GPU 支持。","3.8+ (根据 PyPI badge 推断，文中示例使用 python3)",[120,121,122,123,124,125,126],"torch","transformers","accelerate","sentencepiece","tabulate","tqdm","cmake",[15],[129,130,131,132,133,134,135],"large-language-models","chatglm","nlp","chatglm2","chatglm3","codegeex2-6b","glm4","2026-03-27T02:49:30.150509","2026-04-12T21:08:36.141773",[139,144,149,154,158,163,168,173],{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},31005,"在 Mac (M1\u002FM2) 上运行 Web Demo 遇到 pydantic 导入错误怎么办？","如果在解决 `_C` 模块缺失后仍遇到 `ImportError: cannot import name 'computed_field' from 'pydantic'`，说明当前的 pydantic 版本不兼容。请检查并升级或降级 pydantic 库至项目要求的版本（通常较新版本需要 pydantic v2+，旧代码可能需要 v1），确保依赖环境匹配。","https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Fissues\u002F14",{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},30998,"如何在编译或安装时启用 GPU (CUDA) 加速？","需要在编译时开启 GGML_CUBLAS 选项。如果使用 pip 安装，请设置环境变量并执行：\nCMAKE_ARGS=\"-DGGML_CUBLAS=ON\" pip install chatglm-cpp --force-reinstall -v --no-cache\n\n如果是本地源码编译，请使用以下 cmake 命令：\ncmake -B build -DGGML_CUBLAS=ON\ncmake --build build -j\n\n注意：在 Windows CMD 中设置环境变量语法可能不同，若报错建议直接在本地重新编译或使用支持该语法的 Shell（如 Git Bash）。","https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Fissues\u002F48",{"id":150,"question_zh":151,"answer_zh":152,"source_url":153},30999,"运行模型时出现 'Floating point exception (core dumped)' 错误怎么办？","这通常是由于编译器版本过低导致的。建议使用 GCC\u002FG++ 11.4 或更高版本进行编译。如果当前版本较低（如 7.5），请升级编译器后重新编译项目即可解决。","https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Fissues\u002F157",{"id":155,"question_zh":156,"answer_zh":157,"source_url":143},31000,"Python 调用时报错 'ModuleNotFoundError: No module named chatglm_cpp._C' 如何解决？","这通常是因为目录命名冲突或编译产物未正确链接。尝试将项目根目录下的 `chatglm_cpp` 文件夹重命名为其他名称（避免与生成的模块名冲突），然后重新执行 `pip install .` 进行安装。此外，确保编译成功且在 Python 环境中能找到对应的 `.so` (Linux\u002FMac) 或 `.pyd` (Windows) 文件。",{"id":159,"question_zh":160,"answer_zh":161,"source_url":162},31001,"运行 convert.py 转换模型时在 
dump_config 阶段报错或失败是什么原因？","原因通常是使用的 ChatGLM2 模型版本较旧，缺少部分 token_id 配置。请拉取项目的 main 分支最新代码（该问题已在后续提交中修复），然后重新运行转换脚本即可。","https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Fissues\u002F13",{"id":164,"question_zh":165,"answer_zh":166,"source_url":167},31002,"目前是否支持 ChatGLM2 模型？","是的，主分支（main branch）已经支持 ChatGLM2。如果遇到类似 'ChatGLMConfig object has no attribute position_encoding_2d' 的错误，请确保已切换到 main 分支并更新了最新代码，参考最新的 README 文档进行部署。","https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Fissues\u002F2",{"id":169,"question_zh":170,"answer_zh":171,"source_url":172},31003,"是否有对 GLM-4 或 GLM-4V 模型的支持计划？","社区用户非常关注 GLM-4 系列（包括 GLM-4-9B 和 GLM-4V）的支持。根据 Issue 讨论，维护者表示相关支持正在进行中或处于开发规划阶段，但具体完成时间需关注项目后续更新。目前建议关注主分支的动态以获取最新进展。","https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Fissues\u002F301",{"id":174,"question_zh":175,"answer_zh":176,"source_url":177},31004,"为什么 LoRA 微调合并后的模型转换成 bin 格式后效果没有变化？","这是一个已知的问题场景。用户在将 LoRA 权重合并到基座模型后进行转换，发现推理效果与未微调前一致。这可能与合并权重的具体操作步骤、转换脚本对合并后权重的读取方式或量化过程中的精度损失有关。建议检查权重合并的正确性，并确认转换脚本是否完整读取了所有层参数。","https:\u002F\u002Fgithub.com\u002Fli-plus\u002Fchatglm.cpp\u002Fissues\u002F196",[179,184,189,194,199,204,209,214,219,224,229,234,239,244,249,254,259,264,269],{"id":180,"version":181,"summary_zh":182,"released_at":183},222905,"v0.4.2","* 在视觉编码器上应用 Flash Attention，以降低首个 token 的延迟。* 修复 Apple Silicon 芯片上的 Metal 编译错误。","2024-07-31T06:12:02",{"id":185,"version":186,"summary_zh":187,"released_at":188},222906,"v0.4.1","* 支持 GLM4V，GLM 系列中的首个视觉语言模型\n* 通过重新调度注意力缩放来修复 logits 中的 NaN\u002FInf 问题","2024-07-25T07:04:09",{"id":190,"version":191,"summary_zh":192,"released_at":193},222907,"v0.4.0","* 按需进行动态内存分配，以充分利用设备内存。不再预设临时缓冲区大小或内存大小。\n* 由于百川和InternLM已集成到llama.cpp中，因此不再支持它们。\n* API 变更：\n  * CMake 的 CUDA 选项：`-DGGML_CUBLAS` 更改为 `-DGGML_CUDA`\n  * CMake 的 CUDA 架构选项：`-DCUDA_ARCHITECTURES` 更改为 `-DCMAKE_CUDA_ARCHITECTURES`\n  * `GenerationConfig` 中的 `num_threads` 
参数已被移除：线程数将自动选择最优设置。","2024-06-21T03:09:42",{"id":195,"version":196,"summary_zh":197,"released_at":198},222908,"v0.3.4","* 修复代码输入分词的正则表达式负向前瞻\n* 通过使用 `apply_chat_template` 计算 token 数量，修复 OpenAI API 服务器","2024-06-14T12:52:34",{"id":200,"version":201,"summary_zh":202,"released_at":203},222909,"v0.3.3","支持 ChatGLM4 对话模式","2024-06-13T06:36:24",{"id":205,"version":206,"summary_zh":207,"released_at":208},222910,"v0.3.2","* 支持 ChatGLM 系列的 p-tuning v2 微调模型\n* 修复 lora 模型及 chatglm3-6b-128k 的 convert.py 脚本\n* 修复 32k\u002F128k 序列长度下的 RoPE theta 配置\n* 改进 CUDA CMake 脚本，使其更好地兼容 nvcc 版本","2024-04-24T08:20:14",{"id":210,"version":211,"summary_zh":212,"released_at":213},222911,"v0.3.1","* 支持在 OpenAI API 服务器中进行函数调用\n* 更快的重复惩罚采样\n* 支持 `max_new_tokens` 生成选项","2024-01-20T16:14:55",{"id":215,"version":216,"summary_zh":217,"released_at":218},222912,"v0.3.0","* ChatGLM3 的全部功能，包括系统提示词、函数调用和代码解释器\n* 全新的 OpenAI 风格聊天 API\n* 在 OpenAI API 服务器中添加令牌使用信息，以兼容 LangChain 前端\n* 修复 chatglm3-6b-32k 的转换错误","2023-11-22T03:08:35",{"id":220,"version":221,"summary_zh":222,"released_at":223},222913,"v0.2.10","* 支持 ChatGLM3 的对话模式。\n* 即将推出：用于系统消息和函数调用的新提示格式。","2023-10-30T06:35:29",{"id":225,"version":226,"summary_zh":227,"released_at":228},222914,"v0.2.9","* 支持 InternLM 7B 和 20B 模型架构","2023-10-22T03:03:42",{"id":230,"version":231,"summary_zh":232,"released_at":233},222915,"v0.2.8","* Metal backend support for all models (ChatGLM & ChatGLM2 & Baichuan-7B & Baichuan-13B)\r\n* Fix GLM generation on CUDA for long context","2023-10-10T16:24:26",{"id":235,"version":236,"summary_zh":237,"released_at":238},222916,"v0.2.7","* Support Baichuan-7B model architecture (works for both Baichuan v1 & v2).\r\n* Minor bug fix and enhancement.","2023-09-28T13:23:18",{"id":240,"version":241,"summary_zh":242,"released_at":243},222917,"v0.2.6","* Support Baichuan-13B on CPU & CUDA backends\r\n* Bug fix for Windows and Metal","2023-08-31T11:50:03",{"id":245,"version":246,"summary_zh":247,"released_at":248},222918,"v0.2.5","* Optimize 
context computing (GEMM) for metal backend\r\n* Support repetition penalty option for generation\r\n* Update Dockerfile for CPU & CUDA backends with full functionality, hosted on GHCR","2023-08-22T16:52:53",{"id":250,"version":251,"summary_zh":252,"released_at":253},222919,"v0.2.4","* Python binding enhancement: support load-and-convert directly from original Hugging Face models. Intermediate GGML model files are no longer necessary.\r\n* Small fix for CLI demo on Windows.","2023-08-11T17:30:06",{"id":255,"version":256,"summary_zh":257,"released_at":258},222920,"v0.2.3","* Windows support: enable AVX\u002FAVX2 for better performance, fix stdout encoding issues, and support python binding on Windows.\r\n* API server: support LangChain integration & OpenAI API compatible server.\r\n* New model: Support CodeGeeX2 model inference in native c++ & python binding.","2023-08-07T06:03:50",{"id":260,"version":261,"summary_zh":262,"released_at":263},222921,"v0.2.2","* Support MPS (Metal Performance Shaders) backend on Apple silicon devices for ChatGLM2.\r\n* Support Volta, Turing and Ampere CUDA architectures.","2023-07-30T16:09:26",{"id":265,"version":266,"summary_zh":267,"released_at":268},222922,"v0.2.1","* 3x speedup for CUDA implementation.\r\n* Increase scratch size to accommodate up to 2k context.","2023-07-22T10:13:31",{"id":270,"version":271,"summary_zh":272,"released_at":273},222923,"v0.2.0","First release:\r\n* Accelerated CPU inference for ChatGLM-6B and ChatGLM2-6B for real-time chatting on MacBook.\r\n* Support int4\u002Fint5\u002Fint8 quantization, KV cache, efficient sampling, parallel computing and streaming generation.\r\n* Python binding, web demo, and more possibilities.","2023-07-08T04:33:22"]