[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-mozilla-ai--llamafile":3,"tool-mozilla-ai--llamafile":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 
50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":80,"owner_twitter":79,"owner_website":81,"owner_url":82,"languages":83,"stars":115,"forks":116,"last_commit_at":117,"license":118,"difficulty_score":119,"env_os":120,"env_gpu":121,"env_ram":122,"env_deps":123,"category_tags":130,"github_topics":79,"view_count":131,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":132,"updated_at":133,"faqs":134,"releases":163},752,"mozilla-ai\u002Fllamafile","llamafile","Distribute and run LLMs with a single file.","llamafile 致力于让大型语言模型的本地运行变得像打开一个普通文件一样简单。它的核心功能是将庞大的模型与推理引擎打包成单个可执行文件，用户无需安装任何依赖或配置复杂的环境，只需下载并运行即可在本地启动 AI 服务。\n\n传统上，部署开源大模型往往面临环境依赖多、跨平台兼容性差等痛点。llamafile 通过结合 llama.cpp 推理库与 Cosmopolitan Libc 技术，成功解决了这些难题。它实现了真正的“一次构建，到处运行”，支持 Windows、macOS、Linux 等多种操作系统及不同 CPU 架构，甚至能在没有 GPU 的普通电脑上流畅运行。此外，它还内置了 whisperfile，提供单文件的语音转文字功能。\n\n无论是希望快速验证想法的开发者、进行本地实验的研究人员，还是追求隐私安全的普通用户，llamafile 都能提供极大的便利。它降低了使用开源大模型的门槛，让强大的 AI 能力真正触手可及，且完全在本地完成，保障了数据隐私。","# llamafile\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmozilla-ai_llamafile_readme_57afc5a146b9.png\" width=\"320\" height=\"320\"\n     alt=\"[line drawing of llama animal head in front of slightly open manilla folder filled with files]\">\n\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache%202.0-blue.svg)](https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fblob\u002Fmain\u002FLICENSE)\n[![ci status](https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Factions\u002Fworkflows\u002Fci.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Factions\u002Fworkflows\u002Fci.yml)\n[![Based on llama.cpp](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fllama.cpp-7f5ee54-orange.svg)](https:\u002F\u002Fgithub.com\u002Fggml-org\u002Fllama.cpp\u002Fcommit\u002F7f5ee54)\n[![Based on whisper.cpp](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fwhisper.cpp-2eeeba5-green.svg)](https:\u002F\u002Fgithub.com\u002Fggml-org\u002Fwhisper.cpp\u002Fcommit\u002F2eeeba5)\n[![Discord](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmozilla-ai_llamafile_readme_3e43a9a19e1b.png)](https:\u002F\u002Fdiscord.gg\u002FYuMNeuKStr)\n[![Mozilla 
Builders](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBuilders-6E6E6E?logo=mozilla&logoColor=white&labelColor=4A4A4A)](https:\u002F\u002Fbuilders.mozilla.org\u002F)\n\n**llamafile lets you distribute and run LLMs with a single file.**\n\nllamafile is a [Mozilla Builders](https:\u002F\u002Fbuilders.mozilla.org\u002F) project (see its [announcement blog post](https:\u002F\u002Fhacks.mozilla.org\u002F2023\u002F11\u002Fintroducing-llamafile\u002F)), now revamped by [Mozilla.ai](https:\u002F\u002Fwww.mozilla.ai\u002Fopen-tools\u002Fllamafile). \n\nOur goal is to make open LLMs much more\naccessible to both developers and end users. We're doing that by\ncombining [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp) with [Cosmopolitan Libc](https:\u002F\u002Fgithub.com\u002Fjart\u002Fcosmopolitan) into one\nframework that collapses all the complexity of LLMs down to\na single-file executable (called a \"llamafile\") that runs\nlocally on most operating systems and CPU architectures, with no installation.\n\nllamafile also includes **[whisperfile](whisperfile\u002Findex.md)**, a single-file speech-to-text tool built on [whisper.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fwhisper.cpp) and the same Cosmopolitan packaging. It supports transcription and translation of audio files across all the same platforms, with no installation required.\n\n\n## v0.10.0\n\n**llamafile versions starting from 0.10.0 use a new build system**, aimed at keeping our code more easily \naligned with the latest versions of llama.cpp. This means they support more recent models and functionalities,\nbut at the same time they might be missing some of\nthe features you were accustomed to (check out [this doc](README_0.10.0.md) for a high-level description of what has been done). If you liked\nthe \"classic experience\" more, you will always be able to access the previous versions from our\n[releases](https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Freleases) page. 
Our pre-built llamafiles always\nshow which version of the server they have been bundled with ([0.9.* example](https:\u002F\u002Fhuggingface.co\u002Fmozilla-ai\u002Fllava-v1.5-7b-llamafile), [0.10.* example](https:\u002F\u002Fhuggingface.co\u002Fmozilla-ai\u002Fllamafile_0.10.0)), so you will always know\nwhich version of the software you are downloading.\n\n\n> **We want to hear from you!**\nWhether you are a new user or a long-time fan, please share what you find most valuable about llamafile and what would make it more useful for you.\n[Read more via the blog](https:\u002F\u002Fblog.mozilla.ai\u002Fllamafile-returns\u002F) and add your voice to the discussion [here](https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fdiscussions\u002F809).\n\n\n## Quick Start\n\nDownload and run your first llamafile in minutes:\n\n```sh\n# Download an example model (Qwen3.5 0.8B)\ncurl -LO https:\u002F\u002Fhuggingface.co\u002Fmozilla-ai\u002Fllamafile_0.10.0\u002Fresolve\u002Fmain\u002FQwen3.5-0.8B-Q8_0.llamafile\n\n# Make it executable (macOS\u002FLinux\u002FBSD)\nchmod +x Qwen3.5-0.8B-Q8_0.llamafile\n\n# Run it\n.\u002FQwen3.5-0.8B-Q8_0.llamafile\n```\n\nWe chose this model because that's the smallest one we have\nbuilt a llamafile for, so most likely to work out-of-the-box for you.\nIf you have powerful hardware and\u002For GPUs, [feel free to choose](example_llamafiles.md)\nlarger and more expressive models which should provide more accurate\nresponses.\n\n**Windows users:** Rename the file to add `.exe` extension before running.\n\n## Documentation\n\nCheck the full documentation in the [docs\u002F](docs\u002F) folder or online at [mozilla-ai.github.io\u002Fllamafile](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002F), or directly jump into one of the following subsections:\n\n- [Quickstart](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Fquickstart\u002F)\n- [Example llamafiles](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Fexample_llamafiles\u002F)\n- [Running a llamafile](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Frunning_llamafile\u002F)\n- [Creating llamafiles](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Fcreating_llamafiles\u002F)\n- [Source installation](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Fsource_installation\u002F)\n- [Technical details](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Ftechnical_details\u002F)\n- [Supported Systems](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Fsupport\u002F)\n- [Troubleshooting](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Ftroubleshooting\u002F)\n- [Whisperfile](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Fwhisperfile\u002F)\n\n\n## Licensing\n\nWhile the llamafile project is Apache 2.0-licensed, our changes\nto llama.cpp and whisper.cpp are licensed under MIT (just like the projects\nthemselves) so as to remain compatible and upstreamable in the future,\nshould that be desired.\n\nThe llamafile logo on this page was generated with the assistance of DALL·E 3.\n\n\n[![Star History Chart](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmozilla-ai_llamafile_readme_d91808b3b684.png)](https:\u002F\u002Fstar-history.com\u002F#Mozilla-Ocho\u002Fllamafile&Date)\n","# llamafile\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmozilla-ai_llamafile_readme_57afc5a146b9.png\" width=\"320\" height=\"320\"\n     alt=\"[line drawing of llama animal head in front of slightly open manilla folder 
filled with files]\">\n\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-Apache%202.0-blue.svg)](https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fblob\u002Fmain\u002FLICENSE)\n[![ci status](https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Factions\u002Fworkflows\u002Fci.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Factions\u002Fworkflows\u002Fci.yml)\n[![Based on llama.cpp](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fllama.cpp-7f5ee54-orange.svg)](https:\u002F\u002Fgithub.com\u002Fggml-org\u002Fllama.cpp\u002Fcommit\u002F7f5ee54)\n[![Based on whisper.cpp](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fwhisper.cpp-2eeeba5-green.svg)](https:\u002F\u002Fgithub.com\u002Fggml-org\u002Fwhisper.cpp\u002Fcommit\u002F2eeeba5)\n[![Discord](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmozilla-ai_llamafile_readme_3e43a9a19e1b.png)](https:\u002F\u002Fdiscord.gg\u002FYuMNeuKStr)\n[![Mozilla Builders](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBuilders-6E6E6E?logo=mozilla&logoColor=white&labelColor=4A4A4A)](https:\u002F\u002Fbuilders.mozilla.org\u002F)\n\n**llamafile 让你能够用单个文件分发和运行大型语言模型（LLM）。**\n\nllamafile 是 [Mozilla Builders](https:\u002F\u002Fbuilders.mozilla.org\u002F) 项目（参见其 [公告博客文章](https:\u002F\u002Fhacks.mozilla.org\u002F2023\u002F11\u002Fintroducing-llamafile\u002F)），现由 [Mozilla.ai](https:\u002F\u002Fwww.mozilla.ai\u002Fopen-tools\u002Fllamafile) 重新打造。 \n\n我们的目标是让开源大型语言模型对开发者和最终用户都更加易于访问。我们通过将 [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp) 与 [Cosmopolitan Libc](https:\u002F\u002Fgithub.com\u002Fjart\u002Fcosmopolitan) 结合到一个框架中来实现这一目标，该框架将所有 LLM 的复杂性简化为单个可执行文件（称为“llamafile\"），可在大多数操作系统和 CPU 架构上本地运行，无需安装。\n\nllamafile 还包含 **[whisperfile]**(whisperfile\u002Findex.md)，这是一个基于 [whisper.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fwhisper.cpp) 和相同的 Cosmopolitan 打包技术的单文件语音转文本工具。它支持在所有相同平台上进行音频文件的转录和翻译，无需安装。\n\n\n## v0.10.0\n\n**从 0.10.0 版本开始的 llamafile 使用新的构建系统**，旨在使我们的代码更容易与最新版本的 llama.cpp 保持同步。这意味着它们支持更新的模型和功能，但同时也可能缺少你习惯的一些功能（查看 [此文档](README_0.10.0.md) 以了解已完成工作的概述）。如果你更喜欢“经典体验”，你始终可以从我们的 [releases](https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Freleases) 页面访问之前的版本。我们的预构建 llamafile 始终显示它们捆绑了哪个版本的服务器（[0.9.* 示例](https:\u002F\u002Fhuggingface.co\u002Fmozilla-ai\u002Fllava-v1.5-7b-llamafile)，[0.10.* 示例](https:\u002F\u002Fhuggingface.co\u002Fmozilla-ai\u002Fllamafile_0.10.0)），因此你将始终知道你下载的是哪个版本的软件。\n\n\n> **我们想听听你的意见！**\n无论你是新用户还是老粉丝，请分享你认为 llamafile 最有价值的地方，以及什么会让它对你更有用。\n[通过博客阅读更多内容](https:\u002F\u002Fblog.mozilla.ai\u002Fllamafile-returns\u002F) 并在 [此处](https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fdiscussions\u002F809) 加入讨论。\n\n\n## Quick Start\n\n在几分钟内下载并运行你的第一个 llamafile：\n\n```sh\n# Download an example model (Qwen3.5 0.8B)\ncurl -LO https:\u002F\u002Fhuggingface.co\u002Fmozilla-ai\u002Fllamafile_0.10.0\u002Fresolve\u002Fmain\u002FQwen3.5-0.8B-Q8_0.llamafile\n\n# Make it executable (macOS\u002FLinux\u002FBSD)\nchmod +x Qwen3.5-0.8B-Q8_0.llamafile\n\n# Run it\n.\u002FQwen3.5-0.8B-Q8_0.llamafile\n```\n\n我们选择这个模型是因为它是我们要构建 llamafile 的最小模型，因此最有可能让你开箱即用。如果你有强大的硬件和\u002F或 GPU（图形处理器），[随意选择](example_llamafiles.md) 更大、表现力更强的模型，它们应该能提供更准确的响应。\n\n**Windows 用户：** 运行前请将文件名重命名以添加 `.exe` 扩展名。\n\n## Documentation\n\n查看 [docs\u002F](docs\u002F) 文件夹中的完整文档，或在 [mozilla-ai.github.io\u002Fllamafile](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002F) 上查看，或直接跳转到以下子部分之一：\n\n- 
[快速开始](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Fquickstart\u002F)\n- [示例 llamafile](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Fexample_llamafiles\u002F)\n- [运行 llamafile](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Frunning_llamafile\u002F)\n- [创建 llamafile](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Fcreating_llamafiles\u002F)\n- [源码安装](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Fsource_installation\u002F)\n- [技术细节](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Ftechnical_details\u002F)\n- [支持的系统](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Fsupport\u002F)\n- [故障排除](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Ftroubleshooting\u002F)\n- [Whisperfile](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002Fwhisperfile\u002F)\n\n\n## Licensing\n\n虽然 llamafile 项目采用 Apache 2.0 许可，但我们针对 llama.cpp 和 whisper.cpp 所做的更改采用 MIT 许可（就像项目本身一样），以便在未来保持兼容并可向上游提交，如果这是所期望的话。\n\n本页面上的 llamafile 标志是在 DALL·E 3 的协助下生成的。\n\n\n[![Star History Chart](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmozilla-ai_llamafile_readme_d91808b3b684.png)](https:\u002F\u002Fstar-history.com\u002F#Mozilla-Ocho\u002Fllamafile&Date)","# llamafile 快速上手指南\n\n## 环境准备\n\n*   **系统要求**: 支持 macOS、Linux、BSD 及 Windows 操作系统。\n*   **硬件架构**: 兼容大多数 CPU 架构（部分模型支持 GPU 加速）。\n*   **前置依赖**: 无需安装 Python、PyTorch 或其他深度学习框架。仅需具备网络连接以下载模型文件，以及磁盘空间存储模型。\n*   **版本说明**: 推荐使用 v0.10.0 及以上版本，该版本采用新构建系统，能更好地对齐 llama.cpp 最新特性。\n\n## 安装步骤\n\nllamafile 采用单文件架构，无需传统安装过程，下载并赋予权限即可使用。\n\n1.  **下载模型文件**\n    以下载官方示例模型（Qwen3.5 0.8B）为例，该模型体积较小，适合快速验证环境：\n    ```sh\n    curl -LO https:\u002F\u002Fhuggingface.co\u002Fmozilla-ai\u002Fllamafile_0.10.0\u002Fresolve\u002Fmain\u002FQwen3.5-0.8B-Q8_0.llamafile\n    ```\n\n2.  **赋予执行权限**\n    在 macOS、Linux 或 BSD 系统上运行以下命令：\n    ```sh\n    chmod +x Qwen3.5-0.8B-Q8_0.llamafile\n    ```\n\n3.  
**Windows 用户注意**\n    请在运行前将文件重命名，添加 `.exe` 后缀（例如：`Qwen3.5-0.8B-Q8_0.exe`）。\n\n## 基本使用\n\n完成上述步骤后，直接在终端中运行该文件即可启动本地 LLM 服务。\n\n```sh\n.\u002FQwen3.5-0.8B-Q8_0.llamafile\n```\n\n*   **建议**: 初学者建议先从小参数模型开始测试。若拥有高性能硬件或 GPU，可选择更大的模型以获得更准确的响应。\n*   **扩展功能**: 工具包内还包含 `whisperfile`，支持音频文件的转录和翻译，无需额外安装。\n\n如需了解详细配置、创建自定义 llamafile 或排查问题，请访问 [官方文档](https:\u002F\u002Fmozilla-ai.github.io\u002Fllamafile\u002F)。","某数据分析师需要在出差途中向团队展示基于隐私数据的本地问答系统，但目标演示电脑既无独立显卡也无法连接外网。\n\n### 没有 llamafile 时\n- 需要预先在每台机器上安装 Python、PyTorch 及大量依赖库，配置过程极易报错。\n- 针对不同 CPU 架构需手动编译源码，Windows 用户常因缺少 Visual Studio 组件而失败。\n- 模型权重文件与代码分离，拷贝过程中容易遗漏关键配置文件导致无法启动。\n- 缺乏统一标准，同事间分享模型时需附带详细的环境说明文档，沟通成本极高。\n\n### 使用 llamafile 后\n- llamafile 将模型权重与推理引擎整合为单一二进制文件，双击即可启动服务。\n- 基于 Cosmopolitan Libc 技术，同一文件可在 macOS、Linux 和 Windows 上无缝运行。\n- 内置自动硬件检测机制，即使无 GPU 也能调用 CPU 进行高效推理，无需额外驱动。\n- 支持通过 HTTP 接口直接访问，方便快速集成到现有工作流中，无需修改代码逻辑。\n\nllamafile 通过极简的单文件交付模式，彻底解决了本地大模型跨平台部署难、环境依赖重的问题。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmozilla-ai_llamafile_57afc5a1.png","mozilla-ai","Mozilla.ai","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fmozilla-ai_daf5bacd.png","main development portal for mozilla.ai",null,"hello@mozilla.ai","https:\u002F\u002Fmozilla.ai","https:\u002F\u002Fgithub.com\u002Fmozilla-ai",[84,88,92,96,100,103,107,111],{"name":85,"color":86,"percentage":87},"C++","#f34b7d",74,{"name":89,"color":90,"percentage":91},"C","#555555",16.4,{"name":93,"color":94,"percentage":95},"Makefile","#427819",3.8,{"name":97,"color":98,"percentage":99},"Python","#3572A5",2.7,{"name":101,"color":102,"percentage":99},"Shell","#89e051",{"name":104,"color":105,"percentage":106},"Roff","#ecdebe",0.3,{"name":108,"color":109,"percentage":110},"Batchfile","#C1F12E",0.1,{"name":112,"color":113,"percentage":114},"CMake","#DA3434",0,24002,1300,"2026-04-05T11:13:33","NOASSERTION",1,"Linux, macOS, Windows, BSD","非必需，支持 CPU 运行，若使用 GPU 则依赖底层 llama.cpp 支持的硬件","未说明",{"notes":124,"python":125,"dependencies":126},"单文件可执行程序，无需安装；Windows 需手动添加 .exe 后缀；基于 llama.cpp 与 Cosmopolitan Libc 构建，实现跨平台零依赖运行；v0.10.0 起使用新构建系统。","无需",[127,128,129],"llama.cpp","whisper.cpp","Cosmopolitan Libc",[26,55,13],23,"2026-03-27T02:49:30.150509","2026-04-06T05:36:29.393053",[135,140,145,150,155,159],{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},3208,"在 Fedora 上运行 llamafile 遇到 AMD GPU 初始化失败（提示找不到 clang++\u002Fhipcc），如何解决？","首先尝试设置环境变量 `HSA_OVERRIDE_GFX_VERSION=11.0.0`。如果仍报错，可能是 ROCm 版本问题，建议升级到 Fedora 40 自带的 ROCm 6.0 版本。有用户反馈升级后问题得到解决。","https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fissues\u002F188",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},3209,"在 ARM64 嵌入式设备（如 Raspberry Pi 5）上运行 llamafile 报错 `ape error: ... 
prog mmap failed`，是什么原因？","这通常是因为设备的 Linux 内核地址空间配置小于 48 位（常见于树莓派官方定制内核）。解决方法是安装支持 48 位地址空间的内核，或者自行重新编译内核。如果无法修改内核，可尝试直接使用 `llama.cpp` 替代。","https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fissues\u002F74",{"id":146,"question_zh":147,"answer_zh":148,"source_url":149},3210,"更新到 llamafile 0.8.5 后，Windows 上 AMD GPU 支持失效，提示找不到 `amdclang++.exe`，如何处理？","这是该版本的回归问题。请确保已正确安装 ROCm 工具链，并检查 `HIP_PATH` 环境变量是否指向包含 `amdclang++.exe` 的正确路径。如果预编译二进制文件缺失，可能需要等待新版本修复或暂时回退到 0.8.4 版本。","https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fissues\u002F441",{"id":151,"question_zh":152,"answer_zh":153,"source_url":154},3211,"llamafile 目前是否原生支持 AMD GPU？需要哪些前置条件？","支持。部分用户反馈在特定硬件（如 Radeon 6800M）上可以开箱即用。但通常需要确保系统安装了 ROCm SDK 及相关编译器（`clang++`, `hipcc`），并且环境变量（如 `HIP_PATH`）配置正确以便 llamafile 找到编译工具。","https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fissues\u002F92",{"id":156,"question_zh":157,"answer_zh":158,"source_url":139},3212,"当 llamafile 提示找不到编译器时，如何手动指定 ROCm 路径？","可以通过设置 `HIP_PATH` 环境变量来手动引导 llamafile 查找编译器。例如执行 `HIP_PATH=\u002Fusr .\u002Fllamafile-0.6 ...`。这有助于解决即使编译器在 `$PATH` 中但仍报错的问题。",{"id":160,"question_zh":161,"answer_zh":162,"source_url":144},3213,"如果 llamafile 在特定硬件上无法运行，有什么备选方案？","如果因内核限制（如 ARM 设备 mmap 失败）或编译环境问题导致 llamafile 无法启动，可以直接使用上游的 `llama.cpp` 项目作为替代方案，它通常提供更底层的控制选项。",[164,169,174,179,184,189,194,199,204,209,214,219,224,229,234,239,244,249,254,259],{"id":165,"version":166,"summary_zh":167,"released_at":168},102718,"0.9.3","## What's Changed\r\n* Fix link to troubleshooting guide by @rsanheim in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F740\r\n* Preserve URL path when building relative URLs in JS by @dmcardle in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F733\r\n* Add Plaintext output option to LocalScore + Respect NO_COLOR env var by @cjpais in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F742\r\n* Update README.md, fix llama 8B table stats by @cjrh in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F745\r\n* Add phi4 support by @cjpais in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F744\r\n* Qwen3 Support by @cjpais in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F743\r\n\r\n## New Contributors\r\n* @rsanheim made their first contribution in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F740\r\n* @dmcardle made their first contribution in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F733\r\n* @cjrh made their first contribution in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F745\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fcompare\u002F0.9.2...0.9.3","2025-05-14T22:23:40",{"id":170,"version":171,"summary_zh":172,"released_at":173},102717,"0.10.0","**llamafile versions starting from 0.10.0 use a new build system**, aimed at keeping our code more easily \r\naligned with the latest versions of llama.cpp. 
This means they support more recent models and functionalities,\r\nbut at the same time they might be missing some of the features you were accustomed to (check out [this doc](README_0.10.0.md) for a high-level description of what has been done).\r\n\r\nIf you liked the \"classic experience\" more, you will always be able to access the previous versions from our [releases](https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Freleases) page. Our pre-built llamafiles show which version of the server they have been bundled with ([0.9.* example](https:\u002F\u002Fhuggingface.co\u002Fmozilla-ai\u002Fllava-v1.5-7b-llamafile), [0.10.* example](https:\u002F\u002Fhuggingface.co\u002Fmozilla-ai\u002Fllamafile_0.10.0)), so you will always know which version of the software you are downloading.\r\n\r\n## What's Changed\r\n* Accept array in chat message content field by @henfiber in https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fpull\u002F760\r\n* chore: Update README.md to include call for community feedback on llamafile by @njbrake in https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fpull\u002F812\r\n* chore: integrate whisper.cpp as a submodule by @njbrake in https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fpull\u002F813\r\n* chore: convert stable diffusion to submodule by @njbrake in https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fpull\u002F818\r\n* chore: llama.cpp as submodule by @njbrake in https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fpull\u002F819\r\n* feat: move docs to mkdocs by @njbrake in https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fpull\u002F824\r\n* chore: Add `update-llama-cpp` workflow. by @daavoo in https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fpull\u002F846\r\n* fix(update-llama-cpp): Use `new_build_wip` as base ref. by @daavoo in https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fpull\u002F850\r\n* Fixed broken llamafile URL in docs by @aittalam in https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fpull\u002F873\r\n* update supported OpenBSD versions by @sthen in https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fpull\u002F897\r\n* llamafile reloaded (v0.10.0) by @aittalam in https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fpull\u002F867\r\n\r\n## New Contributors\r\n* @henfiber made their first contribution in https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fpull\u002F760\r\n* @njbrake made their first contribution in https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fpull\u002F812\r\n* @daavoo made their first contribution in https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fpull\u002F846\r\n* @sthen made their first contribution in https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fpull\u002F897\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmozilla-ai\u002Fllamafile\u002Fcompare\u002F0.9.3...0.10.0","2026-03-19T11:59:19",{"id":175,"version":176,"summary_zh":177,"released_at":178},102719,"0.9.2","## Llamafile\r\n\r\nLlamafile v0.9.2 is a significant release. It adds support for:\r\n* DeepSeek Distil R1 Models\r\n* Gemma 3\r\n* IBM Granite\r\n\r\n## LocalScore\r\n\r\nIn addition 0.9.2 introduces LocalScore, a benchmarking utility and [website](https:\u002F\u002Flocalscore.ai). 
\r\n\r\n**LocalScore** is an open-source tool that both benchmarks how fast Large Language Models (LLMs) run on your specific hardware and serves as a repository for these results. We created LocalScore to provide a simple, portable way to evaluate computer performance across various LLMs while making it easy to share and browse hardware performance data.\r\n\r\nLocalScore is now part of the release of Llamafile under the new CLI utility `localscore`\r\n\r\nYou can run it `.\u002Flocalscore -m \u003Cmodel>`. It is also included in every llamafile so you can benchmark models on your hardware easily using `.\u002Fllamafile --localscore`.\r\n\r\nLocalScore was created with support from [Mozilla Builders](https:\u002F\u002Fbuilders.mozilla.org).\r\n\r\n## What's Changed\r\n* [llamafiler] doc\u002Fv1_chat_completions.md: remove duplicate entry by @mseri in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F607\r\n* Update server readme with code completion (FIM) example by @heaversm in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F637\r\n* URL constructor to get a clean url_prefix (fix #640) by @sizvix in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F641\r\n* Fix translation bug from cpp to js in TS highlight by @emilbayes in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F681\r\n* Add whisperfile server documentation by @alonsosilvaallende in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F707\r\n* Unify button look and rearrange buttons to make them more compact by @corebonts in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F712\r\n* add stable-diffusion.cpp to install target (fix #580) by @rgroesslinger in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F635\r\n* Improve OpenAI compatibility for \u002Fv1\u002F* endpoints by @corebonts in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F713\r\n* Update WSL troubleshooting in README.md by @halter73 in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F585\r\n* Granite three support by @gabe-l-hart in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F608\r\n* Initial support for Gemma 3 models by @corebonts in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F717\r\n* Add copy and info buttons to the chat window and improve small screen UX by @corebonts in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F719\r\n* Avoid streaming incomplete UTF-8 characters by @corebonts in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F727\r\n* Introduce LocalScore CLI by @cjpais in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F734\r\n\r\n## New Contributors\r\n* @mseri made their first contribution in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F607\r\n* @heaversm made their first contribution in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F637\r\n* @sizvix made their first contribution in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F641\r\n* @emilbayes made their first contribution in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F681\r\n* @alonsosilvaallende made their first contribution in 
https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F707\r\n* @corebonts made their first contribution in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F712\r\n* @rgroesslinger made their first contribution in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F635\r\n* @halter73 made their first contribution in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F585\r\n* @gabe-l-hart made their first contribution in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F608\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fcompare\u002F0.9.1...0.9.2","2025-04-03T16:29:57",{"id":180,"version":181,"summary_zh":182,"released_at":183},102720,"0.9.1","This release adds support for DeepSeek Distil models. It improves some documentation, and fixes a segfault when running with an Nvidia GPU.\r\n\r\n## What's Changed\r\n* Update Makefile: Fix PHONY from check to cosmocc and cosmocc-ci respectively by @mofosyne in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F683\r\n* Updated README to reflect WSL 2 command for Windows 11 by @peteski22 in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F685\r\n* Add Support for DeepSeek-R1 models by @Xydane in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F687\r\n* Revert Cosmopolitan to 3.9.7 by @cjpais in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F705\r\n\r\n## New Contributors\r\n* @peteski22 made their first contribution in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F685\r\n* @Xydane made their first contribution in https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fpull\u002F687\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fcompare\u002F0.9.0...0.9.1","2025-03-11T01:14:24",{"id":185,"version":186,"summary_zh":187,"released_at":188},102721,"0.9.0","We've solved all the known issues with the new llamafiler server, and\r\nimproved its web gui. In addition to the llamafiler binary below, the\r\nnew server is available as `llamafile --server --v2`. Its manual can be\r\naccessed via `llamafile --server --v2 --help`.\r\n\r\n- e64c7e2 Include llamafiler in llamafile binary\r\n- a8fd4d2 Improve management of multiple slots\r\n- 4158265 Show progress bar for prompt processing in web ui\r\n- 38677b5 Support relocating matching suffixes in KV cache\r\n- 956e62c Visually indicate messages truncated by context\r\n- 08e7a21 Forget old messages when running out of context\r\n- 59a5d97 8fa1702 Make pledge() security not break things\r\n- 1fc35e2 Add upload button and support text files\r\n- 43bc1eb fe514ef Improve buttons in web ui\r\n\r\nThis change upgrades our cosmocc toolchain, whose recent release has\r\nfixed all known issues and made performance improvements to memory\r\nallocation. 
See the [cosmopolitan releases](https:\u002F\u002Fgithub.com\u002Fjart\u002Fcosmopolitan\u002Freleases) page.\r\n\r\n- c293359 Upgrade to Cosmopolitan v4.0.2\r\n\r\nThe following improvements have been made to the terminal `--chat` bot.\r\n\r\n- f51e535 Fix emoji editing in chatbot\r\n\r\nAdditional changes:\r\n\r\n- 9b03e32 Call appropriate hip api's (#651)\r\n","2025-01-06T00:05:15",{"id":190,"version":191,"summary_zh":192,"released_at":193},102722,"0.8.17","llamafiler has a new web UI which supports two modes of operation:\r\nchatbot and raw completion. Its syntax highlighting is just as advanced\r\nas the CLI chatbot. It looks much nicer than the old web ui. In a future\r\nrelease, llamafiler will be folded into llamafile to replace the old server.\r\n\r\n- 988c9ec Introduce raw completions web ui\r\n- 241bf21 Introduce \u002Fv1\u002Fcompletions endpoint in new server\r\n- 6d89f8f Add binary safety check to server\r\n- d18ddf1 Add redo button to new web ui\r\n- bc82424 Add settings modal to web ui\r\n- bb917bd Add vision model support to new server\r\n- 4c7b7d5 Implement data URI parser\r\n- fb4b3e6 Fix JSON parser bug\r\n- 9d6f89f Improve look and printability of new web ui\r\n- 25b6910 Make chatbot ui more printer friendly\r\n- 30518ca Respond to HTTP OPTIONS requests\r\n- 41abfa3 Work around multiple image handling\r\n- 35bc088 Make default system prompt configurable on web\r\n- 28c8e22 Scale and decimate images as needed in browser\r\n- 14713b5 Get basic chatbot web gui working in llamafiler\r\n- ef08074 Start porting syntax highlighter to JavaScript\r\n- fdfdb13 Port remaining highlighting code to javascript\r\n\r\nThe following improvements have been made to our terminal chatbot.\r\n\r\n- 12c3761 Make CLI chatbot work better with base models\r\n- e5c0921 Improve VT100 support\r\n- 4b61791 Fix VT102 support\r\n- d25c077 Introduce \u002Fupload and \u002Fforget commands to chatbot\r\n- 880ebc7 Handle empty system prompt better in cli chatbot\r\n\r\nGeneral improvements to this project.\r\n\r\n- f581c40 Fix futex prototype\r\n- 54d3c72 Make LLaVA fast again\r\n- 01b8d49 Remove n-gpu-layer limitation (#534)\r\n- 566cdc1 Improve Gemma system prompt generation\r\n- 46284fe Reduce attack surface of stb_image\r\n- 9bb262b Log CUDA kernel vs. runtime versions\r\n\r\nSyntax highlighting improvements for chatbot and web ui.\r\n\r\n- d979a1c Add BNF syntax highlighting\r\n- 4a8311a Add cmake syntax highlighting\r\n- 40e92cf Add Ocaml syntax highlighting\r\n- 0995343 Add more Clojure keywords\r\n- 0068a37 Make D syntax highlighting better\r\n- 0965a4b Make some markdown improvements\r\n- 9b96502 Improve JS\u002FHTML syntax highlighting\r\n- c0622da Put more work into markdown rendering\r\n- fa1c98f Improve markdown to html rendering\r\n- 8915432 Further improve markdown to html\r\n- d25fa3a Improve highlighting in new web ui\r\n- f5a0bd4 Fix JS regex highlighting issue\r\n- 2807ae6 Improve Ada syntax highlighting\r\n- d30da30 Syntax highlight D properly\r\n- 33a057e Improve Ruby some more\r\n- 5b0fff1 Improve Ruby syntax highlighting\r\n- 8413a21 Fix Ruby builtins in web gui\r\n\r\nThe latest cosmopolitan upgrade introduces a new more powerful syntax\r\nfor your .args files. They're now parsed more similarly to the shell,\r\nwith support for C style escaping in double-quoted strings. You can also\r\nnow add shell-style comments to .args files too. 
See tool\u002Fargs\u002Fargs2.c\r\nin the cosmopolitan codebase for the definitive reference.\r\n\r\n- fb59488 Upgrade to Cosmo v3.9.7\r\n- 21af0bf Import upstream bestline changes\r\n\r\nThe following example of the new .args file syntax is provided:\r\n\r\n```sh\r\n# specify model\r\n-m Qwen2.5-Coder-34B-Instruct.Q6_K.gguf\r\n\r\n# prevent flags below from being changed\r\n...\r\n\r\n# specify system prompt\r\n--system-prompt \"\\\r\nyou are a friendly ai assistant\\n\r\nyour job is to be helpful and intelligent\"\r\n\r\n# hide some stuff from user interfaces\r\n--nologo\r\n--no-display-prompt\r\n```\r\n\r\nYou can put .args files inside llamafile, llamafiler, and whisperfile\r\nusing the zipalign program.\r\n\r\nThe following screenshots are provided of the llamafiler web ui.\r\n\r\n![image](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F6a9de260-df2c-4de1-a861-55b0fe8178bb)\r\n\r\n![image](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F356958e6-3bca-493b-b78d-1c6d51c21b40)\r\n\r\n![image](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F26a6675f-ad39-462c-874e-1b0c0fc424ab)\r\n\r\n![image](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F13ba02f8-48ee-4e6b-8e99-ff94aa1e89be)\r\n","2024-12-01T01:00:14",{"id":195,"version":196,"summary_zh":197,"released_at":198},102723,"0.8.16","- Add Julia syntax highlighting support\r\n- Fix possible crash on Windows due to MT bug\r\n- Improve accuracy of chatbot context window management\r\n- The new `llamafiler` server now supports GPU. Pass the `-ngl 999` flag.\r\n- The new `llamafiler` server's `\u002Fv1\u002Fchat\u002Fcompletions` endpoint now supports prompt caching. It may be configured using the `--slots COUNT` and `--ctx-size TOKENS` flags.","2024-11-02T03:42:24",{"id":200,"version":201,"summary_zh":202,"released_at":203},102724,"0.8.15","The `--chat` bot interface now supports syntax highlighting 42 separate programming languages: ada, asm, basic, c, c#, c++, cobol, css, d, forth, fortran, go, haskell, html, java, javascript, json, kotlin, ld, lisp, lua, m4, make, markdown, matlab, pascal, perl, php, python, r, ruby, rust, scala, shell, sql, swift, tcl, tex, txt, typescript, and zig.\r\n\r\nThat chatbot now supports more commands:\r\n\r\n- `\u002Fundo` may be used to have the LLM forget the last thing you said. This is useful when you get a poor response and want to try asking your question a different way, without needing to start the conversation over from scratch.\r\n- `\u002Fpush` and `\u002Fpop` works similarly, in the sense that it allows you to rewind a conversation to a previous state. In this case, it does so by creating save points within your context window. Additionally, `\u002Fstack` may be used to view the current stack.\r\n- `\u002Fclear` may be used to reset the context window to the system prompt, effectively starting your conversation over.\r\n- `\u002Fmanual` may be used to put the chat interface in \"manual mode\" which lets you (1) inject system prompts, and (2) speak as the LLM. This could be useful in cases where you want the LLM to believe it said something when it actually didn't.\r\n- `\u002Fdump` may be used to print out the raw conversation history, including special tokens (that may be model specific). You can also say `\u002Fdump filename.txt` to save the raw conversation to a file.\r\n\r\nWe identified an issue with Google's Gemma models, where the chatbot wasn't actually inserting the system prompt. That's now fixed. 
So you can now instruct Gemma to do roleplaying if you pass the flags `llamafile -m gemma.gguf -p \"you are role playing as foo\" --chat`.\r\n\r\nYou can now type CTRL-J to create multi-line prompts in the terminal chatbot. It works similarly to shift-enter in the browser. It can be a quicker alternative to using the chatbot's triple quote syntax, i.e. `\"\"\"multi-line \u002F message\"\"\"`.\r\n\r\nBugs in the new chatbot have been fixed. For example, we now do a better job making sure special tokens like BOS, EOS, and EOT get inserted when appropriate into the conversation history. This should improve fidelity when using the terminal chatbot interface.\r\n\r\nThe `--threads` and `--threads-batch` flags may now be used separately to tune how many threads are used for prediction and prefill.\r\n\r\nThe llamafile-bench command now supports benchmarking GPU support (see #581 from @cjpais)\r\n\r\nBoth servers now support configuring a URL prefix, thanks to (see #597 and #604 from @vlasky)\r\n\r\nSupport for the IQ quantization formats is being removed from our CUDA module to save on build times. If you want to use IQ quants with your NVIDIA hardware, you need to pass the `--iq --recompile` flags to llamafile once, to build a ggml-cuda module for your system that includes them.\r\n\r\nFinally, we have an alpha release of a new `\u002Fv1\u002Fchat\u002Fcompletions` endpoint for the new `llamafiler` server. We're planning to build a new web interface that's based on this soon, so you're encouraged to test this, since llamafiler will eventually replace the old server too. File an issue if there's any features you need.","2024-10-30T21:13:25",{"id":205,"version":206,"summary_zh":207,"released_at":208},102725,"0.8.14","\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F60c12d26-d087-4227-ab71-0876b50a48d2\" width=\"300\">\r\n\r\n**llamafile lets you distribute and run LLMs with a single file**\r\n\r\nllamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023, which offers superior performance and binary portability to the stock installs of six OSes without needing to be installed. It features the best of [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\u002F) and [cosmopolitan libc](https:\u002F\u002Fgithub.com\u002Fjart\u002Fcosmopolitan) while aiming to stay ahead of the curve by including the most cutting-edge performance  and accuracy enhancements. What llamafile gives you is a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a [shell-scriptable](https:\u002F\u002Fjustine.lol\u002Foneliners\u002F) CLI interface which together put you in control of artificial intelligence.\r\n\r\n### v0.8.14 changes\r\n\r\nThis release introduces our new CLI chatbot interface. It supports\r\nmulti-line input using triple quotes. It will syntax highlight Python,\r\nC, C++, Java, and JavaScript code.\r\n\r\n\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F485324d7-320e-4cc0-aec7-1cefe5ab8447\" width=\"600\">\r\n\r\nThis chatbot is now the default mode of operation. 
When you launch\r\nllamafile without any special arguments, the chatbot will be launched\r\nin the foreground, and the server will be launched in the background.\r\nYou can use the `--chat` and `--server` flags to disambiguate this\r\nbehavior if you only want one of them.\r\n\r\n- a384fd7 Create ollama inspired cli chatbot\r\n- 63205ee Add syntax highlighting to chatbot\r\n- 7b395be Introduce new --chat flag for chatbot\r\n- 28e98b6 Show prompt loading progress in chatbot\r\n- 4199dae Make chat+server hybrid the new default mode\r\n\r\nThe whisperfile server now lets you upload mp3\u002Fogg\u002Fflac.\r\n\r\n- 74dfd21 Rewrite audio file loader code\r\n- 7517a5f whisperfile server: convert files without ffmpeg (#568)\r\n\r\nOther improvements have been made.\r\n\r\n- d617c0b Added vision support to api_like_OAI (#524)\r\n- 726f6e8 Enable gpu support in llamafile-bench (#581)\r\n- c7c4d65 Speed up KV in llamafile-bench\r\n- 2c940da Make replace_all() have linear complexity\r\n- fa4c4e7 Use bf16 kv cache when it's faster\r\n- 20fe696 Upgrade to Cosmopolitan 3.9.4\r\n- c44664b Always favor fp16 arithmetic in tinyBLAS\r\n- 98eff09 Quantize TriLM models using Q2_K_S (#552)\r\n","2024-10-14T01:56:21",{"id":210,"version":211,"summary_zh":212,"released_at":213},102726,"0.8.13","\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fassets\u002F49262\u002Fbbcb0dde-4cd9-431a-9f79-ccb5ecd912d6\" width=\"320\" height=\"320\"\r\n  alt=\"[line drawing of llama animal head in front of slightly open manilla folder filled with files]\">\r\n\r\n**llamafile lets you distribute and run LLMs with a single file**\r\n\r\nllamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023, which offers superior performance and binary portability to the stock installs of six OSes without needing to be installed. It features the best of [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\u002F) and [cosmopolitan libc](https:\u002F\u002Fgithub.com\u002Fjart\u002Fcosmopolitan) while aiming to stay ahead of the curve by including the most cutting-edge performance  and accuracy enhancements. What llamafile gives you is a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a [shell-scriptable](https:\u002F\u002Fjustine.lol\u002Foneliners\u002F) CLI interface which together put you in control of artificial intelligence.\r\n\r\n### v0.8.13 changes\r\n\r\nThis release synchronizes with upstream projects, bringing with it\r\nsupport for the newest models (e.g. Gemma 2B). Support for LLaMA v3 has\r\nbeen significantly improved.\r\n\r\n- e9ee3f9 Synchronize with llama.cpp upstream\r\n- d0b5e8f Upgrade to Cosmopolitan v3.7.1\r\n\r\nThe new llamafiler server is now able to serve 2400 embeddings per\r\nsecond on CPU. That's 3x faster than the llama.cpp server upstream. It's\r\nnow hardened for security. You should be able to safely use it a public\r\nfacing web server. There's a man page for llamafiler. You can also read\r\nthe docs online: [\u002Fllamafile\u002Fserver\u002Fdoc\u002Findex.md](\u002Fllamafile\u002Fserver\u002Fdoc\u002Findex.md).\r\n\r\n- 070aa13 Bring new server up to 2421 embedding\u002Fsec\r\n- 584a327 Increase tokens per second on tiny models\r\n- 99dd1c0 Add seccomp, tokenbucket, and batch prioritization\r\n- cda83f8 Make GGML threads spawn 10x faster\r\n- d451e0e Add chrome:\u002F\u002Ftracing\u002F feature\r\n\r\nThe new llamafiler server now fully supports all the old embedding\r\nendpoints that were provided by `llamafile --server`. 
Support for\r\nserving embeddings has been removed from the old server.\r\n\r\n- be94c1f Add OpenAI \u002Fv1\u002Fembeddings to new llamafiler server\r\n\r\n![image](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F271b40d9-73fa-4801-8533-5bc523a3ebda)\r\n\r\nThis release introduces whisperfile which is a single-file\r\nimplementation of OpenAI's Whisper model. It lets you transcribe speech\r\nto text and even translate it too. Our implementation is based off\r\nGeorgi Gerganov's [whisper.cpp](https:\u002F\u002Fhuggingface.co\u002Fggerganov\u002Fwhisper.cpp) project.\r\nThe project to turn it into a [whisperfile](https:\u002F\u002Fgithub.com\u002Fcjpais\u002Fwhisperfile\u002F) was\r\nfounded by CJ Pais who's handed over maintenance of his awesome work.\r\nThere's a man page for whisperfile (which also can be viewed by running\r\n`.\u002Fwhisperfile --help`) and we have online documentation with markdown\r\ntutorials at [\u002Fwhisper.cpp\u002Fdoc\u002Findex.md](\u002Fwhisper.cpp\u002Fdoc\u002Findex.md).\r\n\r\n- fd891be Merge whisperfile into llamafile (#517)\r\n- 7450034 Use colorblind friendly TTY colors in whisperfile\r\n- ggerganov\u002Fwhisper.cpp#2360 (our fork is upstreaming changes)\r\n\r\nWe developed a faster, more accurate implementation of GeLU. This helps\r\nimprove the performance of tiny models. It leads to measurable quality\r\nimprovements in whisper model output.\r\n\r\n- 8ace604 Write explicitly vectorized GeLU functions\r\n- b5748f3 Implement the real trick to GeLU with proof\r\n- ggerganov\u002Fllama.cpp#8878 (our fork is upstreaming changes)\r\n\r\nWe've been improving floating point numerical stability for very large\r\nmodels, e.g. Mixtral 8x22b and Command-R-Plus. tinyBLAS on CPU for F32,\r\nF16, and BF16 weights now uses a new zero-overhead divide-and-conquer\r\napproach to computing dot products, which we call ruler reduction, that\r\ncan result in a 10x reduction in worst case roundoff error accumulation.\r\n\r\n- cb817f5 Reduce rounding errors for very large models\r\n- 5b06924 Use ruler reduction for GGML dot products\r\n\r\nThis release introduces sdfile, which is our implementation of stable\r\ndiffusion. No documentation is yet provided for this command, other than\r\nthe docs provided by the upstream [stable-diffusion.cpp](https:\u002F\u002Fgithub.com\u002Fleejet\u002Fstable-diffusion.cpp)\r\nproject on which it's based.\r\n\r\n- 3b7b1e3 Add stable-diffusion.cpp\r\n- 25ceb2c Upgrade stable diffusion\r\n\r\nThe list of new architectures and tokenizers introduced by this version are:\r\nOpen ELM, GPT NEOX, Arctic, DeepSeek2, ChatGLM, BitNet, T5, JAIS, Poro,\r\nViking, Tekken, and CodeShell.\r\n\r\n### Known Issues\r\n\r\nThe llamafile executable size is increased from 30mb to 200mb by this release.\r\nThis is caused by ggerganov\u002Fllama.cpp#7156. 
We're already employing some\r\nworkarounds to minimize the impact of upstream development contributions\r\non binary size, and we're aiming to find more in the near future.","2024-08-18T17:22:48",{"id":215,"version":216,"summary_zh":217,"released_at":218},102727,"0.8.12","- 1839bfa Introduce --no-warmup flag\r\n- f40facc Upgrade to [Cosmopolitan v3.6.2](https:\u002F\u002Fgithub.com\u002Fjart\u002Fcosmopolitan\u002Freleases\u002Ftag\u002F3.6.2)\r\n- fdd5d84 Fix build determinism issue\r\n- 3e220e7 Update llamafile version in README\r\n- 909f791 Make zipalign and slicehf gentler on system\r\n- dd10455 Add link to OLMo-7B in README\r\n- 867c752 Fix code compatibility issues","2024-07-28T04:08:32",{"id":220,"version":221,"summary_zh":222,"released_at":223},102728,"0.8.11","- 7469a23 Add smaug-bpe tokenizer","2024-07-23T18:13:37",{"id":225,"version":226,"summary_zh":227,"released_at":228},102729,"0.8.10","\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fassets\u002F49262\u002Fbbcb0dde-4cd9-431a-9f79-ccb5ecd912d6\" width=\"320\" height=\"320\"\r\n  alt=\"[line drawing of llama animal head in front of slightly open manilla folder filled with files]\">\r\n\r\n**llamafile lets you distribute and run LLMs with a single file**\r\n\r\nllamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023, which offers superior performance and binary portability to the stock installs of six OSes without needing to be installed. It features the best of [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\u002F) and [cosmopolitan libc](https:\u002F\u002Fgithub.com\u002Fjart\u002Fcosmopolitan) while aiming to stay ahead of the curve by including the most cutting-edge performance  and accuracy enhancements. What llamafile gives you is a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a [shell-scriptable](https:\u002F\u002Fjustine.lol\u002Foneliners\u002F) CLI interface which together put you in control of artificial intelligence.\r\n\r\nThis release includes a build of the new llamafile server rewrite we've\r\nbeen promising, which we're calling `llamafiler`. It's matured enough to\r\nrecommend for embedding serving. This is the fastest way to serve\r\nembeddings. If you use it with all-MiniLM-L6-v2.Q6_K.gguf then on\r\nThreadripper it can serve JSON \u002Fembedding at 800 req\u002Fsec whereas the old\r\nllama.cpp server could only do 100 req\u002Fsec. So you can fill up your RAG\r\ndatabases very quickly if you productionize this.\r\n\r\nThe old llama.cpp server came from a folder named \"examples\" and was\r\nnever intended to be production worthy. This server is designed to be\r\nsturdy and uncrashable. 
It has \u002Fcompletion and \u002Ftokenize endpoints too,\r\nwhich serves 3.7 million requests per second on Threadripper, thanks to\r\nCosmo Libc improvements.\r\n\r\nSee the [LLaMAfiler Documentation](https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fblob\u002Fmain\u002Fllamafile\u002Fserver\u002Fdoc\u002Findex.md) for further details.\r\n\r\n- 73b1836 Write documentation for new server\r\n- b3930aa Make GGML asynchronously cancelable\r\n- 8604e9a Fix POSIX undefined cancelation behavior\r\n- 323f50a Let SIGQUIT produce per-thread backtraces\r\n- 15d7fba Use semaphore to limit GGML worker threads\r\n- d7c8e33 Add support for JSON parameters to new server\r\n- 7f099cd Make stack overflows recoverable in new server\r\n- fb3421c Add barebones \u002Fcompletion endpoint to new server\r\n\r\nThis release restores support for non-AVX x86 microprocessors. We had to\r\ndrop support at the beginning of the year. However our CPUid dispatching\r\nhas advanced considerably since then. We're now able to offer top speeds\r\non modern hardware, without leaving old hardware behind.\r\n\r\n- a674cfb Restore support for non-AVX microprocessors\r\n- 555fb80 Improve build configuration\r\n\r\nHere's the remaining improvements included in this release:\r\n\r\n- cc30400 Supports SmolLM (#495)\r\n- 4a4c065 Fix CUDA compile warnings and errors\r\n- 82f845c Avoid crashing with BF16 on Apple Metal\r\n","2024-07-23T17:53:04",{"id":230,"version":231,"summary_zh":232,"released_at":233},102730,"0.8.9","This release gets Gemma2 working closer to how Google intended.\r\n\r\n- af22695 Make gemma2-27b-it the same as aistudio.google.com\r\n- 41678c8 Add sliding window mask for Gemma2\r\n- 140eed5 Add soft-capping to Gemma2\r\n\r\nThis release fixes Android support. You can now run LLMs on your phone\r\nusing Cosmopolitan software like llamafile. Thank you @aj47 ([techfren.net](https:\u002F\u002Ftechfren.net))\r\nfor bug reports and and testing efforts. See also other bug fixes described\r\nby the Cosmopolitan [v3.5.4](https:\u002F\u002Fgithub.com\u002Fjart\u002Fcosmopolitan\u002Freleases\u002Ftag\u002F3.5.4) and [v3.5.3](https:\u002F\u002Fgithub.com\u002Fjart\u002Fcosmopolitan\u002Freleases\u002Ftag\u002F3.5.3) release notes.\r\n\r\nOur future replacement for the server now has an \u002Fembedding endpoint. On\r\nmy workstation, it's currently able to serve 851 requests per second for\r\na prompt with 52 tokens, using the all-MiniLM-L6-v2.Q6_K.gguf embeddings\r\nmodel. None of the requests fail and 99th percentile latency is 56.74ms.\r\n\r\n- 1346ef4 Create \u002Fembedding endpoint in new server\r\n- 263d39b Use float to string conversion\r\n- 0d62d05 Reclaim llama_decode() memory on cancelation\r\n- 617d841 Remove ggml_context cache\r\n- 46dda4f Refactor new server and get leak checker working\r\n- cd73243 Prevent vector overflow in llama.cpp\r\n\r\nYou can try the new embedding server as follows:\r\n\r\n    make -j o\u002F\u002Fllamafile\u002Fserver\u002Fmain\r\n    o\u002F\u002Fllamafile\u002Fserver\u002Fmain -m \u002Fweights\u002Fall-MiniLM-L6-v2.F32.gguf\r\n    curl http:\u002F\u002F127.0.0.1:8080\u002Fembedding?prompt=orange\r\n\r\nCompatibility with the old server's API of posting JSON content will be\r\nadded in upcoming changes. The same goes for the OpenAI API. 
The goal's\r\nto be compatible with everything.","2024-07-01T19:11:46",{"id":235,"version":236,"summary_zh":237,"released_at":238},102731,"0.8.8","- 571b4e5 Fix bug preventing GPU extraction on Windows\r\n- 4aea606 Support flash attention in --server mode\r\n- 7fd9101 Don't flush bf16 subnormals to zero\r\n- 7692b85 Add Google Gemma v2 support\r\n- 72fb8ca Introduce --special flag","2024-06-29T18:42:48",{"id":240,"version":241,"summary_zh":242,"released_at":243},102732,"0.8.7","This release includes important performance enhancements for quants.\r\n\r\n- 293a528 Performance improvements on Arm for legacy and k-quants (#453)\r\n- c38feb4 Optimized matrix multiplications for i-quants on `__aarch64__` (#464)\r\n\r\nThis release fixes bugs. For example, we're now using a brand new memory\r\nmanager, which is believed to support platforms like Android that have a\r\nvirtual address space with fewer than 47 bits. This release also restores our\r\nprebuilt Windows AMD GPU support, thanks to tinyBLAS.\r\n\r\n- 0c0e72a Upgrade to [Cosmopolitan v3.5.1](https:\u002F\u002Fgithub.com\u002Fjart\u002Fcosmopolitan\u002Freleases\u002Ftag\u002F3.5.0)\r\n- 629e208 Fix server crash due to \u002Fdev\u002Furandom\r\n- 60404a8 Always use tinyBLAS with AMD GPUs on Windows\r\n- 6d3590c Pacify --temp flag when running in server mode\r\n- a28250b Update GGML_HIP_UMA (#473)\r\n- e973fa2 Improve CPU brand detection\r\n- 9cd8d70 Update sever README build\u002Ftesting instructions (#461)\r\n\r\nIt should be noted that, in future releases, we plan to introduce a new\r\nserver for llamafile. This new server is being designed for performance\r\nand production-worthiness. It's not included in this release, since the\r\nnew server currently only supports a tokenization endpoint. However the\r\nendpoint is capable of doing 2 million requests per second whereas with\r\nthe current server, the most we've ever seen is a few thousand.\r\n\r\n- e0656ea Introduce new llamafile server\r\n","2024-06-24T15:00:18",{"id":245,"version":246,"summary_zh":247,"released_at":248},102733,"0.8.6","Two minor issues are fixed with this release.\r\n\r\n- 69c2dd317d3e326c9f34cd645cde4181fdd4f862 Don't print special tokens for now (improve shell scriptability)\r\n- 866a1291421bb40b7744eb7ebb1750ab0dbaa0a1 Upgrade to Cosmopolitan v3.3.8\r\n\r\nSee the [llamafile v0.8.5 release notes](https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Freleases\u002Ftag\u002F0.8.5) for further details. For driver-only prebuilt AMD GPU support on Windows, please use [llamafile v0.8.4](https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Freleases\u002Ftag\u002F0.8.4)  for the next few weeks, until https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\u002Fissues\u002F7156 is resolved.","2024-05-25T14:27:36",{"id":250,"version":251,"summary_zh":252,"released_at":253},102734,"0.8.5","This release fixes bugs and introduces @kawrakow's latest quant\r\nperformance enhancements (a feature exclusive to llamafile). As of #435\r\nthe K quants now go consistently 2x faster than llama.cpp upstream. 
On\r\nbig CPUs like Threadripper we've doubled the performance of tiny models,\r\nfor both prompt processing and token generation (see the benchmarks\r\nbelow). The `llamafile-bench` and `llamafile-upgrade-engine` commands\r\nhave been introduced.\r\n\r\n- a86e7ce Add Script To Upgrade llamafile Archives (#412)\r\n- 07e87bf 261dfe7 Fix llamafile-quantize and rewrite documentation\r\n- 938cf72 Faster AVX2 matrix multiplications for MoE models (#428)\r\n- eaa756d Faster AVX2 matrix multiplications for legacy quants (#405)\r\n- 7cb15c6 Another performance optimization for Zen4 + refactoring (#435)\r\n- 9206719 8b2f8d8 e675719 4451c6d Introduce llamafile-bench command (cpu mode only)\r\n- 87d4ce1 Fix f16 cpuid check (caused crashes on sandybridge)\r\n- 5c40565 8d1afe4 Avoid crashing on llava ctrl-c\r\n- c0aa43e Introduce bf16 cuda support\r\n- 00e4f72 Enable GGML_CUDA_FORCE_MMQ in tinyBLAS mode\r\n- d228e01 0b5997d 64fbffc Sync with llama.cpp upstream (#427)\r\n- c660d38 Add text embedding models to 'other example llamafiles' table (#422)\r\n- 49cc13c Updated README with instructions to load models from third-party apps (#417)\r\n\r\n**Note**: Please use [llamafile v0.8.4](https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Freleases\u002Ftag\u002F0.8.4) if you need prebuilt (driver-only) AMD GPU support on Windows,\r\nat least for the next few weeks, until https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\u002Fissues\u002F7156 is resolved.\r\n\r\nBinaries run on Linux, Windows, MacOS, FreeBSD, OpenBSD, and NetBSD for\r\nAMD64 and ARM64. Supported GPU backends are CUDA, ROCm, and Metal. Prebuilt GPU\r\nbinaries are provided for CUDA\u002FROCm on Linux, and CUDA on Windows. To\r\ninstall this release on systems with a POSIX-style shell:\r\n\r\n```\r\nsudo -s\r\ncd \u002Fusr\u002Flocal\r\nwget https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Freleases\u002Fdownload\u002F0.8.5\u002Fllamafile-0.8.5.zip\r\nunzip llamafile-0.8.5.zip\r\nexit\r\nllamafile --help\r\n```\r\n\r\nTo upgrade your old llamafiles without needing to redownload, run:\r\n\r\n```\r\nllamafile-upgrade-engine old.llamafile new.llamafile\r\n```\r\n\r\nPrebuilt llamafiles that have the LLM weights included are available at:\r\n\r\n- https:\u002F\u002Fhuggingface.co\u002FMozilla (official)\r\n- https:\u002F\u002Fhuggingface.co\u002Fmodels?library=llamafile (community)\r\n\r\nHere are some tutorials:\r\n\r\n- https:\u002F\u002Fjustine.lol\u002Foneliners\u002F\r\n- https:\u002F\u002Fgithub.com\u002Fmozilla-ocho\u002Fllamafile\u002F\r\n- https:\u002F\u002Ffuture.mozilla.org\u002Fnews\u002Fllamafiles-for-embeddings-in-local-rag-applications\u002F\r\n- https:\u002F\u002Fblog.mozilla.ai\u002Flocal-llm-as-judge-evaluation-with-lm-buddy-prometheus-and-llamafile\u002F\r\n- https:\u002F\u002Fwww.docker.com\u002Fblog\u002Fa-quick-guide-to-containerizing-llamafile-with-docker-for-ai-applications\u002F\r\n\r\nHere are some performance benchmarks for various quantization formats on the world's flagship CPUs. 
See \u003Chttps:\u002F\u002Fjustine.lol\u002Fmatmul\u002F> to compare these numbers to where we were back in March two months ago.\r\n\r\n|                                         cpu_info |                           model_filename |       size |          test |             t\u002Fs |\r\n| :----------------------------------------- | :--------------------------------------- | ---------: | ------------: | --------------: |\r\n| AMD Ryzen Threadripper PRO 7995WX (znver4) |          mixtral-8x7b-instruct-v0.1.BF16 |  86.99 GiB |         pp512 |          447.01 |\r\n| AMD Ryzen Threadripper PRO 7995WX (znver4) |          mixtral-8x7b-instruct-v0.1.BF16 |  86.99 GiB |          tg16 |           11.35 |\r\n| AMD Ryzen Threadripper PRO 7995WX (znver4) |           mixtral-8x7b-instruct-v0.1.F16 |  86.99 GiB |         pp512 |          340.63 |\r\n| AMD Ryzen Threadripper PRO 7995WX (znver4) |           mixtral-8x7b-instruct-v0.1.F16 |  86.99 GiB |          tg16 |           11.01 |\r\n| AMD Ryzen Threadripper PRO 7995WX (znver4) |          mixtral-8x7b-instruct-v0.1.Q8_0 |  46.22 GiB |         pp512 |          288.16 |\r\n| AMD Ryzen Threadripper PRO 7995WX (znver4) |          mixtral-8x7b-instruct-v0.1.Q8_0 |  46.22 GiB |          tg16 |           15.82 |\r\n| AMD Ryzen Threadripper PRO 7995WX (znver4) |          mixtral-8x7b-instruct-v0.1.Q6_K |  35.74 GiB |         pp512 |          431.51 |\r\n| AMD Ryzen Threadripper PRO 7995WX (znver4) |          mixtral-8x7b-instruct-v0.1.Q6_K |  35.74 GiB |          tg16 |           22.73 |\r\n| AMD Ryzen Threadripper PRO 7995WX (znver4) |        mixtral-8x7b-instruct-v0.1.Q5_K_M |  30.95 GiB |         pp512 |          427.71 |\r\n| AMD Ryzen Threadripper PRO 7995WX (znver4) |        mixtral-8x7b-instruct-v0.1.Q5_K_M |  30.95 GiB |          tg16 |           24.90 |\r\n| AMD Ryzen Threadripper PRO 7995WX (znver4) |        mixtral-8x7b-instruct-v0.1.Q4_K_M |  26.49 GiB |         pp512 |          440.03 |\r\n| AMD Ryzen Threadripper PRO 7995WX (znver4) |        mixtral-8x7b-instruct-v0.1.Q4_K_M |  26.49 GiB |          tg16 |           27.31 |\r\n| AMD Ryzen Threadripper PRO 7995WX (znver4) |          mixtral-8x7b-in","2024-05-25T09:06:24",{"id":255,"version":256,"summary_zh":257,"released_at":258},102735,"0.8.4","This release fixes underflows and overflows.\r\n\r\n- A memory bug in the grammar parser has been fixed, that caused commands like `.\u002Fllamafile -m foo.gguf -p bar --grammar 'root::=\"'` (which failed to specify a closing quote) to crash. Anyone using the server as a public facing endpoint (despite our previous recommendations) is strongly encouraged to upgrade. See 22aba95 and 3fe045feef36ce5ee47a61e4bb375bf90b3b4f9a. Credit for discovering (and most importantly, reporting) this issue goes to [Eclypsium](https:\u002F\u002Feclypsium.com\u002F) Security Researcher [Richard Johnson](mailto:Richard.johnson@eclypsium.com). We incorrectly reported earlier that this fix was incorporated into the v0.8.2 release. You need to use the v0.8.4 release. This bug fix was upstreamed in https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\u002Fpull\u002F7194\r\n\r\n- Our new vectorized expf() implementation now handles underflow by producing subnormals rather than flushing to zero. b5c6df6e9e428ea56fe0969da33d8c164e1311f0\r\n\r\nSee these instructions for how to put the latest llamafile software into your old weights, without having to redownload. 
https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fissues\u002F24#issuecomment-1836362558","2024-05-10T09:30:05",{"id":260,"version":261,"summary_zh":262,"released_at":263},102736,"0.8.2","\u003Cimg src=\"https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fllamafile\u002Fassets\u002F49262\u002Fbbcb0dde-4cd9-431a-9f79-ccb5ecd912d6\" width=\"320\" height=\"320\"\r\n  alt=\"[line drawing of llama animal head in front of slightly open manilla folder filled with files]\">\r\n\r\n**llamafile lets you distribute and run LLMs with a single file**\r\n\r\nllamafile is a local LLM inference tool introduced by Mozilla Ocho in Nov 2023, which offers superior performance and binary portability to the stock installs of six OSes without needing to be installed. It features the best of [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\u002F) and [cosmopolitan libc](https:\u002F\u002Fgithub.com\u002Fjart\u002Fcosmopolitan) while aiming to stay ahead of the curve by including the most cutting-edge performance and accuracy enhancements. What llamafile gives you is a fun web GUI chatbot, a turnkey OpenAI API compatible server, and a [shell-scriptable](https:\u002F\u002Fjustine.lol\u002Foneliners\u002F) CLI interface, which together put you in control of artificial intelligence.\r\n\r\n- This release introduces faster AVX2 prompt processing for K-quants and IQ4_XS (#394). This was contributed to llamafile by @ikawrakow, who originally invented K quants last year: https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp\u002Fcommit\u002F99009e72f8072fa552eb02efee436be596c71cdd. In prior releases we recommended the legacy `Q4_0` quant since it was the simplest and most intuitive to get working with recent [matmul optimizations](https:\u002F\u002Fjustine.lol\u002Fmatmul\u002F). Thanks to Iwan Kawrakow's efforts, the best quants (e.g. `Q5_K_M`) will now go the fastest (on modern x86 systems).\r\n\r\n- Text generation (or prediction) should now go slightly faster too, thanks to development work on the matmul kernels and enhancements to thread synchronization (see 89c189e), which should be noticed most on CPUs with many cores running smaller models. MacOS ARM users who are using CPU rather than Metal can expect to see the biggest boost, now that llamafile knows how to utilize all cores (see 6c45e3e).\r\n\r\n- Bugs in the server `\u002Fembedding` endpoint have been fixed (see 0e2845a and 7900294). You can also now pass `llamafile --embedding -m model -p prompt` to have embeddings printed to standard output (see 42bd9b8).\r\n\r\n- This release synchronizes with the upstream llama.cpp project as of May 7th in 94d0940, which improves tokenization for Command-R, Refact, Olmo, and StarCoder. There's a new flash attention op that may be enabled for many models by passing the `-fa` flag. We haven't been able to include this in our prebuilt cuda\u002Frocm binaries yet, so you may need to pass the `llamafile --recompile` flag for GPU.\r\n\r\n- This release introduces the `--precise`, `--fast`, and `--trap` flags, which control the execution of math. The `--precise` flag can slightly enhance the thinking of LLMs at the cost of some performance (see 2af3b88 and 9540b43). The `--fast` flag is included since it's unspecified which mode llamafile will use for any given situation (see bbae0f6 and b749326). The `--trap` flag can help you pinpoint the exact moment any NaNs appear (on CPUs that support this, e.g. most of x86), which is useful for troubleshooting. A combined command-line sketch of these flags appears at the end of these notes. 
Additionally, a new vectorized `expf()` function has been introduced that enables llamafile to compute the exponential function faster and at full quality (see e2b3cb2). This matters because it's the function that powers SiLU and SoftMax, which are used by most of today's premier public models.\r\n\r\n- Most of the CPU code in the GGML library now has optimal performance across different hardware architectures, thanks to new build system techniques. Features, specific options, or models that underperformed before may do better now (see 0bdea60 and c9d7393).\r\n\r\nAdditional fixes:\r\n\r\n- a2d159e Fix server multimodal statistics (#392)\r\n- aa8c01a Revert moondream vision language model support\r\n- eecbf89 More conservative strong\u002Fem markdown matcher (#352)\r\n- 38311f2 CUDA: CUDART \u003C 11.7 workaround for __hmax, __hmax2\r\n- 58d2ca0 Use qsort and set linkage to static for internal functions used for offload-arch-fix (#375)\r\n- 4ee1e398273d63d5a6a9554d89eeabb784568f36 The PDF documentation in llamafile-0.8.2.zip is now fixed\r\n- 4ee1e398273d63d5a6a9554d89eeabb784568f36 Remove warnings from cuda build\r\n\r\nAdditional notes:\r\n\r\n- We're experiencing some instability with our Windows AMD GPU support. If you encounter crashes using the `-ngl 999` flag on Windows, then try using the previous 0.8.1 release. Please also consider filing an issue to report if it doesn't work, or better yet, file an issue if it does work, since we otherwise have no way of knowing that (llamafile doesn't have telemetry, because maximally respecting the user's privacy on their local machine is one of the stated goals of the project). You can also share details about your experience with us on the [Mozilla AI Discord server](https:\u002F\u002Fdiscord.gg\u002FteDuGYVTB2).\r\n\r\nSee these instructions for how to put the latest llamafile software into your old weights, without having to redownload. https:\u002F\u002Fgithub.com\u002FMozilla-Ocho\u002Fl","2024-05-09T23:20:14"]
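To make the flag discussion above concrete, here is a rough command-line sketch. Only flags quoted in these notes are used; the model filenames are placeholders, and combining the flags this way is an assumption rather than a documented recipe:

```
# print a prompt's embedding vector to standard output (model name is a placeholder)
llamafile --embedding -m all-MiniLM-L6-v2.Q6_K.gguf -p "orange"

# favor numerical accuracy over speed, and trap the first NaN on CPUs that
# support it (combining --precise with --trap is assumed, not documented above)
llamafile --precise --trap -m model.gguf -p "Why is the sky blue?"

# rebuild the GPU module locally so the new flash attention op (-fa) can be
# used, since the prebuilt cuda/rocm binaries don't include it yet
llamafile --recompile -ngl 999 -fa -m model.gguf -p "Why is the sky blue?"
```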