[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-amaiya--onprem":3,"tool-amaiya--onprem":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",151314,2,"2026-04-11T23:32:58",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 
token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":77,"owner_url":78,"languages":79,"stars":95,"forks":96,"last_commit_at":97,"license":98,"difficulty_score":32,"env_os":99,"env_gpu":100,"env_ram":101,"env_deps":102,"category_tags":113,"github_topics":76,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":114,"updated_at":115,"faqs":116,"releases":147},6788,"amaiya\u002Fonprem","onprem","A toolkit for applying LLMs to sensitive, non-public data in offline or restricted environments","OnPrem 是一款专为处理敏感和非公开数据而设计的 Python 工具包，旨在让大型语言模型（LLM）能够在离线或受限环境中安全运行。它核心解决了企业在应用 AI 时面临的数据隐私顾虑，默认采用完全本地化执行模式，确保数据不出内网，同时也灵活支持接入 OpenAI、Anthropic 等云端模型以满足不同需求。\n\n这款工具非常适合开发者、研究人员以及需要构建私有化文档智能系统的技术团队使用。无论是进行信息提取、文本摘要、分类还是复杂的问答任务，OnPrem 都能提供成熟的分析流水线。其独特亮点在于对低算力环境的友好支持，通过 SparseStore 等技术实现了无需预存嵌入向量的高效检索增强生成（RAG）；此外，它还内置了可视化的工作流搭建器，让用户能通过点选界面轻松组装复杂的文档分析流程，并支持安全沙箱模式来运行 AI 智能体。凭借对多种后端引擎的广泛兼容性及与 Elasticsearch 等现有工具的无缝集成，OnPrem 成为了平衡数据主权与 AI 能力的理想选择。","# OnPrem.LLM\n\n\n\u003C!-- WARNING: THIS FILE WAS AUTOGENERATED! DO NOT EDIT! -->\n\n> A privacy-conscious toolkit for document intelligence — local by\n> default, cloud-capable\n\n**[OnPrem.LLM](https:\u002F\u002Fgithub.com\u002Famaiya\u002Fonprem)** (or “OnPrem” for\nshort) is a Python-based toolkit for applying large language models\n(LLMs) to sensitive, non-public data in offline or restricted\nenvironments. Inspired largely by the\n[privateGPT](https:\u002F\u002Fgithub.com\u002Fimartinez\u002FprivateGPT) project,\n**OnPrem.LLM** is designed for fully local execution, but also supports\nintegration with a wide range of cloud LLM providers (e.g., OpenAI,\nAnthropic).\n\n**Key Features:**\n\n- Fully local execution with option to leverage cloud as needed. 
See\n  [the cheatsheet](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#cheat-sheet).\n- Analysis pipelines for [many different\n  tasks](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#examples), including\n  information extraction, summarization, classification,\n  question-answering, and agents.\n- Support for environments with modest computational resources through\n  modules like the\n  [SparseStore](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_rag.html#advanced-example-nsf-awards)\n  (e.g., RAG without having to store embeddings in advance).\n- Easily integrate with existing tools in your local environment like\n  [Elasticsearch and\n  Sharepoint](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_vectorstore_factory.html).\n- A [visual workflow\n  builder](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fworkflows.html#visual-workflow-builder)\n  to assemble complex document analysis pipelines with a point-and-click\n  interface.\n\nThe full documentation is [here](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F).\n\n\u003C!--A Google Colab demo of installing and using **OnPrem.LLM** is [here](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1LVeacsQ9dmE1BVzwR3eTLukpeRIMmUqi?usp=sharing).\n-->\n\n**Quick Start**\n\n``` python\n# install\n!pip install onprem[chroma]\nfrom onprem import LLM, utils\n\n# local LLM with Ollama as backend\n!ollama pull llama3.2\nllm = LLM('ollama\u002Fllama3.2')\n\n# basic prompting\nresult = llm.prompt('Give me a short one sentence definition of an LLM.')\n\n# RAG\nutils.download('https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2505.07672', '\u002Ftmp\u002Fmy_documents\u002Fpaper.pdf')\nllm.ingest('\u002Ftmp\u002Fmy_documents')\nresult = llm.ask('What is OnPrem.LLM?')\n\n# switch to cloud LLM using Anthropic as backend\nllm = LLM(\"anthropic\u002Fclaude-3-7-sonnet-latest\")\n\n# structured outputs\nfrom pydantic import BaseModel, Field\nclass MeasuredQuantity(BaseModel):\n    value: str = Field(description=\"numerical value\")\n    unit: str = Field(description=\"unit of measurement\")\nstructured_output = llm.pydantic_prompt('He was going 35 mph.', pydantic_model=MeasuredQuantity)\nprint(structured_output.value) # 35\nprint(structured_output.unit)  # mph\n\n# Safely launch a sandboxed AI agent\nfrom onprem.pipelines import AgentExecutor\nexecutor = AgentExecutor(model='openai\u002Fgpt-5-mini', sandbox=True)\nresult = executor.run(\"\"\"\nSearch this directory for all .md files and:\n1. Extract all headings (# ## ###)\n2. Count total words in each file\n3. 
Create an index file 'documentation_index.md' with:\n   - List of all markdown files\n   - Word count for each\n   - Main topics covered (from headings)\n\"\"\")\n```\n\nMany LLM backends are supported (e.g.,\n[llama_cpp](https:\u002F\u002Fgithub.com\u002Fabetlen\u002Fllama-cpp-python),\n[transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers),\n[Ollama](https:\u002F\u002Follama.com\u002F),\n[vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm),\n[OpenAI](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fmodels),\n[Anthropic](https:\u002F\u002Fdocs.anthropic.com\u002Fen\u002Fdocs\u002Fabout-claude\u002Fmodels\u002Foverview),\netc.).\n\n------------------------------------------------------------------------\n\n\u003Ccenter>\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Famaiya_onprem_readme_06fc2ca0b16a.png\" border=\"0\" alt=\"onprem.llm\" width=\"200\"\u002F>\n\u003C\u002Fp>\n\u003C\u002Fcenter>\n\u003Ccenter>\n\u003Cp align=\"center\">\n\n**[Install](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#install) \\|\n[Usage](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#how-to-use) \\| [Web\nUI](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fwebapp.html) \\|\n[Examples](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#examples) \\|\n[FAQ](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#faq) \\| [How to\nCite](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#how-to-cite)**\n\n\u003C\u002Fp>\n\u003C\u002Fcenter>\n\n*Latest News* 🔥\n\n- \\[2026\u002F03\\] v0.22.0 released and now includes the **AgentExecutor**:\n  safely launch AI agents in a sandboxed environment to solve problems\n  in two lines of code. See [the example notebook on\n  agents](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_agent.html).\n- \\[2026\u002F01\\] v0.21.0 released and now includes support for\n  **metadata-based query routing**. See the [query routing example\n  here](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fpipelines.rag.html#example-using-query-routing-with-rag).\n  Also included in this release: [provider-implemented structured\n  outputs](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#natively-supported-structured-outputs)\n  (e.g., structured outputs with OpenAI, Anthropic, and AWS GovCloud\n  Bedrock).\n- \\[2025\u002F12\\] v0.20.0 released and now includes support for\n  **asynchronous prompts**. See [the\n  example](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples.html#asynchronous-prompts).\n- \\[2025\u002F09\\] v0.19.0 released and now includes support for\n  **workflows**: YAML-configured pipelines for complex document\n  analyses. See [the workflow\n  documentation](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fworkflows.html) for\n  more information.\n- \\[2025\u002F08\\] v0.18.0 released and can now be used with AWS GovCloud\n  LLMs. See [this\n  example](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.backends.html#examples)\n  for more information.\n- \\[2025\u002F07\\] v0.17.0 released and now allows you to connect directly to\n  SharePoint for search and RAG. 
See the [example notebook on vector\n  stores](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_vectorstore_factory.html#rag-with-sharepoint-documents)\n  for more information.\n\n------------------------------------------------------------------------\n\n## Install\n\nOnce you have [installed\nPyTorch](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Flocally\u002F), you can install\n**OnPrem.LLM** with:\n\n``` sh\npip install onprem\n```\n\n**Chroma**: If using RAG with the default Chroma “Dense” vectorstore\n(instead of [sparse\nvectorstore](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#step-1-ingest-the-documents-into-a-vector-database)),\nrun `pip install onprem[chroma]`.\n\n**AI Agents**: If using OnPrem.LLM to launch [AI\nagents](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_agent.html), run\n`pip install onprem[agent]`.\n\n**Llama-cpp-python is optional:**\n\nIf using llama-cpp-python as the LLM backend:\n\n- **CPU:** `pip install llama-cpp-python` ([extra\n  steps](https:\u002F\u002Fgithub.com\u002Famaiya\u002Fonprem\u002Fblob\u002Fmaster\u002FMSWindows.md)\n  required for Microsoft Windows)\n- **GPU**: Follow [instructions\n  below](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#on-gpu-accelerated-inference-with-llama-cpp-python).\n\nInstalling llama-cpp-python is *optional* if any of the following is\ntrue:\n\n- You are using [Ollama](https:\u002F\u002Follama.com\u002F) as the LLM backend.\n- You use Hugging Face Transformers (instead of llama-cpp-python) as the\n  LLM backend by supplying the `model_id` parameter when instantiating\n  an LLM, as [shown\n  here](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#using-hugging-face-transformers-instead-of-llama.cpp).\n- You are using **OnPrem.LLM** with an LLM being served through an\n  [external REST API](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#cheat-sheet)\n  (e.g., vLLM, OpenLLM).\n- You are using **OnPrem.LLM** with a [cloud\n  LLM](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#cheat-sheet) (see cheat sheet\n  below).\n\n### On GPU-Accelerated Inference With `llama-cpp-python`\n\nWhen installing **llama-cpp-python** with\n`pip install llama-cpp-python`, the LLM will run on your **CPU**. 
To\ngenerate answers much faster, you can run the LLM on your **GPU** by\nbuilding **llama-cpp-python** based on your operating system.\n\n- **Linux**:\n  `CMAKE_ARGS=\"-DGGML_CUDA=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir`\n- **Mac**: `CMAKE_ARGS=\"-DGGML_METAL=on\" pip install llama-cpp-python`\n- **Windows 11**: Follow the instructions\n  [here](https:\u002F\u002Fgithub.com\u002Famaiya\u002Fonprem\u002Fblob\u002Fmaster\u002FMSWindows.md#using-the-system-python-in-windows-11s).\n- **Windows Subsystem for Linux (WSL2)**: Follow the instructions\n  [here](https:\u002F\u002Fgithub.com\u002Famaiya\u002Fonprem\u002Fblob\u002Fmaster\u002FMSWindows.md#using-wsl2-with-gpu-acceleration).\n\nFor Linux and Windows, you will need [an up-to-date NVIDIA\ndriver](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fdrivers\u002F) along with the [CUDA\ntoolkit](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads) installed before\nrunning the installation commands above.\n\nAfter following the instructions above, supply the `n_gpu_layers=-1`\nparameter when instantiating an LLM to use your GPU for fast inference:\n\n``` python\nllm = LLM(n_gpu_layers=-1, ...)\n```\n\nQuantized models with 8B parameters and below can typically run on GPUs\nwith as little as 6GB of VRAM. If a model does not fit on your GPU\n(e.g., you get a “CUDA Error: Out-of-Memory” error), you can offload a\nsubset of layers to the GPU by experimenting with different values for\nthe `n_gpu_layers` parameter (e.g., `n_gpu_layers=20`). Setting\n`n_gpu_layers=-1`, as shown above, offloads all layers to the GPU.\n\nSee [the FAQ](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#faq) for extra tips, if\nyou experience issues with\n[llama-cpp-python](https:\u002F\u002Fpypi.org\u002Fproject\u002Fllama-cpp-python\u002F)\ninstallation.\n\n## How to Use\n\n### Setup\n\n``` python\nfrom onprem import LLM\n\nllm = LLM(verbose=False) # default model and backend are used\n```\n\n#### Cheat Sheet\n\n*Local Models:* A number of different local LLM backends are supported.\n\n- **Llama-cpp**: `llm = LLM(default_model=\"llama\", n_gpu_layers=-1)`\n\n- **Llama-cpp with selected GGUF model via URL**:\n\n  ``` python\n   # prompt templates are required for user-supplied GGUF models (see FAQ)\n   llm = LLM(model_url='https:\u002F\u002Fhuggingface.co\u002FTheBloke\u002Fzephyr-7B-beta-GGUF\u002Fresolve\u002Fmain\u002Fzephyr-7b-beta.Q4_K_M.gguf', \n             prompt_template= \"\u003C|system|>\\n\u003C\u002Fs>\\n\u003C|user|>\\n{prompt}\u003C\u002Fs>\\n\u003C|assistant|>\", n_gpu_layers=-1)\n  ```\n\n- **Llama-cpp with selected GGUF model via file path**:\n\n  ``` python\n   # prompt templates are required for user-supplied GGUF models (see FAQ)\n   llm = LLM(model_url='zephyr-7b-beta.Q4_K_M.gguf', \n             model_download_path='\u002Fpath\u002Fto\u002Ffolder\u002Fto\u002Fwhere\u002Fyou\u002Fdownloaded\u002Fmodel',\n             prompt_template= \"\u003C|system|>\\n\u003C\u002Fs>\\n\u003C|user|>\\n{prompt}\u003C\u002Fs>\\n\u003C|assistant|>\", n_gpu_layers=-1)\n  ```\n\n- **Hugging Face Transformers**:\n  `llm = LLM(model_id='Qwen\u002FQwen2.5-0.5B-Instruct', device='cuda')`\n\n- **Ollama**: `llm = LLM(model_url=\"ollama:\u002F\u002Fllama3.2\", api_key='na')`\n\n- **Also Ollama**:\n  `llm = LLM(model_url=\"ollama\u002Fllama3.2\", api_key='na')`\n\n- **Also Ollama**:\n  `llm = LLM(model_url='http:\u002F\u002Flocalhost:11434\u002Fv1', api_key='na', model='llama3.2')`\n\n- **vLLM**:\n  `llm = 
LLM(model_url='http:\u002F\u002Flocalhost:8666\u002Fv1', api_key='na', model='Qwen\u002FQwen2.5-0.5B-Instruct')`\n\n- **Also vLLM**:\n  `llm = LLM('hosted_vllm\u002Fserved-model-name', api_base=\"http:\u002F\u002Flocalhost:8666\u002Fv1\", api_key=\"test123\")`\n  (assumes `served-model-name` parameter is supplied to\n  `vllm.entrypoints.openai.api_server`).\n\n- **vLLM with gpt-oss** (assumes `served-model-name` parameter is\n  supplied to vLLM):\n\n  ``` python\n  # important: set max_tokens to high value due to intermediate reasoning steps that are generated\n  llm = LLM(model_url='http:\u002F\u002Flocalhost:8666\u002Fv1', api_key='your_api_key', model=served_model_name, max_tokens=32000)\n  result = llm.prompt(prompt, reasoning_effort=\"high\")\n  ```\n\n*Cloud Models:* In addition to local LLMs, all cloud LLM providers\nsupported by [LiteLLM](https:\u002F\u002Fgithub.com\u002FBerriAI\u002Flitellm) are\ncompatible:\n\n- **Anthropic Claude**:\n  `llm = LLM(model_url=\"anthropic\u002Fclaude-3-7-sonnet-latest\")`\n\n- **OpenAI GPT-4o**: `llm = LLM(model_url=\"openai\u002Fgpt-4o\")`\n\n- **AWS GovCloud Bedrock** (assumes AWS_ACCESS_KEY_ID and\n  AWS_SECRET_ACCESS_KEY are set as environment variables)\n\n  ``` python\n  from onprem import LLM\n  inference_arn = \"YOUR INFERENCE ARN\"\n  endpoint_url = \"YOUR ENDPOINT URL\"\n  region_name = \"us-gov-east-1\" # replace as necessary\n  # set up LLM connection to Bedrock on AWS GovCloud\n  llm = LLM( f\"govcloud-bedrock:\u002F\u002F{inference_arn}\", region_name=region_name, endpoint_url=endpoint_url)\n  response = llm.prompt(\"Write a haiku about the moon.\")\n  ```\n\nThe instantiations above are described in more detail below.\n\n#### GGUF Models and Llama.cpp\n\nThe default LLM backend is\n[llama-cpp-python](https:\u002F\u002Fgithub.com\u002Fabetlen\u002Fllama-cpp-python), and the\ndefault model is currently a 7B-parameter model called\n**Zephyr-7B-beta**, which is automatically downloaded and used.\nLlama.cpp runs models in [GGUF](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fhub\u002Fen\u002Fgguf)\nformat. The two other default models are `llama` and `mistral`. For\ninstance, if `default_model='llama'` is supplied, then a\n**Llama-3.1-8B-Instruct** model is automatically downloaded and used:\n\n``` python\n# Llama 3.1 is downloaded here and the correct prompt template for Llama-3.1 is automatically configured and used\nllm = LLM(default_model='llama')\n```\n\n*Choosing Your Own Models:* Of course, you can also easily supply the\nURL or path to an LLM of your choosing to\n[`LLM`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm) (see the\n[FAQ](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#faq) for an example).\n\n*Supplying Extra Parameters:* Any extra parameters supplied to\n[`LLM`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm) are forwarded\ndirectly to\n[llama-cpp-python](https:\u002F\u002Fgithub.com\u002Fabetlen\u002Fllama-cpp-python), the\ndefault LLM backend.\n\n#### Changing the Default LLM Backend\n\nIf `default_engine=\"transformers\"` is supplied to\n[`LLM`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm), Hugging Face\n[transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) is used as\nthe LLM backend. Extra parameters to\n[`LLM`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm) (e.g.,\n`device='cuda'`) are forwarded directly to `transformers.pipeline`. 
If supplying a `model_id`\nparameter, the default LLM backend is automatically changed to Hugging\nFace [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers).\n\n``` python\n# Llama-3.1 model quantized using AWQ is downloaded and run with Hugging Face transformers (requires GPU)\nllm = LLM(default_model='llama', default_engine='transformers')\n\n# Using a custom model with Hugging Face Transformers\nllm = LLM(model_id='Qwen\u002FQwen2.5-0.5B-Instruct', device_map='cpu')\n```\n\nSee\n[here](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#using-hugging-face-transformers-instead-of-llama.cpp)\nfor more information about using Hugging Face\n[transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) as the LLM\nbackend.\n\nYou can also connect to **Ollama**, local LLM APIs (e.g., vLLM), and\ncloud LLMs.\n\n``` python\n# connecting to an LLM served by Ollama\nllm = LLM(model_url='ollama\u002Fllama3.2')\n\n# connecting to an LLM served through vLLM (set API key as needed)\nllm = LLM(model_url='http:\u002F\u002Flocalhost:8000\u002Fv1', api_key='token-abc123', model='Qwen\u002FQwen2.5-0.5B-Instruct')\n\n# connecting to a cloud-backed LLM (e.g., OpenAI, Anthropic).\nllm = LLM(model_url=\"openai\u002Fgpt-4o-mini\")  # OpenAI\nllm = LLM(model_url=\"anthropic\u002Fclaude-3-7-sonnet-20250219\") # Anthropic\n```\n\n**OnPrem.LLM** supports any provider and model supported by the\n[LiteLLM](https:\u002F\u002Fgithub.com\u002FBerriAI\u002Flitellm) package.\n\nSee\n[here](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#connecting-to-llms-served-through-rest-apis)\nfor more information on *local* LLM APIs.\n\nMore information on using OpenAI models specifically with **OnPrem.LLM**\nis [here](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_openai.html).\n\n#### Supplying Parameters to the LLM Backend\n\nExtra parameters supplied to\n[`LLM`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm) and\n[`LLM.prompt`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm.prompt)\nare passed directly to the LLM backend. Parameter names will vary\ndepending on the backend you chose.\n\nFor instance, with the default llama-cpp backend, the default context\nwindow size (`n_ctx`) is set to 3900 and the default output size\n(`max_tokens`) is set to 512. Both are configurable parameters to\n[`LLM`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm). Increase if\nyou have larger prompts or need longer outputs. Other parameters (e.g.,\n`api_key`, `device_map`, etc.) can be supplied directly to\n[`LLM`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm) and will be\nrouted to the LLM backend or API (e.g., llama-cpp-python, Hugging Face\ntransformers, vLLM, OpenAI, etc.). 
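\n\nFor example, a minimal sketch of overriding these defaults with the default llama-cpp backend (the specific values below are illustrative, not recommended settings):\n\n``` python\nfrom onprem import LLM\n\n# n_ctx and max_tokens are forwarded to llama-cpp-python; the values here are examples only\nllm = LLM(n_ctx=8192, max_tokens=1024)\nresult = llm.prompt('Summarize the benefits of running LLMs locally in two sentences.')\n```\n\n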
The `max_tokens` parameter can also\nbe adjusted on-the-fly by supplying it to\n[`LLM.prompt`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm.prompt).\n\nOn the other hand, for Ollama models, context window and output size are\ncontrolled by `num_ctx` and `num_predict`, respectively.\n\nWith the Hugging Face transformers, setting the context window size is\nnot needed, but the output size is controlled by the `max_new_tokens`\nparameter to\n[`LLM.prompt`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm.prompt).\n\n#### Using Hugging Face Transformers Instead of Llama.cpp\n\nBy default, the LLM backend employed by **OnPrem.LLM** is\n[llama-cpp-python](https:\u002F\u002Fgithub.com\u002Fabetlen\u002Fllama-cpp-python), which\nrequires models in [GGUF format](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fhub\u002Fgguf).\nAs of v0.5.0, it is now possible to use [Hugging Face\ntransformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) as the LLM\nbackend instead. This is accomplished by using the `model_id` parameter\n(instead of supplying a `model_url` argument). In the example below, we\nrun the\n[Llama-3.1-8B](https:\u002F\u002Fhuggingface.co\u002Fhugging-quants\u002FMeta-Llama-3.1-8B-Instruct-AWQ-INT4)\nmodel.\n\n``` python\n# llama-cpp-python does NOT need to be installed when using model_id parameter\nllm = LLM(model_id=\"hugging-quants\u002FMeta-Llama-3.1-8B-Instruct-AWQ-INT4\", device_map='cuda')\n```\n\nThis allows you to more easily use any model on the Hugging Face hub in\n[SafeTensors format](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fsafetensors\u002Findex)\nprovided it can be loaded with the Hugging Face `transformers.pipeline`.\nNote that, when using the `model_id` parameter, the `prompt_template` is\nset automatically by `transformers`.\n\nThe Llama-3.1 model loaded above was quantized using\n[AWQ](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fen\u002Fquantization\u002Fawq),\nwhich allows the model to fit onto smaller GPUs (e.g., laptop GPUs with\n6GB of VRAM) similar to the default GGUF format. AWQ models will require\nthe [autoawq](https:\u002F\u002Fpypi.org\u002Fproject\u002Fautoawq\u002F) package to be\ninstalled: `pip install autoawq` (AWQ only supports Linux system,\nincluding Windows Subsystem for Linux). If you do need to load a model\nthat is not quantized, you can supply a quantization configuration at\nload time (known as “inflight quantization”). In the following example,\nwe load an unquantized [Zephyr-7B-beta\nmodel](https:\u002F\u002Fhuggingface.co\u002FHuggingFaceH4\u002Fzephyr-7b-beta) that will be\nquantized during loading to fit on GPUs with as little as 6GB of VRAM:\n\n``` python\nfrom transformers import BitsAndBytesConfig\nquantization_config = BitsAndBytesConfig(\n    load_in_4bit=True,\n    bnb_4bit_quant_type=\"nf4\",\n    bnb_4bit_compute_dtype=\"float16\",\n    bnb_4bit_use_double_quant=True,\n)\nllm = LLM(model_id=\"HuggingFaceH4\u002Fzephyr-7b-beta\", device_map='cuda', \n          model_kwargs={\"quantization_config\":quantization_config})\n```\n\nWhen supplying a `quantization_config`, the\n[bitsandbytes](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fbitsandbytes\u002Fmain\u002Fen\u002Finstallation)\nlibrary, a lightweight Python wrapper around CUDA custom functions, in\nparticular 8-bit optimizers, matrix multiplication (LLM.int8()), and 8 &\n4-bit quantization functions, is used. 
There are ongoing efforts by the\nbitsandbytes team to support multiple backends in addition to CUDA. If\nyou receive errors related to bitsandbytes, please refer to the\n[bitsandbytes\ndocumentation](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fbitsandbytes\u002Fmain\u002Fen\u002Finstallation).\n\n## Built-In Web App\n\n**OnPrem.LLM** includes a built-in Web app to access the LLM. To start\nit, run the following command after installation:\n\n``` shell\nonprem --port 8000\n```\n\nThen, enter `localhost:8000` (or `\u003Cdomain_name>:8000` if running on\nremote server) in a Web browser to access the application:\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Famaiya_onprem_readme_4ecfc2777e0f.png\" border=\"1\" alt=\"screenshot\" width=\"775\"\u002F>\n\nFor more information, [see the corresponding\ndocumentation](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fwebapp.html).\n\n## Examples\n\nThe [documentation](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F) includes many\nexamples.\n\n### 💡 Getting Started\n\n| Documentation Link                                                  | Example                      |\n|---------------------------------------------------------------------|------------------------------|\n| [Prompting Examples](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples.html) | Problem-Solving With Prompts |\n\n### 📚 Document Processing\n\n| Documentation Link                                                                             | Example                                           |\n|------------------------------------------------------------------------------------------------|---------------------------------------------------|\n| [Text Extraction](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_text_extraction.html)               | Document Text Extraction (PDFs, Word, PowerPoint) |\n| [Document Summarization](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_summarization.html)          | Document Summarization                            |\n| [Information Extraction](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_information_extraction.html) | Information Extraction from Documents             |\n\n### 🧠 Question-Answering & Search\n\n| Documentation Link                                                                          | Example                                     |\n|---------------------------------------------------------------------------------------------|---------------------------------------------|\n| [RAG Example](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_rag.html)                            | Question-Answering with RAG                 |\n| [Vector Stores Tutorial](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_vectorstore_factory.html) | Using Different Vector Stores               |\n| [Semantic Similarity](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_semantic.html)               | Computing Semantic Similarity Between Texts |\n\n### 🎯 Classification & Analysis\n\n| Documentation Link                                                                           | Example                                  |\n|----------------------------------------------------------------------------------------------|------------------------------------------|\n| [Text Classification](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_classification.html)          | Few-Shot Text Classification             |\n| [Survey 
Analysis](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_qualitative_survey_analysis.html) | Auto-Coding Qualitative Survey Responses |\n| [Legal Analysis](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_legal_analysis.html)               | Legal and Regulatory Document Analysis   |\n\n### 🛠️ Advanced Features\n\n| Documentation Link                                                                 | Example                                            |\n|------------------------------------------------------------------------------------|----------------------------------------------------|\n| [Agent Examples](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_agent.html)              | Agent-Based Task Execution with Tools              |\n| [Structured Outputs](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_guided_prompts.html) | Structured and Guided Outputs with Pydantic Models |\n| [Workflow Builder](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fworkflows.html)                 | Workflow Builder for Document Analysis             |\n\n## FAQ\n\n1.  **How do I use other models with OnPrem.LLM?**\n\n    > You can supply any model of your choice using the `model_url` and\n    > `model_id` parameters to `LLM` (see cheat sheet above).\n\n    > Here, we will go into detail on how to supply a custom GGUF model\n    > using the llma.cpp backend.\n\n    > You can find llama.cpp-supported models with `GGUF` in the file\n    > name on\n    > [huggingface.co](https:\u002F\u002Fhuggingface.co\u002Fmodels?sort=trending&search=gguf).\n\n    > Make sure you are pointing to the URL of the actual GGUF model\n    > file, which is the “download” link on the model’s page. An example\n    > for **Mistral-7B** is shown below:\n\n    > \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Famaiya_onprem_readme_5da013180683.png\" border=\"1\" alt=\"screenshot\" width=\"775\"\u002F>\n\n    > When using the llama.cpp backend, GGUF models have specific prompt\n    > formats that need to supplied to `LLM`. For instance, the prompt\n    > template required for **Zephyr-7B**, as described on the [model’s\n    > page](https:\u002F\u002Fhuggingface.co\u002FTheBloke\u002Fzephyr-7B-beta-GGUF), is:\n    >\n    > `\u003C|system|>\\n\u003C\u002Fs>\\n\u003C|user|>\\n{prompt}\u003C\u002Fs>\\n\u003C|assistant|>`\n    >\n    > So, to use the **Zephyr-7B** model, you must supply the\n    > `prompt_template` argument to the `LLM` constructor (or specify it\n    > in the `webapp.yml` configuration for the Web app).\n    >\n    > ``` python\n    > # how to use Zephyr-7B with OnPrem.LLM\n    > llm = LLM(model_url='https:\u002F\u002Fhuggingface.co\u002FTheBloke\u002Fzephyr-7B-beta-GGUF\u002Fresolve\u002Fmain\u002Fzephyr-7b-beta.Q4_K_M.gguf',\n    >           prompt_template = \"\u003C|system|>\\n\u003C\u002Fs>\\n\u003C|user|>\\n{prompt}\u003C\u002Fs>\\n\u003C|assistant|>\",\n    >           n_gpu_layers=33)\n    > llm.prompt(\"List three cute names for a cat.\")\n    > ```\n\n    > Prompt templates are **not** required for any other LLM backend\n    > (e.g., when using Ollama as backend or when using `model_id`\n    > parameter for transformers models). Prompt templates are also not\n    > required if using any of the default models.\n\n2.  
**When installing `onprem`, I’m getting “build” errors related to\n    `llama-cpp-python` (or `chroma-hnswlib`) on Windows\u002FMac\u002FLinux?**\n\n    > See [this LangChain documentation on\n    > LLama.cpp](https:\u002F\u002Fpython.langchain.com\u002Fdocs\u002Fintegrations\u002Fllms\u002Fllamacpp)\n    > for help on installing the `llama-cpp-python` package for your\n    > system. Additional tips for different operating systems are shown\n    > below:\n\n    > For **Linux** systems like Ubuntu, try this:\n    > `sudo apt-get install build-essential g++ clang`. Other tips are\n    > [here](https:\u002F\u002Fgithub.com\u002Foobabooga\u002Ftext-generation-webui\u002Fissues\u002F1534).\n\n    > For **Windows** systems, please try following [these\n    > instructions](https:\u002F\u002Fgithub.com\u002Famaiya\u002Fonprem\u002Fblob\u002Fmaster\u002FMSWindows.md).\n    > We recommend you use [Windows Subsystem for Linux\n    > (WSL)](https:\u002F\u002Flearn.microsoft.com\u002Fen-us\u002Fwindows\u002Fwsl\u002Finstall)\n    > instead of using Microsoft Windows directly. If you do need to use\n    > Microsoft Window directly, be sure to install the [Microsoft C++\n    > Build\n    > Tools](https:\u002F\u002Fvisualstudio.microsoft.com\u002Fvisual-cpp-build-tools\u002F)\n    > and make sure the **Desktop development with C++** is selected.\n\n    > For **Macs**, try following [these\n    > tips](https:\u002F\u002Fgithub.com\u002Fimartinez\u002FprivateGPT\u002Fissues\u002F445#issuecomment-1563333950).\n\n    > There are also various other tips for each of the above OSes in\n    > [this privateGPT repo\n    > thread](https:\u002F\u002Fgithub.com\u002Fimartinez\u002FprivateGPT\u002Fissues\u002F445). Of\n    > course, you can also [easily\n    > use](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1LVeacsQ9dmE1BVzwR3eTLukpeRIMmUqi?usp=sharing)\n    > **OnPrem.LLM** on Google Colab.\n\n    > Finally, if you still can’t overcome issues with building\n    > `llama-cpp-python`, you can try [installing the pre-built wheel\n    > file](https:\u002F\u002Fabetlen.github.io\u002Fllama-cpp-python\u002Fwhl\u002Fcpu\u002Fllama-cpp-python\u002F)\n    > for your system:\n\n    > **Example:**\n    > `pip install llama-cpp-python==0.2.90 --extra-index-url https:\u002F\u002Fabetlen.github.io\u002Fllama-cpp-python\u002Fwhl\u002Fcpu`\n    >\n    > **Tip:** There are [pre-built wheel files for\n    > `chroma-hnswlib`](https:\u002F\u002Fpypi.org\u002Fproject\u002Fchroma-hnswlib\u002F#files),\n    > as well. If running `pip install onprem` fails on building\n    > `chroma-hnswlib`, it may be because a pre-built wheel doesn’t yet\n    > exist for the version of Python you’re using (in which case you\n    > can try downgrading Python).\n\n3.  
**I’m behind a corporate firewall and am receiving an SSL error when\n    trying to download the model?**\n\n    > Try this:\n    >\n    > ``` python\n    > from onprem import LLM\n    > LLM.download_model(url, ssl_verify=False)\n    > ```\n\n    > You can download the embedding model (used by `LLM.ingest` and\n    > `LLM.ask`) as follows:\n    >\n    > ``` sh\n    > wget --no-check-certificate https:\u002F\u002Fpublic.ukp.informatik.tu-darmstadt.de\u002Freimers\u002Fsentence-transformers\u002Fv0.2\u002Fall-MiniLM-L6-v2.zip\n    > ```\n\n    > Supply the unzipped folder name as the `embedding_model_name`\n    > argument to `LLM`.\n\n    > If you’re getting SSL errors even when running `pip install`, try\n    > this:\n    >\n    > ``` sh\n    > pip install --trusted-host pypi.org --trusted-host files.pythonhosted.org pip_system_certs\n    > ```\n\n4.  **How do I use this on a machine with no internet access?**\n\n    > Use the `LLM.download_model` method to download the model files to\n    > `\u003Cyour_home_directory>\u002Fonprem_data` and transfer them to the same\n    > location on the air-gapped machine.\n\n    > For the `ingest` and `ask` methods, you will need to also download\n    > and transfer the embedding model files:\n    >\n    > ``` python\n    > from sentence_transformers import SentenceTransformer\n    > model = SentenceTransformer('sentence-transformers\u002Fall-MiniLM-L6-v2')\n    > model.save('\u002Fsome\u002Ffolder')\n    > ```\n\n    > Copy the `\u002Fsome\u002Ffolder` folder to the air-gapped machine and supply\n    > the path to `LLM` via the `embedding_model_name` parameter.\n\n5.  **My model is not loading when I call `llm = LLM(...)`?**\n\n    > This can happen if the model file is corrupt (in which case you\n    > should delete it from `\u003Chome directory>\u002Fonprem_data` and\n    > re-download). It can also happen if the version of\n    > `llama-cpp-python` needs to be upgraded to the latest.\n\n6.  **I’m getting an `“Illegal instruction (core dumped)` error when\n    instantiating a `langchain.llms.Llamacpp` or `onprem.LLM` object?**\n\n    > Your CPU may not support instructions that `cmake` is using for\n    > one reason or another (e.g., [due to Hyper-V in VirtualBox\n    > settings](https:\u002F\u002Fstackoverflow.com\u002Fquestions\u002F65780506\u002Fhow-to-enable-avx-avx2-in-virtualbox-6-1-16-with-ubuntu-20-04-64bit)).\n    > You can try turning them off when building and installing\n    > `llama-cpp-python`:\n\n    > ``` sh\n    > # example\n    > CMAKE_ARGS=\"-DGGML_CUDA=ON -DGGML_AVX2=OFF -DGGML_AVX=OFF -DGGML_F16C=OFF -DGGML_FMA=OFF\" FORCE_CMAKE=1 pip install --force-reinstall llama-cpp-python --no-cache-dir\n    > ```\n\n7.  **How can I speed up\n    [`LLM.ingest`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm.ingest)?**\n\n    > By default, a GPU, if available, will be used to compute\n    > embeddings, so ensure PyTorch is installed with GPU support. 
You\n    > can explicitly control the device used for computing embeddings\n    > with the `embedding_model_kwargs` argument.\n    >\n    > ``` python\n    > from onprem import LLM\n    > llm  = LLM(embedding_model_kwargs={'device':'cuda'})\n    > ```\n\n    > You can also supply `store_type=\"sparse\"` to `LLM` to use a sparse\n    > vector store, which sacrifices a small amount of inference speed\n    > (`LLM.ask`) for significant speed ups during ingestion\n    > (`LLM.ingest`).\n    >\n    > ``` python\n    > from onprem import LLM\n    > llm  = LLM(store_type=\"sparse\")\n    > ```\n    >\n    > Note, however, that, unlike dense vector stores, sparse vector\n    > stores assume answer sources will contain at least one word in\n    > common with the question.\n\n\u003C!--\n8. **What are ways in which OnPrem.LLM has been used?**\n    > Examples include:\n    > - extracting key performance parameters and other performance attributes from engineering documents\n    > - auto-coding responses to government requests for information (RFIs)\n    > - analyzing the Federal Aquisition Regulations (FAR)\n    > - understanding where and how Executive Order 14028 on cybersecurity aligns with the National Cybersecurity Strategy\n    > - generating a summary of ways to improve a course from thousdands of reviews\n    > - extracting specific information of interest from resumes for talent acquisition.\n&#10;-->\n\n## How to Cite\n\nPlease cite the [following paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.21040) when\nusing **OnPrem.LLM**:\n\n    @article{maiya2025generativeaiffrdcs,\n          title={Generative AI for FFRDCs}, \n          author={Arun S. Maiya},\n          year={2025},\n          eprint={2509.21040},\n          archivePrefix={arXiv},\n          primaryClass={cs.CL},\n          url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.21040}, \n    }\n","# 本地部署.LLM\n\n\n\u003C!-- 警告：此文件由自动化工具生成！请勿编辑！ -->\n\n> 一个注重隐私的文档智能工具包——默认本地运行，同时支持云端\n\n**[OnPrem.LLM](https:\u002F\u002Fgithub.com\u002Famaiya\u002Fonprem)**（简称“OnPrem”）是一个基于Python的工具包，用于在离线或受限环境中将大型语言模型（LLMs）应用于敏感的非公开数据。该工具主要受到[privateGPT](https:\u002F\u002Fgithub.com\u002Fimartinez\u002FprivateGPT)项目的启发，设计为完全本地执行，但也支持与多种云端LLM提供商（如OpenAI、Anthropic）集成。\n\n**主要特性：**\n\n- 完全本地执行，并可根据需要选择使用云端服务。详情请参阅[速查表](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#cheat-sheet)。\n- 针对[多种任务](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#examples)的分析流水线，包括信息抽取、摘要生成、分类、问答以及智能代理等。\n- 通过诸如[SparseStore](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_rag.html#advanced-example-nsf-awards)等模块，支持计算资源有限的环境（例如，无需预先存储嵌入即可实现RAG）。\n- 可轻松与本地环境中的现有工具集成，如[Elasticsearch和SharePoint](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_vectorstore_factory.html)。\n- 提供一个[可视化工作流构建器](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fworkflows.html#visual-workflow-builder)，可通过点选式界面组装复杂的文档分析流水线。\n\n完整文档请见[这里](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F)。\n\n\u003C!-- **OnPrem.LLM** 的 Google Colab 安装与使用演示可在[此处](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1LVeacsQ9dmE1BVzwR3eTLukpeRIMmUqi?usp=sharing)查看。-->\n\n**快速入门**\n\n``` python\n# 安装\n!pip install onprem[chroma]\nfrom onprem import LLM, utils\n\n# 使用Ollama作为后端的本地LLM\n!ollama pull llama3.2\nllm = LLM('ollama\u002Fllama3.2')\n\n# 基本提示\nresult = llm.prompt('给我一个关于LLM的简短一句话定义。')\n\n# RAG\nutils.download('https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2505.07672', '\u002Ftmp\u002Fmy_documents\u002Fpaper.pdf')\nllm.ingest('\u002Ftmp\u002Fmy_documents')\nresult = 
llm.ask('OnPrem.LLM是什么？')\n\n# 切换到以Anthropic为后端的云端LLM\nllm = LLM(\"anthropic\u002Fclaude-3-7-sonnet-latest\")\n\n# 结构化输出\nfrom pydantic import BaseModel, Field\nclass MeasuredQuantity(BaseModel):\n    value: str = Field(description=\"数值\")\n    unit: str = Field(description=\"计量单位\")\nstructured_output = llm.pydantic_prompt('他当时的速度是35英里每小时。', pydantic_model=MeasuredQuantity)\nprint(structured_output.value) # 35\nprint(structured_output.unit)  # 英里\u002F小时\n\n# 安全启动沙盒化的AI智能代理\nfrom onprem.pipelines import AgentExecutor\nexecutor = AgentExecutor(model='openai\u002Fgpt-5-mini', sandbox=True)\nresult = executor.run(\"\"\"\n搜索此目录下的所有.md文件，并：\n1. 提取所有标题（# ## ###）\n2. 统计每个文件的总字数\n3. 创建一个索引文件'documentation_index.md'，内容包括：\n   - 所有Markdown文件的列表\n   - 每个文件的字数统计\n   - 主要讨论的主题（从标题中提取）\n\"\"\")\n```\n\n支持的LLM后端众多，例如：\n[llama_cpp](https:\u002F\u002Fgithub.com\u002Fabetlen\u002Fllama-cpp-python)、\n[transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers)、\n[Ollama](https:\u002F\u002Follama.com\u002F)、\n[vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm)、\n[OpenAI](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fmodels)、\n[Anthropic](https:\u002F\u002Fdocs.anthropic.com\u002Fen\u002Fdocs\u002Fabout-claude\u002Fmodels\u002Foverview) 等。\n\n------------------------------------------------------------------------\n\n\u003Ccenter>\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Famaiya_onprem_readme_06fc2ca0b16a.png\" border=\"0\" alt=\"onprem.llm\" width=\"200\"\u002F>\n\u003C\u002Fp>\n\u003C\u002Fcenter>\n\u003Ccenter>\n\u003Cp align=\"center\">\n\n**[安装](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#install) \\| [使用方法](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#how-to-use) \\| [Web UI](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fwebapp.html) \\| [示例](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#examples) \\| [常见问题](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#faq) \\| [如何引用](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#how-to-cite)**\n\n\u003C\u002Fp>\n\u003C\u002Fcenter>\n\n*最新消息* 🔥\n\n- \\[2026年3月\\] 发布v0.22.0版本，新增**AgentExecutor**：只需两行代码即可在沙盒环境中安全启动AI智能代理来解决问题。详情请参阅[关于智能代理的示例笔记本](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_agent.html)。\n- \\[2026年1月\\] 发布v0.21.0版本，新增支持**基于元数据的查询路由**。详情请参阅[此处的查询路由示例](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fpipelines.rag.html#example-using-query-routing-with-rag)。此外，本次发布还包括[提供商原生支持的结构化输出](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#natively-supported-structured-outputs)（例如，OpenAI、Anthropic和AWS GovCloud Bedrock提供的结构化输出）。\n- \\[2025年12月\\] 发布v0.20.0版本，新增支持**异步提示**。详情请参阅[此处的示例](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples.html#asynchronous-prompts)。\n- \\[2025年9月\\] 发布v0.19.0版本，新增支持**工作流**：用于复杂文档分析的YAML配置流水线。更多信息请参阅[工作流文档](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fworkflows.html)。\n- \\[2025年8月\\] 发布v0.18.0版本，现已可与AWS GovCloud的LLM一起使用。更多信息请参阅[此处的示例](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.backends.html#examples)。\n- \\[2025年7月\\] 发布v0.17.0版本，现在可以直接连接到SharePoint进行搜索和RAG操作。更多信息请参阅[关于向量存储的示例笔记本](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_vectorstore_factory.html#rag-with-sharepoint-documents)。\n\n------------------------------------------------------------------------\n\n## 安装\n\n在您已经[安装 PyTorch](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Flocally\u002F) 之后，您可以使用以下命令安装 **OnPrem.LLM**：\n\n``` sh\npip install 
onprem\n```\n\n**Chroma**：如果您使用默认的 Chroma “Dense” 向量存储（而不是 [稀疏向量存储](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#step-1-ingest-the-documents-into-a-vector-database)）进行 RAG 操作，请运行 `pip install[chroma]`。\n\n**AI 代理**：如果您使用 OnPrem.LLM 来启动 [AI 代理](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_agent.html)，请运行 `pip install onprem[agent]`。\n\n**Llama-cpp-python 是可选的：**\n\n如果您将 llama-cpp-python 用作 LLM 后端：\n\n- **CPU**：`pip install llama-cpp-python`（对于 Microsoft Windows，需要[额外步骤](https:\u002F\u002Fgithub.com\u002Famaiya\u002Fonprem\u002Fblob\u002Fmaster\u002FMSWindows.md)）\n- **GPU**：请按照[下方说明](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#on-gpu-accelerated-inference-with-llama-cpp-python)操作。\n\n如果满足以下任一条件，则安装 llama-cpp-python 是 *可选的*：\n\n- 您正在使用 [Ollama](https:\u002F\u002Follama.com\u002F) 作为 LLM 后端。\n- 您通过提供 `model_id` 参数来实例化 LLM，从而使用 Hugging Face Transformers（而非 llama-cpp-python）作为 LLM 后端，如[此处所示](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#using-hugging-face-transformers-instead-of-llama.cpp)。\n- 您正在使用 **OnPrem.LLM** 并通过 [外部 REST API](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#cheat-sheet) 提供的 LLM（例如 vLLM、OpenLLM）。\n- 您正在使用 **OnPrem.LLM** 与 [云端 LLM](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#cheat-sheet) 配合使用（请参阅下方的备忘单）。\n\n### 使用 `llama-cpp-python` 进行 GPU 加速推理\n\n当您使用 `pip install llama-cpp-python` 安装 **llama-cpp-python** 时，LLM 将在您的 **CPU** 上运行。为了更快地生成答案，您可以根据您的操作系统构建 **llama-cpp-python**，使其在您的 **GPU** 上运行。\n\n- **Linux**：\n  `CMAKE_ARGS=\"-DGGML_CUDA=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir`\n- **Mac**：`CMAKE_ARGS=\"-DGGML_METAL=on\" pip install llama-cpp-python`\n- **Windows 11**：请遵循[此处](https:\u002F\u002Fgithub.com\u002Famaiya\u002Fonprem\u002Fblob\u002Fmaster\u002FMSWindows.md#using-the-system-python-in-windows-11s)的说明。\n- **Windows Subsystem for Linux (WSL2)**：请遵循[此处](https:\u002F\u002Fgithub.com\u002Famaiya\u002Fonprem\u002Fblob\u002Fmaster\u002FMSWindows.md#using-wsl2-with-gpu-acceleration)的说明。\n\n对于 Linux 和 Windows 系统，在运行上述安装命令之前，您需要先安装[最新的 NVIDIA 驱动程序](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fdrivers\u002F)以及 [CUDA 工具包](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-downloads)。\n\n按照上述说明操作后，在实例化 LLM 时，请提供 `n_gpu_layers=-1` 参数，以利用您的 GPU 进行快速推理：\n\n``` python\nllm = LLM(n_gpu_layers=-1, ...)\n```\n\n通常，参数量为 80 亿及以下的量化模型可以在显存仅为 6GB 的 GPU 上运行。如果模型无法完全加载到您的 GPU 中（例如出现“CUDA Error: Out-of-Memory”错误），您可以尝试调整 `n_gpu_layers` 参数的值（例如 `n_gpu_layers=20`），将部分层卸载到 CPU 上。如上所示设置 `n_gpu_layers=-1` 则会将所有层卸载到 GPU 上。\n\n如果您在安装 [llama-cpp-python](https:\u002F\u002Fpypi.org\u002Fproject\u002Fllama-cpp-python\u002F) 时遇到问题，可以参阅[常见问题解答](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#faq)获取更多提示。\n\n## 使用方法\n\n### 设置\n\n``` python\nfrom onprem import LLM\n\nllm = LLM(verbose=False) # 默认使用默认模型和后端\n```\n\n#### 备忘单\n\n*本地模型*：支持多种不同的本地 LLM 后端。\n\n- **Llama-cpp**：`llm = LLM(default_model=\"llama\", n_gpu_layers=-1)`\n- **通过 URL 使用选定的 GGUF 模型**：\n\n  ``` python\n   # 对于用户提供的 GGUF 模型，需要提示模板（详见 FAQ）\n   llm = LLM(model_url='https:\u002F\u002Fhuggingface.co\u002FTheBloke\u002Fzephyr-7B-beta-GGUF\u002Fresolve\u002Fmain\u002Fzephyr-7b-beta.Q4_K_M.gguf', \n             prompt_template= \"\u003C|system|>\\n\u003C\u002Fs>\\n\u003C|user|>\\n{prompt}\u003C\u002Fs>\\n\u003C|assistant|>\", n_gpu_layers=-1)\n  ```\n\n- **通过文件路径使用选定的 GGUF 模型**：\n\n  ``` python\n   # 对于用户提供的 GGUF 模型，需要提示模板（详见 FAQ）\n   llm = LLM(model_url='zephyr-7b-beta.Q4_K_M.gguf', \n             
model_download_path='\u002Fpath\u002Fto\u002Ffolder\u002Fto\u002Fwhere\u002Fyou\u002Fdownloaded\u002Fmodel',\n             prompt_template= \"\u003C|system|>\\n\u003C\u002Fs>\\n\u003C|user|>\\n{prompt}\u003C\u002Fs>\\n\u003C|assistant|>\", n_gpu_layers=-1)\n  ```\n\n- **Hugging Face Transformers**：\n  `llm = LLM(model_id='Qwen\u002FQwen2.5-0.5B-Instruct', device='cuda')`\n\n- **Ollama**：`llm = LLM(model_url=\"ollama:\u002F\u002Fllama3.2\", api_key='na')`\n\n- **同样使用 Ollama**：\n  `llm = LLM(model_url=\"ollama\u002Fllama3.2\", api_key='na')`\n\n- **再使用 Ollama**：\n  `llm = LLM(model_url='http:\u002F\u002Flocalhost:11434\u002Fv1', api_key='na', model='llama3.2')`\n\n- **vLLM**：\n  `llm = LLM(model_url='http:\u002F\u002Flocalhost:8666\u002Fv1', api_key='na', model='Qwen\u002FQwen2.5-0.5B-Instruct')`\n\n- **同样使用 vLLM**：\n  `llm = LLM('hosted_vllm\u002Fserved-model-name', api_base=\"http:\u002F\u002Flocalhost:8666\u002Fv1\", api_key=\"test123\")`\n  （假设已向 `vllm.entrypoints.openai.api_server` 提供了 `served-model-name` 参数）。\n\n- **使用 gpt-oss 的 vLLM**（假设已向 vLLM 提供了 `served-model-name` 参数）：\n\n  ``` python\n  # 重要提示：由于会生成中间推理步骤，需将 max_tokens 设置为较高值\n  llm = LLM(model_url='http:\u002F\u002Flocalhost:8666\u002Fv1', api_key='your_api_key', model=served_model_name, max_tokens=32000)\n  result = llm.prompt(prompt, reasoning_effort=\"high\")\n  ```\n\n*云模型*：除了本地 LLM 外，所有由 [LiteLLM](https:\u002F\u002Fgithub.com\u002FBerriAI\u002Flitellm) 支持的云 LLM 提供商也兼容：\n\n- **Anthropic Claude**：\n  `llm = LLM(model_url=\"anthropic\u002Fclaude-3-7-sonnet-latest\")`\n\n- **OpenAI GPT-4o**：`llm = LLM(model_url=\"openai\u002Fgpt-4o\")`\n\n- **AWS GovCloud Bedrock**（假设已将 AWS_ACCESS_KEY_ID 和 AWS_SECRET_ACCESS_KEY 设置为环境变量）\n\n  ``` python\n  from onprem import LLM\n  inference_arn = \"YOUR INFERENCE ARN\"\n  endpoint_url = \"YOUR ENDPOINT URL\"\n  region_name = \"us-gov-east-1\" # 根据需要替换\n  # 设置 LLM 与 AWS GovCloud 上的 Bedrock 的连接\n  llm = LLM( f\"govcloud-bedrock:\u002F\u002F{inference_arn}\", region_name=region_name, endpoint_url=endpoint_url)\n  response = llm.prompt(\"写一首关于月亮的俳句。\")\n  ```\n\n以上实例将在下文中更详细地介绍。\n\n#### GGUF 模型与 Llama.cpp\n\n默认的 LLM 后端是 [llama-cpp-python](https:\u002F\u002Fgithub.com\u002Fabetlen\u002Fllama-cpp-python)，而默认模型目前是一个名为 **Zephyr-7B-beta** 的 70 亿参数模型，该模型会自动下载并使用。Llama.cpp 运行的模型采用 [GGUF](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fhub\u002Fen\u002Fgguf) 格式。另外两个默认模型是 `llama` 和 `mistral`。例如，如果提供了 `default_model='llama'`，那么系统会自动下载并使用一个 **Llama-3.1-8B-Instsruct** 模型：\n\n``` python\n\n# Llama 3.1 在此处下载，并自动配置和使用适用于 Llama-3.1 的正确提示模板\nllm = LLM(default_model='llama')\n```\n\n*选择您自己的模型：* 当然，您也可以轻松地为 [`LLM`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm) 提供您所选 LLM 的 URL 或路径（有关示例，请参阅 [常见问题解答](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#faq)）。\n\n*提供额外参数：* 任何传递给 [`LLM`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm) 的额外参数都会直接转发给默认的 LLM 后端——[llama-cpp-python](https:\u002F\u002Fgithub.com\u002Fabetlen\u002Fllama-cpp-python)。\n\n#### 更改默认 LLM 后端\n\n如果向 [`LLM`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm) 传递 `default_engine=\"transformers\"`，则会使用 Hugging Face 的 [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) 作为 LLM 后端。传递给 [`LLM`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm) 的额外参数（例如 ‘device=’cuda’`) 将直接转发给 `transformers.pipeline`。如果提供了 `model_id` 参数，则默认 LLM 后端会自动切换为 Hugging Face 的 [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers)。\n\n``` python\n# 
使用 AWQ 量化的 LLama-3.1 模型被下载并由 Hugging Face transformers 运行（需要 GPU）\nllm = LLM(default_model='llama', default_engine='transformers')\n\n# 使用自定义模型与 Hugging Face Transformers\nllm = LLM(model_id='Qwen\u002FQwen2.5-0.5B-Instruct', device_map='cpu')\n```\n\n有关将 Hugging Face 的 [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) 用作 LLM 后端的更多信息，请参阅\n[此处](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#using-hugging-face-transformers-instead-of-llama.cpp)。\n\n您还可以连接到 **Ollama**、本地 LLM API（如 vLLM）以及云端 LLM。\n\n``` python\n# 连接到由 Ollama 提供服务的 LLM\nlm = LLM(model_url='ollama\u002Fllama3.2')\n\n# 连接到通过 vLLM 提供服务的 LLM（根据需要设置 API 密钥）\nllm = LLM(model_url='http:\u002F\u002Flocalhost:8000\u002Fv1', api_key='token-abc123', model='Qwen\u002FQwen2.5-0.5B-Instruct')\n\n# 连接到云端支持的 LLM（如 OpenAI、Anthropic）。\nllm = LLM(model_url=\"openai\u002Fgpt-4o-mini\")  # OpenAI\nllm = LLM(model_url=\"anthropic\u002Fclaude-3-7-sonnet-20250219\") # Anthropic\n```\n\n**OnPrem.LLM** 支持 [LiteLLM](https:\u002F\u002Fgithub.com\u002FBerriAI\u002Flitellm) 包所支持的任何提供商和模型。\n\n有关 *本地* LLM API 的更多信息，请参阅\n[此处](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F#connecting-to-llms-served-through-rest-apis)。\n\n关于如何在 **OnPrem.LLM** 中专门使用 OpenAI 模型的更多信息，请参阅\n[此处](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_openai.html)。\n\n#### 向 LLM 后端传递参数\n\n传递给 [`LLM`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm) 和 [`LLM.prompt`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm.prompt) 的额外参数会直接传递给 LLM 后端。具体参数名称会因您选择的后端而异。\n\n例如，在默认的 llama-cpp 后端中，上下文窗口大小（`n_ctx`）默认设置为 3900，输出长度（`max_tokens`）默认设置为 512。这两者都是可配置的参数，可以通过 [`LLM`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm) 进行调整。如果您有较长的提示或需要更长的输出，可以适当增加这些值。其他参数（如 `api_key`、`device_map` 等）可以直接传递给 [`LLM`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm)，它们会被路由到相应的 LLM 后端或 API（如 llama-cpp-python、Hugging Face transformers、vLLM、OpenAI 等）。此外，`max_tokens` 参数也可以通过将其传递给 [`LLM.prompt`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm.prompt) 来动态调整。\n\n另一方面，对于 Ollama 模型，上下文窗口和输出长度分别由 `num_ctx` 和 `num_predict` 控制。\n\n而在使用 Hugging Face transformers 时，无需单独设置上下文窗口大小，但输出长度则由 [`LLM.prompt`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm.prompt) 中的 `max_new_tokens` 参数控制。\n\n#### 使用 Hugging Face Transformers 替代 Llama.cpp\n\n默认情况下，**OnPrem.LLM** 使用的 LLM 后端是 [llama-cpp-python](https:\u002F\u002Fgithub.com\u002Fabetlen\u002Fllama-cpp-python)，该后端要求模型采用 [GGUF 格式](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fhub\u002Fgguf)。从版本 0.5.0 开始，现在也可以使用 Hugging Face 的 [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) 作为 LLM 后端。实现这一点的方法是使用 `model_id` 参数（而不是提供 `model_url` 参数）。在下面的示例中，我们运行的是\n[Llama-3.1-8B](https:\u002F\u002Fhuggingface.co\u002Fhugging-quants\u002FMeta-Llama-3.1-8B-Instruct-AWQ-INT4)\n模型。\n\n``` python\n# 使用 `model_id` 参数时，无需安装 llama-cpp-python\nllm = LLM(model_id=\"hugging-quants\u002FMeta-Llama-3.1-8B-Instruct-AWQ-INT4\", device_map='cuda')\n```\n\n这样一来，您可以更方便地使用 Hugging Face 模块上以 [SafeTensors 格式](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fsafetensors\u002Findex) 存储的任何模型，只要这些模型能够通过 Hugging Face 的 `transformers.pipeline` 加载即可。需要注意的是，当使用 `model_id` 参数时，`prompt_template` 会由 `transformers` 自动设置。\n\n上述加载的 Llama-3.1 模型是使用 [AWQ](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Ftransformers\u002Fmain\u002Fen\u002Fquantization\u002Fawq) 量化的，这使得该模型能够在较小的 GPU 上运行（例如配备 6GB 显存的笔记本电脑 GPU），类似于默认的 GGUF 格式。使用 AWQ 量化的模型需要安装 
[autoawq](https:\u002F\u002Fpypi.org\u002Fproject\u002Fautoawq\u002F) 包：`pip install autoawq`（AWQ 仅支持 Linux 系统，包括 Windows Subsystem for Linux）。如果您需要加载未量化的模型，可以在加载时指定量化配置（称为“飞行中量化”）。在下面的示例中，我们加载了一个未量化的 [Zephyr-7B-beta 模型](https:\u002F\u002Fhuggingface.co\u002FHuggingFaceH4\u002Fzephyr-7b-beta)，它将在加载过程中被量化，以便能够在显存仅为 6GB 的 GPU 上运行：\n\n``` python\nfrom transformers import BitsAndBytesConfig\nquantization_config = BitsAndBytesConfig(\n    load_in_4bit=True,\n    bnb_4bit_quant_type=\"nf4\",\n    bnb_4bit_compute_dtype=\"float16\",\n    bnb_4bit_use_double_quant=True,\n)\nllm = LLM(model_id=\"HuggingFaceH4\u002Fzephyr-7b-beta\", device_map='cuda', \n          model_kwargs={\"quantization_config\":quantization_config})\n```\n\n在提供 `quantization_config` 时，会使用 [bitsandbytes](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fbitsandbytes\u002Fmain\u002Fen\u002Finstallation) 库——这是一个轻量级的 Python 封装库，用于 CUDA 自定义函数，特别是 8 位优化器、矩阵乘法（LLM.int8()）以及 8 和 4 位量化功能。目前，bitsandbytes 团队正在努力支持除 CUDA 之外的其他后端。如果您遇到与 bitsandbytes 相关的错误，请参考 [bitsandbytes 文档](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fbitsandbytes\u002Fmain\u002Fen\u002Finstallation)。\n\n## 内置 Web 应用\n\n**OnPrem.LLM** 包含一个内置的 Web 应用，用于访问 LLM。安装完成后，运行以下命令即可启动：\n\n``` shell\nonprem --port 8000\n```\n\n然后，在浏览器中输入 `localhost:8000`（如果在远程服务器上运行，则输入 `\u003Cdomain_name>:8000`），即可访问该应用：\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Famaiya_onprem_readme_4ecfc2777e0f.png\" border=\"1\" alt=\"screenshot\" width=\"775\"\u002F>\n\n更多信息，请参阅[相关文档](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fwebapp.html)。\n\n## 示例\n\n[文档](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002F)中包含大量示例。\n\n### 💡 入门\n\n| 文档链接                                                  | 示例                      |\n|---------------------------------------------------------------------|------------------------------|\n| [提示示例](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples.html) | 使用提示解决问题 |\n\n### 📚 文档处理\n\n| 文档链接                                                                             | 示例                                           |\n|------------------------------------------------------------------------------------------------|---------------------------------------------------|\n| [文本提取](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_text_extraction.html)               | 文档文本提取（PDF、Word、PowerPoint） |\n| [文档摘要](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_summarization.html)          | 文档摘要                            |\n| [信息抽取](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_information_extraction.html) | 从文档中抽取信息             |\n\n### 🧠 问答与搜索\n\n| 文档链接                                                                          | 示例                                     |\n|---------------------------------------------------------------------------------------------|---------------------------------------------|\n| [RAG 示例](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_rag.html)                            | 基于 RAG 的问答                 |\n| [向量存储教程](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_vectorstore_factory.html) | 使用不同的向量存储              |\n| [语义相似度](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_semantic.html)               | 计算文本之间的语义相似度   |\n\n### 🎯 分类与分析\n\n| 文档链接                                                                           | 示例                                  
|\n|----------------------------------------------------------------------------------------------|------------------------------------------|\n| [文本分类](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_classification.html)          | 少样本文本分类             |\n| [问卷分析](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_qualitative_survey_analysis.html) | 定性问卷回复的自动编码       |\n| [法律分析](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_legal_analysis.html)               | 法律及监管文件分析   |\n\n### 🛠️ 高级功能\n\n| 文档链接                                                                 | 示例                                            |\n|------------------------------------------------------------------------------------|----------------------------------------------------|\n| [代理示例](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_agent.html)              | 基于代理的任务执行与工具              |\n| [结构化输出](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fexamples_guided_prompts.html) | 使用 Pydantic 模型实现结构化和引导式输出 |\n| [工作流构建器](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fworkflows.html)                 | 用于文档分析的工作流构建器             |\n\n## 常见问题解答\n\n1.  **如何将其他模型与 OnPrem.LLM 一起使用？**\n\n    > 您可以使用 `model_url` 和 `model_id` 参数为 `LLM` 提供任意自定义模型（请参阅上方的速查表）。\n\n    > 下面我们将详细介绍如何使用 llma.cpp 后端提供自定义 GGUF 模型。\n\n    > 您可以在 [huggingface.co](https:\u002F\u002Fhuggingface.co\u002Fmodels?sort=trending&search=gguf) 上找到文件名中带有 `GGUF` 的 llama.cpp 支持模型。\n\n    > 请确保指向的是实际 GGUF 模型文件的 URL，即模型页面上的“下载”链接。以下以 **Mistral-7B** 为例：\n\n    > \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Famaiya_onprem_readme_5da013180683.png\" border=\"1\" alt=\"screenshot\" width=\"775\"\u002F>\n\n    > 使用 llama.cpp 后端时，GGUF 模型需要特定的提示格式传递给 `LLM`。例如，根据 [模型页面](https:\u002F\u002Fhuggingface.co\u002FTheBloke\u002Fzephyr-7B-beta-GGUF) 的说明，**Zephyr-7B** 所需的提示模板为：\n    >\n    > `\u003C|system|>\\n\u003C\u002Fs>\\n\u003C|user|>\\n{prompt}\u003C\u002Fs>\\n\u003C|assistant|>`\n    >\n    > 因此，要使用 **Zephyr-7B** 模型，您必须在 `LLM` 构造函数中提供 `prompt_template` 参数（或在 Web 应用的 `webapp.yml` 配置中指定）。\n    >\n    > ``` python\n    > # 如何在 OnPrem.LLM 中使用 Zephyr-7B\n    > llm = LLM(model_url='https:\u002F\u002Fhuggingface.co\u002FTheBloke\u002Fzephyr-7B-beta-GGUF\u002Fresolve\u002Fmain\u002Fzephyr-7b-beta.Q4_K_M.gguf',\n    >           prompt_template = \"\u003C|system|>\\n\u003C\u002Fs>\\n\u003C|user|>\\n{prompt}\u003C\u002Fs>\\n\u003C|assistant|>\",\n    >           n_gpu_layers=33)\n    > llm.prompt(\"列出三个可爱的猫名字。\")\n    > ```\n\n    > 对于其他 LLM 后端（例如使用 Ollama 作为后端，或使用 `model_id` 参数加载 transformers 模型），则无需提供提示模板。此外，使用任何默认模型时也不需要提示模板。\n\n2.  
**在 Windows\u002FMac\u002FLinux 上安装 `onprem` 时，我遇到了与 `llama-cpp-python`（或 `chroma-hnswlib`）相关的“构建”错误，这是为什么？**\n\n    > 请参阅 [LangChain 关于 LLama.cpp 的文档](https:\u002F\u002Fpython.langchain.com\u002Fdocs\u002Fintegrations\u002Fllms\u002Fllamacpp)，了解如何为您的系统安装 `llama-cpp-python` 包。以下是针对不同操作系统的额外提示：\n\n    > 对于 Ubuntu 等 **Linux** 系统，您可以尝试运行：`sudo apt-get install build-essential g++ clang`。更多技巧请参见 [此处](https:\u002F\u002Fgithub.com\u002Foobabooga\u002Ftext-generation-webui\u002Fissues\u002F1534)。\n\n> 对于 **Windows** 系统，请尝试按照 [这些说明](https:\u002F\u002Fgithub.com\u002Famaiya\u002Fonprem\u002Fblob\u002Fmaster\u002FMSWindows.md) 操作。\n> 我们建议您使用 [适用于 Linux 的 Windows 子系统 (WSL)](https:\u002F\u002Flearn.microsoft.com\u002Fen-us\u002Fwindows\u002Fwsl\u002Finstall)，而不是直接使用 Microsoft Windows。如果您确实需要直接使用 Microsoft Windows，请务必安装 [Microsoft C++ 构建工具](https:\u002F\u002Fvisualstudio.microsoft.com\u002Fvisual-cpp-build-tools\u002F)，并确保选中 **使用 C++ 的桌面开发** 选项。\n\n> 对于 **Mac** 用户，请尝试按照 [这些提示](https:\u002F\u002Fgithub.com\u002Fimartinez\u002FprivateGPT\u002Fissues\u002F445#issuecomment-1563333950) 操作。\n\n> 在 [这个 privateGPT 仓库的讨论帖](https:\u002F\u002Fgithub.com\u002Fimartinez\u002FprivateGPT\u002Fissues\u002F445) 中，还提供了针对上述各操作系统的各种其他技巧。当然，您也可以在 Google Colab 上 [轻松使用](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1LVeacsQ9dmE1BVzwR3eTLukpeRIMmUqi?usp=sharing) **OnPrem.LLM**。\n\n> 最后，如果您仍然无法解决 `llama-cpp-python` 的构建问题，可以尝试为您的系统 [安装预编译的 wheel 文件](https:\u002F\u002Fabetlen.github.io\u002Fllama-cpp-python\u002Fwhl\u002Fcpu\u002Fllama-cpp-python\u002F)：\n\n> **示例：**\n> `pip install llama-cpp-python==0.2.90 --extra-index-url https:\u002F\u002Fabetlen.github.io\u002Fllama-cpp-python\u002Fwhl\u002Fcpu`\n\n> **提示：** 同样存在 [用于 `chroma-hnswlib` 的预编译 wheel 文件](https:\u002F\u002Fpypi.org\u002Fproject\u002Fchroma-hnswlib\u002F#files)。如果运行 `pip install onprem` 时因构建 `chroma-hnswlib` 而失败，可能是因为您使用的 Python 版本尚未有对应的预编译 wheel（此时您可以尝试降级 Python 版本）。\n\n3.  **我位于企业防火墙之后，在尝试下载模型时收到 SSL 错误？**\n\n    > 请尝试以下操作：\n    >\n    > ``` python\n    > from onprem import LLM\n    > LLM.download_model(url, ssl_verify=False)\n    > ```\n\n    > 您可以按如下方式下载嵌入模型（由 `LLM.ingest` 和 `LLM.ask` 使用）：\n    >\n    > ``` sh\n    > wget --no-check-certificate https:\u002F\u002Fpublic.ukp.informatik.tu-darmstadt.de\u002Freimers\u002Fsentence-transformers\u002Fv0.2\u002Fall-MiniLM-L6-v2.zip\n    > ```\n\n    > 将解压后的文件夹名称作为 `embedding_model_name` 参数传递给 `LLM`。\n\n    > 如果即使运行 `pip install` 时也出现 SSL 错误，请尝试以下操作：\n    >\n    > ``` sh\n    > pip install –-trusted-host pypi.org –-trusted-host files.pythonhosted.org pip_system_certs\n    > ```\n\n4.  **如何在没有互联网连接的机器上使用它？**\n\n    > 使用 `LLM.download_model` 方法将模型文件下载到 `\u003Cyour_home_directory>\u002Fonprem_data`，然后将其传输到气隙机器上的相同位置。\n\n    > 对于 `ingest` 和 `ask` 方法，您还需要下载并传输嵌入模型文件：\n    >\n    > ``` python\n    > from sentence_transformers import SentenceTransformer\n    > model = SentenceTransformer('sentence-transformers\u002Fall-MiniLM-L6-v2')\n    > model.save('\u002Fsome\u002Ffolder')\n    > ```\n\n    > 将 `some\u002Ffolder` 文件夹复制到气隙机器上，并通过 `embedding_model_name` 参数将路径提供给 `LLM`。\n\n5.  **当我调用 `llm = LLM(...)` 时，模型无法加载吗？**\n\n    > 这可能是由于模型文件损坏所致（此时应从 `\u003Chome directory>\u002Fonprem_data` 中删除并重新下载）。也可能是 `llama-cpp-python` 的版本需要升级到最新版。\n\n6.  
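针对上文第 4 条“离线/气隙机器”的说明，下面是一个示意性的完整流程（假设两台机器的主目录布局一致，模型以前文的 Zephyr-7B GGUF 为例；“已下载文件不会被重复下载”属于此处的假设）：

``` python
from onprem import LLM

url = 'https://huggingface.co/TheBloke/zephyr-7B-beta-GGUF/resolve/main/zephyr-7b-beta.Q4_K_M.gguf'

# 第一步（联网机器）：将 GGUF 模型下载到默认目录 <home>/onprem_data
# 若处于企业防火墙之后，可按前文说明附加 ssl_verify=False
LLM.download_model(url)

# 第二步：将整个 onprem_data 文件夹拷贝到气隙机器的相同位置

# 第三步（气隙机器）：用同一个 model_url 初始化；文件已在本地时按上述假设不会再联网下载
llm = LLM(model_url=url,
          prompt_template="<|system|>\n</s>\n<|user|>\n{prompt}</s>\n<|assistant|>")
```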
**我在实例化 `langchain.llms.Llamacpp` 或 `onprem.LLM` 对象时遇到 `“非法指令（核心已转储）”` 错误吗？**\n\n    > 您的 CPU 可能不支持 `cmake` 由于某种原因所使用的指令（例如，[由于 VirtualBox 设置中的 Hyper-V 导致](https:\u002F\u002Fstackoverflow.com\u002Fquestions\u002F65780506\u002Fhow-to-enable-avx-avx2-in-virtualbox-6-1-16-with-ubuntu-20-04-64bit))。您可以在构建和安装 `llama-cpp-python` 时尝试禁用这些指令：\n\n    > ``` sh\n    > # 示例\n    > CMAKE_ARGS=\"-DGGML_CUDA=ON -DGGML_AVX2=OFF -DGGML_AVX=OFF -DGGML_F16C=OFF -DGGML_FMA=OFF\" FORCE_CMAKE=1 pip install --force-reinstall llama-cpp-python --no-cache-dir\n    > ```\n\n7.  **如何加快 [`LLM.ingest`](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fllm.base.html#llm.ingest) 的速度？**\n\n    > 默认情况下，如果有可用的 GPU，将会用于计算嵌入向量，因此请确保 PyTorch 已安装并支持 GPU。您可以通过 `embedding_model_kwargs` 参数显式控制用于计算嵌入向量的设备。\n    >\n    > ``` python\n    > from onprem import LLM\n    > llm  = LLM(embedding_model_kwargs={'device':'cuda'})\n    > ```\n\n    > 您还可以向 `LLM` 提供 `store_type=\"sparse\"` 参数，以使用稀疏向量存储，这会牺牲少量推理速度（`LLM.ask`），但在摄入阶段（`LLM.ingest`）可显著提升速度。\n    >\n    > ``` python\n    > from onprem import LLM\n    > llm  = LLM(store_type=\"sparse\")\n    > ```\n    >\n    > 请注意，与密集向量存储不同，稀疏向量存储假设答案来源至少包含与问题共有的一个词。\n\n\u003C!--\n8. **OnPrem.LLM 都有哪些应用场景？**\n    > 示例包括：\n    > - 从工程文档中提取关键性能参数及其他性能属性\n    > - 自动编写对政府信息请求 (RFI) 的响应\n    > - 分析联邦采购条例 (FAR)\n    > - 研究关于网络安全的第 14028 号行政命令如何与国家网络安全战略相契合\n    > - 根据数千条评价生成课程改进建议摘要\n    > - 为人才招聘从简历中提取特定感兴趣的信息。\n&#10;-->\n\n\n\n## 引用方式\n\n在使用 **OnPrem.LLM** 时，请引用 [以下论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.21040)：\n\n    @article{maiya2025generativeaiffrdcs,\n          title={FFRDCs 的生成式 AI}, \n          author={Arun S. Maiya},\n          year={2025},\n          eprint={2509.21040},\n          archivePrefix={arXiv},\n          primaryClass={cs.CL},\n          url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2509.21040}, \n    }","# OnPrem.LLM 快速上手指南\n\nOnPrem.LLM 是一个注重隐私的 Python 工具包，专为在离线或受限环境中对敏感数据应用大语言模型（LLM）而设计。它默认支持完全本地化运行，同时也兼容 OpenAI、Anthropic 等云端模型提供商。\n\n## 1. 环境准备\n\n### 系统要求\n- **操作系统**：Linux, macOS, Windows (WSL2 推荐用于 GPU 加速)\n- **Python**：3.8+\n- **硬件**：\n  - **CPU 模式**：无特殊要求，适合资源有限的环境。\n  - **GPU 模式**：推荐使用 NVIDIA 显卡（6GB+ VRAM 可运行 8B 参数量化模型）。需安装最新的 NVIDIA 驱动和 CUDA Toolkit。\n\n### 前置依赖\n在使用 GPU 加速的 `llama-cpp-python` 后端前，请确保已安装 PyTorch（可选，视具体后端而定）及对应的编译环境。\n\n> **国内开发者提示**：建议使用国内镜像源加速 Python 包下载，例如清华源或阿里源。\n\n## 2. 安装步骤\n\n### 基础安装\n使用 pip 安装核心库：\n\n```bash\npip install onprem -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 可选组件安装\n根据您的需求选择安装以下扩展：\n\n- **启用 Chroma 向量数据库（用于 RAG 检索增强生成）**：\n  ```bash\n  pip install \"onprem[chroma]\" -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n  ```\n\n- **启用 AI Agent 功能（沙箱执行）**：\n  ```bash\n  pip install \"onprem[agent]\" -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n  ```\n\n### GPU 加速配置（可选）\n若需使用 GPU 加速本地模型推理（基于 `llama-cpp-python`），请根据操作系统执行以下命令重新构建该库：\n\n- **Linux**:\n  ```bash\n  CMAKE_ARGS=\"-DGGML_CUDA=on\" FORCE_CMAKE=1 pip install --upgrade --force-reinstall llama-cpp-python --no-cache-dir -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n  ```\n- **macOS (Apple Silicon)**:\n  ```bash\n  CMAKE_ARGS=\"-DGGML_METAL=on\" pip install llama-cpp-python -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n  ```\n- **Windows**: 建议参考官方文档在 WSL2 中配置，或使用预编译 wheel。\n\n## 3. 
基本使用\n\n### 初始化与本地模型运行\n以下示例展示如何使用 Ollama 作为后端运行本地模型（需先安装并运行 Ollama 服务）。\n\n```python\n# 安装依赖\n# !pip install onprem[chroma]\n\nfrom onprem import LLM, utils\n\n# 拉取本地模型 (需在终端先运行: ollama pull llama3.2)\n# !ollama pull llama3.2\n\n# 初始化本地 LLM (使用 Ollama 后端)\nllm = LLM('ollama\u002Fllama3.2')\n\n# 基础对话\nresult = llm.prompt('Give me a short one sentence definition of an LLM.')\nprint(result)\n```\n\n### 文档问答 (RAG)\n将本地文档导入并进行问答，无需预先计算嵌入向量（支持 SparseStore）。\n\n```python\n# 下载示例文档\nutils.download('https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2505.07672', '\u002Ftmp\u002Fmy_documents\u002Fpaper.pdf')\n\n#  ingest 文档到本地向量库\nllm.ingest('\u002Ftmp\u002Fmy_documents')\n\n# 针对文档内容提问\nresult = llm.ask('What is OnPrem.LLM?')\nprint(result)\n```\n\n### 切换至云端模型\n轻松切换至云端大模型（如 Anthropic 或 OpenAI），需配置对应 API Key。\n\n```python\n# 切换到 Anthropic Claude 模型\n# 请确保环境变量中设置了 ANTHROPIC_API_KEY\nllm = LLM(\"anthropic\u002Fclaude-3-7-sonnet-latest\")\n\nresult = llm.prompt('Summarize the benefits of local LLMs.')\nprint(result)\n```\n\n### 结构化输出\n利用 Pydantic 模型获取标准化的 JSON 输出。\n\n```python\nfrom pydantic import BaseModel, Field\n\nclass MeasuredQuantity(BaseModel):\n    value: str = Field(description=\"numerical value\")\n    unit: str = Field(description=\"unit of measurement\")\n\n# 提取结构化数据\nstructured_output = llm.pydantic_prompt('He was going 35 mph.', pydantic_model=MeasuredQuantity)\n\nprint(f\"Value: {structured_output.value}\") # 输出: 35\nprint(f\"Unit: {structured_output.unit}\")   # 输出: mph\n```\n\n### 安全运行 AI Agent\n在沙箱环境中启动 AI Agent 执行文件操作任务。\n\n```python\nfrom onprem.pipelines import AgentExecutor\n\n# 初始化 Agent (以 OpenAI 为例，需设置 OPENAI_API_KEY)\nexecutor = AgentExecutor(model='openai\u002Fgpt-4o-mini', sandbox=True)\n\nresult = executor.run(\"\"\"\nSearch this directory for all .md files and:\n1. Extract all headings (# ## ###)\n2. Count total words in each file\n3. Create an index file 'documentation_index.md' with the results.\n\"\"\")\n```","某金融合规团队需要在完全隔离的内网环境中，对数万份包含客户隐私的敏感合同文档进行自动化风险审查与关键条款提取。\n\n### 没有 onprem 时\n- 数据无法出域，团队只能放弃使用强大的云端大模型，被迫依赖准确率低的传统正则匹配或关键词搜索。\n- 若强行搭建本地开源模型，需手动配置复杂的推理后端（如 llama_cpp 或 vLLM）和向量数据库，环境部署耗时数周且极易出错。\n- 面对非结构化的合同文本，难以实现标准化的字段提取（如金额、日期），每次输出格式混乱，后续还需人工二次清洗。\n- 缺乏可视化的流程编排能力，修改审查逻辑需要重写大量代码，业务人员无法参与调整策略。\n\n### 使用 onprem 后\n- 利用 onprem 默认的本地执行模式，直接在离线服务器上运行 Llama 3.2 等模型，确保敏感数据绝不离开内网，满足最高合规要求。\n- 通过一行代码即可切换后端并自动处理文档摄入（ingest），内置的 SparseStore 模块让低配服务器也能流畅运行 RAG 检索，无需预存海量嵌入向量。\n- 借助 Pydantic 结构化输出功能，强制模型按预定 JSON 格式返回条款细节，直接对接内部数据库，消除了人工清洗环节。\n- 使用可视化工作流构建器，合规专家可通过拖拽方式调整“提取 - 分类 - 总结”的分析管道，将策略迭代周期从几天缩短至几小时。\n\nonprem 让金融机构在严守数据隐私红线的前提下，也能享受到与大厂同级的智能文档分析能力。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Famaiya_onprem_4ecfc277.png","amaiya","Arun S. 
Maiya","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Famaiya_80488726.png","computer scientist",null,"http:\u002F\u002Farun.maiya.net","https:\u002F\u002Fgithub.com\u002Famaiya",[80,84,88,92],{"name":81,"color":82,"percentage":83},"Jupyter Notebook","#DA5B0B",69.3,{"name":85,"color":86,"percentage":87},"Python","#3572A5",30.6,{"name":89,"color":90,"percentage":91},"Shell","#89e051",0.1,{"name":93,"color":94,"percentage":91},"CSS","#663399",836,55,"2026-04-09T10:29:25","Apache-2.0","Linux, macOS, Windows","非必需（支持 CPU 运行）。若需 GPU 加速：Linux\u002FWindows 需 NVIDIA 显卡及 CUDA Toolkit；Mac 需支持 Metal 的 GPU。8B 及以下量化模型最低需 6GB 显存，建议根据模型大小调整。","未说明（取决于所选模型大小，量化小模型可在低资源环境运行）",{"notes":103,"python":104,"dependencies":105},"该工具默认本地运行以保护隐私，但也支持连接云端 LLM。GPU 加速需手动编译 llama-cpp-python（Linux 需设置 GGML_CUDA=on，Mac 需 GGML_METAL=on）。Windows 用户安装 GPU 版本需参考额外文档。支持多种后端（Ollama, vLLM, HuggingFace 等），若使用这些外部服务则无需本地安装 llama-cpp-python。","未说明",[106,107,108,109,110,111,112],"torch","llama-cpp-python (可选)","chromadb (可选，用于 RAG)","transformers (可选后端)","ollama (可选后端)","vLLM (可选后端)","pydantic",[35,14,13],"2026-03-27T02:49:30.150509","2026-04-12T13:12:16.550028",[117,122,127,132,137,142],{"id":118,"question_zh":119,"answer_zh":120,"source_url":121},30610,"框架能接受的最大输入令牌数（tokens）是多少？","这取决于您与 OnPrem.LLM 一起使用的具体模型。OnPrem.LLM 的默认上下文窗口大小设置为 `n_ctx=2048`，但如果您使用支持更大上下文窗口的模型，可以增加此数值。","https:\u002F\u002Fgithub.com\u002Famaiya\u002Fonprem\u002Fissues\u002F38",{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},30611,"初始化 LLM 时遇到 'ValidationError: Could not load Llama model' 错误怎么办？","这可能是 `llama-cpp-python` 的一个已知问题。尝试在初始化 LLM 时添加 `verbose=True` 参数通常可以解决此问题。例如：`llm = LLM(use_larger=True, n_gpu_layers=35, verbose=True)`。如果问题依旧，请检查您的 `llama-cpp-python` 版本。","https:\u002F\u002Fgithub.com\u002Famaiya\u002Fonprem\u002Fissues\u002F37",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},30612,"运行文档摄入（ingest）功能时出现 'AttributeError: module 'numpy.linalg._umath_linalg' has no attribute '_ilp64'' 错误如何解决？","这通常是由于 numpy 安装问题引起的。您可以尝试卸载并重新安装 numpy。如果您是在 Google Colab 中使用 GPU 运行笔记本，请务必按照笔记本中的指示重启运行时环境。","https:\u002F\u002Fgithub.com\u002Famaiya\u002Fonprem\u002Fissues\u002F25",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},30613,"运行 ingest 函数时出现 'RuntimeError: An attempt has been made to start a new process before the current process has finished its bootstrapping phase' 错误怎么办？","如果您是在 `.py` 文件中运行代码（而非 Jupyter Notebook），需要将代码包裹在 `if __name__ == '__main__':` 保护块中。示例代码如下：\n```python\nif __name__ == '__main__':\n    from onprem import LLM\n    llm = LLM()\n    llm.ingest('.\u002Fsample_docs\u002F')\n```","https:\u002F\u002Fgithub.com\u002Famaiya\u002Fonprem\u002Fissues\u002F4",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},30614,"如何指定自定义路径来下载大语言模型？","虽然目前主要通过在 notebooks (`nbs` 文件夹) 中开发来生成代码，但该功能已被讨论并计划实现。用户可以通过传递 `model_download_path` 参数来指定路径，例如：`llm = LLM(model_download_path=\"~\u002Fmodels\")` 或 `llm.download_model(model_download_path=\"~\u002Fmodels\")`。如果不指定该参数，默认将模型存储在 `~\u002Fonprem_data`。","https:\u002F\u002Fgithub.com\u002Famaiya\u002Fonprem\u002Fissues\u002F5",{"id":143,"question_zh":144,"answer_zh":145,"source_url":146},30615,"项目是否支持 GGUF 格式的模型？如何从 GGML 迁移到 GGUF？","是的，项目支持 GGUF 格式。您只需要将 `llama-cpp-python` 更新到最新版本即可。不需要特殊的绑定，只需确保不提供默认的 GGML 模型 URL，而是提供 GGUF 模型的 URL（例如 HuggingFace 上的 TheBloke 仓库中的 GGUF 
模型）。","https:\u002F\u002Fgithub.com\u002Famaiya\u002Fonprem\u002Fissues\u002F1",[148,153,158,163,168,173,178,183,188,193,198,203,208,213,218,223,228,233,238,243],{"id":149,"version":150,"summary_zh":151,"released_at":152},222496,"v0.22.1","## 0.22.1（2026-03-24）\n\n### 新增：\n- 无\n\n### 变更：\n- 无\n\n### 修复：\n- 由于[此问题](https:\u002F\u002Fgithub.com\u002FBerriAI\u002Flitellm\u002Fissues\u002F24512)，将依赖锁定至 `litellm\u003C=1.82.6`。","2026-03-24T15:45:14",{"id":154,"version":155,"summary_zh":156,"released_at":157},222497,"v0.22.0","## 0.22.0 (2026-03-17)\n\n### 新增：\n- **`AgentExecutor`**：一个新的代理流水线，用于在沙箱环境中安全地启动代理\n\n### 变更：\n- 重构了 `pyproject.toml` 中的可选依赖项\n\n### 修复：\n- 修复了新字段无法添加到索引的问题\n","2026-03-17T16:09:05",{"id":159,"version":160,"summary_zh":161,"released_at":162},222498,"v0.21.5","## 0.21.5（2026-03-03）\n\n### 新增：\n- 无\n\n### 变更：\n- 升级到 `nbdev==3`\n\n### 修复：\n- 无","2026-03-03T23:10:25",{"id":164,"version":165,"summary_zh":166,"released_at":167},222499,"v0.21.4","## 0.21.4（2026-01-26）\n\n### 新增：\n- 无\n\n### 变更\n- 因 Set Fit 问题 (#247)，将依赖锁定至 `transformers\u003C5`\n\n### 修复：\n- 使用 `SetFit` 包实现懒加载 (#247)\n- 在 `transformers>=5` 版本中使用 `processing_class` (#172)\n- 在 HF 测试中使用显式超参数（c39b2935ec2194f70d02d670e176421d2b101e94）","2026-01-26T21:33:01",{"id":169,"version":170,"summary_zh":171,"released_at":172},222500,"v0.21.3","## 0.21.3（2026-01-22）\n\n### 新增：\n- 无\n\n### 变更\n- 无\n\n### 修复：\n- 修复 `ChatOpenAI` 在使用 gpt-5 时失败的问题\n- 修复类型注解 (#244)\n- 修复类型注解 (#246)","2026-01-22T21:34:51",{"id":174,"version":175,"summary_zh":176,"released_at":177},222501,"v0.21.2","## 0.21.2（2026-01-12）\n\n### 新增：\n- 无\n\n### 变更：\n- 无\n\n### 修复：\n- 修复了 `KVRouter` 的问题 (#243)","2026-01-12T18:22:22",{"id":179,"version":180,"summary_zh":181,"released_at":182},222502,"v0.21.1","## 0.21.1（2026-01-06）\n\n### 新增：\n- 无\n\n### 变更：\n- 无\n\n### 修复：\n- 修复并优化了 `LLM._prompt_internal` 中的 `response_format` 处理逻辑（#241）。","2026-01-07T01:53:47",{"id":184,"version":185,"summary_zh":186,"released_at":187},222503,"v0.21.0","## 0.21.0 (2026-01-06)\n\n### 新增：\n- **KVRouter**：基于元数据的[查询路由](https:\u002F\u002Famaiya.github.io\u002Fonprem\u002Fpipelines.rag.html#example-using-query-routing-with-rag)      (#238)\n- 支持使用 AWS GovCloud Bedrock 的原生结构化输出 (#240)\n\n### 变更：\n- 所有 Azure 模型均使用 LiteLLM\n- 将 RAG 功能重构为 RAGPipeline (#237)\n- 为 `KVRouter` 和 `LLM.ask` 的 `selfask=True` 选项添加了测试 (#239)\n\n### 修复：\n- 解决 RAG 重构中的问题 (#238)\n- 修复在使用 `response_format` 时回退到 `pydantic_prompt` 的问题 (ae0d9ebd)","2026-01-06T19:43:39",{"id":189,"version":190,"summary_zh":191,"released_at":192},222504,"v0.20.0","## 0.20.0 (2025-12-29)\n\n### 新增：\n- 添加了异步支持 (#233, #235)\n\n### 变更：\n- **破坏性变更**：默认分块大小增加到 1000 个字符 (#236)\n- 向 `prompt` 方法添加了 `response_format` 参数 (#232)\n\n### 修复：\n- 修复 WhooshStore 中的写锁问题 (#234)","2025-12-29T19:15:00",{"id":194,"version":195,"summary_zh":196,"released_at":197},222505,"v0.19.6","## 0.19.6（2025-12-05）\n\n### 新增：\n- 无\n\n### 变更：\n- 无\n\n### 修复：\n- 修复了 Web 应用中的快速链接\n- 在 `onprem\u002Fdocker` 文件夹中添加了 Podman 指南","2025-12-05T20:11:06",{"id":199,"version":200,"summary_zh":201,"released_at":202},222506,"v0.19.5","## 0.19.5 (2025-12-05)\r\n\r\n### new:\r\n- N\u002FA\r\n\r\n### changed\r\n- N\u002FA\r\n\r\n### fixed:\r\n- cache vectorstore instance in web app (#231)","2025-12-05T13:50:07",{"id":204,"version":205,"summary_zh":206,"released_at":207},222507,"v0.19.4","## 0.19.4 (2025-12-04)\r\n\r\n### new:\r\n- N\u002FA\r\n\r\n### changed\r\n- N\u002FA\r\n\r\n### fixed:\r\n- Removed incorrect `helpers.` prefix from `extract_filemetadata` (9ab0e2e)\r\n- Fix issue with blank tooltip 
in DocumentQA (#229)\r\n- lazy loading of SentenceTransformers (#230)\r\n- Added \"loading vectorstore\" spinner to web app","2025-12-04T22:24:49",{"id":209,"version":210,"summary_zh":211,"released_at":212},222508,"v0.19.3","## 0.19.3 (2025-11-20)\r\n\r\n### new:\r\n- N\u002FA\r\n\r\n### changed\r\n- Added `raw_text` and `chunks` parameters to `Summarizer.summarize_by_concept` (#227)\r\n- Added `raw_text` parameter to `Summarizer.summarize` (#227)\r\n\r\n### fixed:\r\n- Fixed CI errors with agents (#228)\r\n- Update deprecated Streamlit `use_container_width` parameter (#225)\r\n- Deprecate conversation chain (#226)","2025-11-20T21:27:20",{"id":214,"version":215,"summary_zh":216,"released_at":217},222509,"v0.19.2","## 0.19.2 (2025-11-05)\r\n\r\n### new:\r\n\r\n### changed\r\n- Add `get_aggregations` method to WhooshStore (#223)\r\n\r\n### fixed:\r\n- Added test for OCR\r\n- Fixed issue with filter searches in `WhooshStore` (#222)","2025-11-05T18:27:56",{"id":219,"version":220,"summary_zh":221,"released_at":222},222510,"v0.19.1","## 0.19.1 (2025-10-20)\r\n\r\n### new:\r\n- N\u002FA\r\n\r\n### changed\r\n- N\u002FA\r\n\r\n### fixed:\r\n- Resolved issue with `terms` filter in Elasticsearch stores (#221)","2025-10-20T14:02:59",{"id":224,"version":225,"summary_zh":226,"released_at":227},222511,"v0.19.0","## 0.19.0 (2025-09-25)\r\n\r\n### new:\r\n- Support for workflows\r\n- Added *Visual Workflow Builder* to web UI\r\n\r\n### changed\r\n- Added `keep_full_documents` and `max_words` parameters to `load_single_document` (#220)\r\n\r\n### fixed:\r\n- Resolved issue with `source` in Document objects being empty when `pdf_markdown=True`.","2025-09-26T02:58:58",{"id":229,"version":230,"summary_zh":231,"released_at":232},222512,"v0.18.2","## 0.18.2 (2025-08-28)\r\n\r\n### new:\r\n- N\u002FA\r\n\r\n### changed\r\n- N\u002FA\r\n\r\n### fixed:\r\n- Fix bugs in the **Document Search** and **Document Analysis** screens of web app (#219)\r\n- Some fixes and improvements to docker setup (#218)","2025-08-28T17:37:45",{"id":234,"version":235,"summary_zh":236,"released_at":237},222513,"v0.18.1","## 0.18.1 (2025-08-22)\r\n\r\n### new:\r\n- N\u002FA\r\n\r\n### changed\r\n- N\u002FA\r\n\r\n### fixed:\r\n- Fix parameter routing for local APIs like vLLM (#216)","2025-08-22T18:10:20",{"id":239,"version":240,"summary_zh":241,"released_at":242},222514,"v0.18.0","## 0.18.0 (2025-08-20)\r\n\r\n### new:\r\n- Support for AWS GovCloud (#215)\r\n\r\n### changed\r\n- N\u002FA\r\n\r\n### fixed:\r\n- N\u002FA","2025-08-20T13:46:10",{"id":244,"version":245,"summary_zh":246,"released_at":247},222515,"v0.17.2","## 0.17.2 (2025-08-15)\r\n\r\n### new:\r\n- N\u002FA\r\n\r\n### changed\r\n- N\u002FA\r\n\r\n### fixed:\r\n- updated min version to 3.10 due to `smolagents`","2025-08-15T14:13:17"]
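上面 0.19.3 的变更日志提到 `Summarizer.summarize` 新增了 `raw_text` 参数。下面是按该日志推测的用法示意（`Summarizer` 的导入路径以及“以 LLM 实例构造”的写法均为此处假设）：

``` python
from onprem import LLM
from onprem.pipelines import Summarizer   # 假设：与前文 AgentExecutor 相同的导入方式

llm = LLM(default_model='llama')
summ = Summarizer(llm)                     # 假设：Summarizer 以 LLM 实例构造

# 0.19.3 起可直接传入原始文本（raw_text），无需先写入文件
text = open('/tmp/notes.txt', encoding='utf-8').read()
print(summ.summarize(raw_text=text))
```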