[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Dicklesworthstone--swiss_army_llama":3,"tool-Dicklesworthstone--swiss_army_llama":61},[4,18,28,37,45,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":24,"last_commit_at":25,"category_tags":26,"status":17},9989,"n8n","n8n-io\u002Fn8n","n8n 是一款面向技术团队的公平代码（fair-code）工作流自动化平台，旨在让用户在享受低代码快速构建便利的同时，保留编写自定义代码的灵活性。它主要解决了传统自动化工具要么过于封闭难以扩展、要么完全依赖手写代码效率低下的痛点，帮助用户轻松连接 400 多种应用与服务，实现复杂业务流程的自动化。\n\nn8n 特别适合开发者、工程师以及具备一定技术背景的业务人员使用。其核心亮点在于“按需编码”：既可以通过直观的可视化界面拖拽节点搭建流程，也能随时插入 JavaScript 或 Python 代码、调用 npm 包来处理复杂逻辑。此外，n8n 原生集成了基于 LangChain 的 AI 能力，支持用户利用自有数据和模型构建智能体工作流。在部署方面，n8n 提供极高的自由度，支持完全自托管以保障数据隐私和控制权，也提供云端服务选项。凭借活跃的社区生态和数百个现成模板，n8n 让构建强大且可控的自动化系统变得简单高效。",184740,2,"2026-04-19T23:22:26",[16,14,13,15,27],"插件",{"id":29,"name":30,"github_repo":31,"description_zh":32,"stars":33,"difficulty_score":10,"last_commit_at":34,"category_tags":35,"status":17},10095,"AutoGPT","Significant-Gravitas\u002FAutoGPT","AutoGPT 是一个旨在让每个人都能轻松使用和构建 AI 的强大平台，核心功能是帮助用户创建、部署和管理能够自动执行复杂任务的连续型 AI 智能体。它解决了传统 AI 应用中需要频繁人工干预、难以自动化长流程工作的痛点，让用户只需设定目标，AI 即可自主规划步骤、调用工具并持续运行直至完成任务。\n\n无论是开发者、研究人员，还是希望提升工作效率的普通用户，都能从 AutoGPT 
中受益。开发者可利用其低代码界面快速定制专属智能体；研究人员能基于开源架构探索多智能体协作机制；而非技术背景用户也可直接选用预置的智能体模板，立即投入实际工作场景。\n\nAutoGPT 的技术亮点在于其模块化“积木式”工作流设计——用户通过连接功能块即可构建复杂逻辑，每个块负责单一动作，灵活且易于调试。同时，平台支持本地自托管与云端部署两种模式，兼顾数据隐私与使用便捷性。配合完善的文档和一键安装脚本，即使是初次接触的用户也能在几分钟内启动自己的第一个 AI 智能体。AutoGPT 正致力于降低 AI 应用门槛，让人人都能成为 AI 的创造者与受益者。",183572,"2026-04-20T04:47:55",[13,36,27,14,15],"语言模型",{"id":38,"name":39,"github_repo":40,"description_zh":41,"stars":42,"difficulty_score":10,"last_commit_at":43,"category_tags":44,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":46,"name":47,"github_repo":48,"description_zh":49,"stars":50,"difficulty_score":24,"last_commit_at":51,"category_tags":52,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",161692,"2026-04-20T11:33:57",[14,13,36],{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":24,"last_commit_at":59,"category_tags":60,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 
是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[14,15,13],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":77,"owner_website":78,"owner_url":79,"languages":80,"stars":93,"forks":94,"last_commit_at":95,"license":76,"difficulty_score":96,"env_os":97,"env_gpu":98,"env_ram":99,"env_deps":100,"category_tags":114,"github_topics":115,"view_count":24,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":122,"updated_at":123,"faqs":124,"releases":153},10197,"Dicklesworthstone\u002Fswiss_army_llama","swiss_army_llama","A FastAPI service for semantic text search using precomputed embeddings and advanced similarity measures, with built-in support for various file types through textract.","Swiss Army Llama 是一款基于 FastAPI 构建的高效本地大语言模型服务，旨在为开发者提供一站式的语义文本搜索与处理方案。它解决了本地部署大模型时流程繁琐、多格式文件处理困难以及重复计算资源浪费等痛点。\n\n该工具特别适合需要构建本地知识库、进行语义检索或集成 AI 能力的开发者和研究人员。用户只需简单配置，即可通过直观的 Swagger 界面调用各类功能。其核心亮点在于强大的多模态处理能力：不仅能自动解析 PDF（含 OCR 识别）、Word 文档，还能利用 Whisper 模型将音频转录为文本并生成向量嵌入。为避免重复计算，系统会自动将结果缓存至 SQLite。此外，Swiss Army Llama 引入了高性能的 Rust 库 `fast_vector_similarity`，支持多种高级相似度度量算法，并提供均值池化、SVD 分解等多种灵活的嵌入聚合策略。结合 FAISS 向量搜索技术，它能快速实现大规模数据的语义匹配，是打造本地化 AI 应用的得力助手。","# 🇨🇭🎖️🦙 Swiss Army Llama\n\n\u003Cdiv align=\"center\">\n  \u003Cimg 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDicklesworthstone_swiss_army_llama_readme_e98d25991734.webp\" width=\"500\">\n\u003C\u002Fdiv>\n\n## Introduction\n\nThe Swiss Army Llama is designed to facilitate and optimize the process of working with local LLMs by using FastAPI to expose convenient REST endpoints for various tasks, including obtaining text embeddings and completions using different LLMs via llama_cpp, as well as automating the process of obtaining all the embeddings for most common document types, including PDFs (even ones that require OCR), Word files, etc.; it even allows you to submit an audio file and automatically transcribes it with the Whisper model, cleans up the resulting text, and then computes the embeddings for it. To avoid wasting computation, these embeddings are cached in SQLite and retrieved if they have already been computed before. To speed up the process of loading multiple LLMs, optional RAM Disks can be used, and the process for creating and managing them is handled automatically for you. With a quick and easy setup process, you will immediately get access to a veritable \"Swiss Army Knife\" of LLM-related tools, all accessible via a convenient Swagger UI and ready to be integrated into your own applications with minimal fuss or configuration required.\n\nSome additional useful endpoints are provided, such as computing semantic similarity between submitted text strings. The service leverages a high-performance Rust-based library, `fast_vector_similarity`, to offer a range of similarity measures including `spearman_rho`, `kendall_tau`, `approximate_distance_correlation`, `jensen_shannon_dependency_measure`, and [`hoeffding_d`](https:\u002F\u002Fblogs.sas.com\u002Fcontent\u002Fiml\u002F2021\u002F05\u002F03\u002Fexamples-hoeffding-d.html). Additionally, semantic search across all your cached embeddings is supported using FAISS vector searching. 
You can either use the built-in cosine similarity from FAISS, or supplement this with a second pass that computes the more sophisticated similarity measures for the most relevant subset of the stored vectors found using cosine similarity (see the advanced semantic search endpoint for this functionality).\n\nAlso, we now support multiple embedding pooling methods for combining token-level embedding vectors into a single fixed-length embedding vector for any length of input text, including the following:\n   - `mean`: Mean pooling of token embeddings.\n   - `mins_maxes`: Concatenation of the minimum and maximum values of each dimension of the token embeddings.\n   - `svd`: Concatenation of the first two singular vectors obtained from the Singular Value Decomposition (SVD) of the token embeddings matrix.\n   - `svd_first_four`: Concatenation of the first four singular vectors obtained from the Singular Value Decomposition (SVD) of the token embeddings matrix.\n   - `ica`: Flattened independent components obtained from Independent Component Analysis (ICA) of the token embeddings.\n   - `factor_analysis`: Flattened factors obtained from Factor Analysis of the token embeddings.\n   - `gaussian_random_projection`: Flattened embeddings obtained from Gaussian Random Projection of the token embeddings.\n\n\nAs mentioned above, you can now submit not only plaintext and fully digital PDFs but also MS Word documents, images, and other file types supported by the textract library. The library can automatically apply OCR using Tesseract for scanned text. The returned embeddings for each sentence in a document can be organized in various formats like records, table, etc., using the Pandas to_json() function. The results can be returned either as a ZIP file containing a JSON file or as a direct JSON response. You can now also submit audio files in MP3 or WAV formats. 
The library uses OpenAI's Whisper model, as optimized by the Faster Whisper Python library, to transcribe the audio into text. Optionally, this transcript can be treated like any other document, with each sentence's embeddings computed and stored. The results are returned as a URL to a downloadable ZIP file containing a JSON with the embedding vector data.\n\nFinally, we add a new endpoint for generating multiple text completions for a given input prompt, with the ability to specify a grammar file that will enforce a particular form of response, such as JSON. There is also a useful new utility feature: a real-time application log viewer that can be accessed via a web browser, which allows for syntax highlighting and offers options for downloading the logs or copying them to the clipboard. This allows a user to watch the logs without having direct SSH access to the server.\n\n## Screenshots\n![Swiss Army Llama Swagger UI](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDicklesworthstone_swiss_army_llama_readme_42018885e515.png)\n![Swiss Army Llama Running](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDicklesworthstone_swiss_army_llama_readme_a5676429e904.png)\n\n*TLDR:* If you just want to try it very quickly on a fresh Ubuntu 22+ machine (warning, this will install Docker using apt):\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FDicklesworthstone\u002Fswiss_army_llama\ncd swiss_army_llama\nchmod +x setup_dockerized_app_on_fresh_machine.sh\nsudo .\u002Fsetup_dockerized_app_on_fresh_machine.sh\n```\n\nTo run it natively (not using Docker) in a Python venv (recommended!), you can use these commands:\n\n```bash\nsudo apt-get update\nsudo apt-get install build-essential libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig redis-server libpoppler-cpp-dev pkg-config -y\nsudo systemctl enable redis-server\nsudo systemctl start redis\ngit clone 
https:\u002F\u002Fgithub.com\u002FDicklesworthstone\u002Fswiss_army_llama\ncd swiss_army_llama\npython3 -m venv venv\nsource venv\u002Fbin\u002Factivate\npython3 -m pip install --upgrade pip\npython3 -m pip install wheel\npython3 -m pip install --upgrade setuptools wheel\npip install -r requirements.txt\npython3 swiss_army_llama.py\n```\n\nAlternatively, you can also just run the included script, which will install PyEnv if it's not already installed on your machine, and then install Python 3.12 and create a virtual environment for you. You can do everything with a single one-liner from scratch on a fresh Ubuntu machine like this:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FDicklesworthstone\u002Fswiss_army_llama && cd swiss_army_llama && chmod +x install_swiss_army_llama.sh && .\u002Finstall_swiss_army_llama.sh && pyenv local 3.12 && source venv\u002Fbin\u002Factivate && python swiss_army_llama.py\n```\n\nThen open a browser to `\u003Cyour_static_ip_address>:8089` if you're using a VPS, or to `http:\u002F\u002Flocalhost:8089` if you're running on your own machine, to get to the FastAPI Swagger page-- but, really, you should never run untrusted code with sudo on your own machine! Just get a cheap VPS to experiment with for $30\u002Fmonth.\n\nWatch the automated setup process in action [here](https:\u002F\u002Fasciinema.org\u002Fa\u002F601603).\n\n---\n\n## Features\n\n1. **Text Embedding Computation**: Utilizes pre-trained Llama3 and other LLMs via llama_cpp to generate embeddings for any provided text.\n2. **Embedding Caching**: Efficiently stores and retrieves computed embeddings in SQLite, minimizing redundant computations.\n3. 
**Advanced Similarity Measurements and Retrieval**: Utilizes the author's own `fast_vector_similarity` library written in Rust to offer highly optimized advanced similarity measures such as `spearman_rho`, `kendall_tau`, `approximate_distance_correlation`, `jensen_shannon_dependency_measure`, and `hoeffding_d`. Semantic search across cached embeddings is also supported using FAISS vector searching.\n4. **Two-Step Advanced Semantic Search**: The API first leverages FAISS and cosine similarity for rapid filtering, and then applies additional similarity measures like `spearman_rho`, `kendall_tau`, `approximate_distance_correlation`, `jensen_shannon_dependency_measure`, and `hoeffding_d` for a more nuanced comparison.\n5. **File Processing for Documents**: The library now accepts a broader range of file types including plaintext, PDFs, MS Word documents, and images. It can also handle OCR automatically. Returned embeddings for each sentence are organized in various formats like records, table, etc., using Pandas to_json() function.\n6. **Advanced Text Preprocessing**: The library now employs a more advanced sentence splitter to segment text into meaningful sentences. It handles cases where periods are used in abbreviations, domain names, or numbers and also ensures complete sentences even when quotes are used. It also takes care of pagination issues commonly found in scanned documents, such as awkward newlines and hyphenated line breaks.\n7. **Audio Transcription and Embedding**: Upload an audio file in MP3 or WAV format. The library uses OpenAI's Whisper model for transcription. Optionally, sentence embeddings can be computed for the transcript.\n8. **RAM Disk Usage**: Optionally uses RAM Disk to store models for faster access and execution. Automatically handles the creation and management of RAM Disks.\n9. **Robust Exception Handling**: Features comprehensive exception management to ensure system resilience.\n10. 
**Interactive API Documentation**: Integrates with Swagger UI for an interactive and user-friendly experience, accommodating large result sets without crashing.\n11. **Scalability and Concurrency**: Built on the FastAPI framework, handles concurrent requests and supports parallel inference with configurable concurrency levels.\n12. **Flexible Configurations**: Offers configurable settings through environment variables and input parameters, including response formats like JSON or ZIP files.\n13. **Comprehensive Logging**: Captures essential information with detailed logs, without overwhelming storage or readability.\n14. **Support for Multiple Models and Measures**: Accommodates multiple embedding models and similarity measures, allowing flexibility and customization based on user needs.\n15. **Ability to Generate Multiple Completions using Specified Grammar**: Get back structured LLM completions for a specified input prompt.\n16. **Real-Time Log File Viewer in Browser**: Lets anyone with access to the API server conveniently watch the application logs to gain insight into the execution of their requests.\n17. 
**Uses Redis for Request Locking**: Uses Redis to allow for multiple Uvicorn workers to run in parallel without conflicting with each other.\n\n## Demo Screen Recording in Action\n[Here](https:\u002F\u002Fasciinema.org\u002Fa\u002F39dZ8vv9nkcNygasUl35wnBPq) is the live console output while I interact with it from the Swagger page to make requests.\n\n---\n\n## Requirements\n\nSystem requirements for running the application (to support all the file types handled by textract):\n\n```bash\nsudo apt-get update\nsudo apt-get install libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig -y\n```\n\nPython Requirements:\n\n```bash\naioredis\naioredlock\naiosqlite\napscheduler\nfaiss-cpu\nfast_vector_similarity\nfastapi\nfaster-whisper\nfilelock\nhttpx\nllama-cpp-python\nmagika\nmutagen\nnvgpu\npandas\npillow\npsutil\npydantic\nPyPDF2\npytest\npython-decouple\npython-multipart\npytz\nredis\nruff\nscikit-learn\nscipy\nsqlalchemy\ntextract-py3\nuvicorn\nuvloop\nzstandard\n```\n\n## Running the Application\n\nYou can run the application using the following command:\n\n```bash\npython swiss_army_llama.py\n```\n\nThe server will start on `0.0.0.0` at the port defined by the `SWISS_ARMY_LLAMA_SERVER_LISTEN_PORT` variable.\n\nAccess the Swagger UI:\n\n```\nhttp:\u002F\u002Flocalhost:\u003CSWISS_ARMY_LLAMA_SERVER_LISTEN_PORT>\n```\n\n## Configuration\n\nYou can configure the service easily by editing the included `.env` file. Here's a list of available configuration options:\n\n- `USE_SECURITY_TOKEN`: Whether to use a hardcoded security token. (e.g., `1`)\n- `USE_PARALLEL_INFERENCE_QUEUE`: Use parallel processing. (e.g., `1`)\n- `MAX_CONCURRENT_PARALLEL_INFERENCE_TASKS`: Maximum number of parallel inference tasks. (e.g., `30`)\n- `DEFAULT_MODEL_NAME`: Default model name to use. (e.g., `Llama-3-8B-Instruct-64k`)\n- `LLM_CONTEXT_SIZE_IN_TOKENS`: Context size in tokens for LLM. 
(e.g., `512`)\n- `SWISS_ARMY_LLAMA_SERVER_LISTEN_PORT`: Port number for the service. (e.g., `8089`)\n- `UVICORN_NUMBER_OF_WORKERS`: Number of workers for Uvicorn. (e.g., `2`)\n- `MINIMUM_STRING_LENGTH_FOR_DOCUMENT_EMBEDDING`: Minimum string length for document embedding. (e.g., `15`)\n- `MAX_RETRIES`: Maximum retries for locked database. (e.g., `10`)\n- `DB_WRITE_BATCH_SIZE`: Database write batch size. (e.g., `25`)\n- `RETRY_DELAY_BASE_SECONDS`: Retry delay base in seconds. (e.g., `1`)\n- `JITTER_FACTOR`: Jitter factor for retries. (e.g., `0.1`)\n- `USE_RAMDISK`: Use RAM disk. (e.g., `1`)\n- `RAMDISK_PATH`: Path to the RAM disk. (e.g., `\"\u002Fmnt\u002Framdisk\"`)\n- `RAMDISK_SIZE_IN_GB`: RAM disk size in GB. (e.g., `40`)\n\n## Contributing\n\nIf you'd like to contribute to the project, please submit a pull request! Seriously, I'd love to get some more community going so we can make this a standard library!\n\n## License\n\nThis project is licensed under the MIT License.\n\n## Some Llama Knife Images I found on Google\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDicklesworthstone_swiss_army_llama_readme_c94bc0d470f1.webp\" width=\"500\" style=\"margin-right: 10px;\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDicklesworthstone_swiss_army_llama_readme_bc38cb57de5a.jpg\" width=\"500\">\n\u003C\u002Fp>\n\n---\n\n## Setup and Configuration\n\n### RAM Disk Configuration\n\nTo enable password-less sudo for RAM Disk setup and teardown, edit the `sudoers` file with `sudo visudo`. Add the following lines, replacing `username` with your actual username:\n\n```plaintext\nusername ALL=(ALL) NOPASSWD: \u002Fbin\u002Fmount -t tmpfs -o size=*G tmpfs \u002Fmnt\u002Framdisk\nusername ALL=(ALL) NOPASSWD: \u002Fbin\u002Fumount \u002Fmnt\u002Framdisk\n```\n\nThe application provides functionalities to set up, clear, and manage RAM Disk. RAM Disk is used to store models in memory for faster access. 
It calculates the available RAM and sets up the RAM Disk accordingly. The functions `setup_ramdisk`, `copy_models_to_ramdisk`, and `clear_ramdisk` manage these tasks.\n\n## API Endpoints\n\nThe following endpoints are available:\n\n- **GET `\u002Fget_list_of_available_model_names\u002F`**: Retrieve Available Model Names. Retrieves the list of available model names for generating embeddings.\n- **GET `\u002Fget_all_stored_strings\u002F`**: Retrieve All Strings. Retrieves a list of all stored strings from the database for which embeddings have been computed.\n- **GET `\u002Fget_all_stored_documents\u002F`**: Retrieve All Stored Documents. Retrieves a list of all stored documents from the database for which embeddings have been computed.\n- **GET `\u002Fshow_logs\u002F`**:  Shows logs for the last 5 minutes by default. Can also provide a parameter like this: `\u002Fshow_logs\u002F{minutes}` to get the last N minutes of log data.\n- **POST `\u002Fadd_new_model\u002F`**: Add New Model by URL. Submit a new model URL for download and use. The model must be in `.gguf` format and larger than 100 MB to ensure it's a valid model file (you can directly paste in the Huggingface URL)\n- **POST `\u002Fget_embedding_vector_for_string\u002F`**: Retrieve Embedding Vector for a Given Text String. Retrieves the embedding vector for a given input text string using the specified model.\n- **POST `\u002Fcompute_similarity_between_strings\u002F`**: Compute Similarity Between Two Strings. Leverages the `fast_vector_similarity` library to compute the similarity between two given input strings using specified model embeddings and a selected similarity measure.\n- **POST `\u002Fsearch_stored_embeddings_with_query_string_for_semantic_similarity\u002F`**: Get Most Similar Strings from Stored Embeddings in Database. 
Find the most similar strings in the database to the given input \"query\" text.\n- **POST `\u002Fadvanced_search_stored_embeddings_with_query_string_for_semantic_similarity\u002F`**: Perform a two-step advanced semantic search. First uses FAISS and cosine similarity to narrow down the most similar strings, then applies additional similarity measures for refined comparison.\n- **POST `\u002Fget_all_embedding_vectors_for_document\u002F`**: Get Embeddings for a Document. Extract text embeddings for a document. This endpoint supports plain text, .doc\u002F.docx (MS Word), PDF files, images (using Tesseract OCR), and many other file types supported by the textract library.\n- **POST `\u002Fcompute_transcript_with_whisper_from_audio\u002F`**: Transcribe and Embed Audio using Whisper and LLM. This endpoint accepts an audio file and optionally computes document embeddings. The transcription and embeddings are stored, and a ZIP file containing the embeddings can be downloaded.\n- **POST `\u002Fget_text_completions_from_input_prompt\u002F`**: Get back multiple completions from the specified LLM model, with the ability to specify a grammar file which will enforce a particular format of the response, such as JSON. \n- **POST `\u002Fclear_ramdisk\u002F`**: Clear Ramdisk Endpoint. Clears the RAM Disk if it is enabled.\n\nFor detailed request and response schemas, please refer to the Swagger UI available at the root URL or the section at the end of this `README`.\n\n## Exception Handling\n\nThe application has robust exception handling to deal with various types of errors, including database errors and general exceptions. Custom exception handlers are defined for `SQLAlchemyError` and general `Exception`.\n\n## Logging\n\nLogging is configured at the INFO level to provide detailed logs for debugging and monitoring. 
The logger provides information about the state of the application, errors, and activities.\n\nThe logs are stored in a file named `swiss_army_llama.log`, and a log rotation mechanism is implemented to handle log file backups. The rotating file handler is configured with a maximum file size of 10 MB, and it keeps up to 5 backup files.\n\nWhen a log file reaches its maximum size, it is moved to the `old_logs` directory, and a new log file is created. The log entries are also printed to the standard output stream.\n\nHere are some details of the logging configuration:\n\n- Log Level: INFO\n- Log Format: `%(asctime)s - %(levelname)s - %(message)s`\n- Max Log File Size: 10 MB\n- Backup Count: 5\n- Old Logs Directory: `old_logs`\n\nAdditionally, the log level for SQLAlchemy's engine is set to WARNING to suppress verbose database logs.\n\n## Database Structure\n\nThe application uses a SQLite database via SQLAlchemy ORM. Here are the data models used, which can be found in the `embeddings_data_models.py` file:\n\n### TextEmbedding Table\n\n- `id`: Primary Key\n- `text`: Text for which the embedding was computed\n- `text_hash`: Hash of the text, computed using SHA3-256\n- `embedding_pooling_method`: The method used to pool the embeddings\n- `embedding_hash`: Hash of the computed embedding\n- `llm_model_name`: Model used to compute the embedding\n- `corpus_identifier_string`: An optional string identifier for grouping embeddings into a specific corpus\n- `embedding_json`: The computed embedding in JSON format\n- `ip_address`: Client IP address\n- `request_time`: Timestamp of the request\n- `response_time`: Timestamp of the response\n- `total_time`: Total time taken to process the request\n- `document_file_hash`: Foreign Key referencing the DocumentEmbedding table\n- `document`: Relationship with DocumentEmbedding\n\n### DocumentEmbedding Table\n\n- `id`: Primary Key\n- `document_hash`: Foreign Key referencing the Documents table\n- `filename`: Name of the document file\n- 
`mimetype`: MIME type of the document file\n- `document_file_hash`: Hash of the file\n- `embedding_pooling_method`: The method used to pool the embeddings\n- `llm_model_name`: Model used to compute the embedding\n- `corpus_identifier_string`: An optional string identifier for grouping documents into a specific corpus\n- `file_data`: Binary data of the original file\n- `sentences`: The extracted sentences from the document\n- `document_embedding_results_json_compressed_binary`: The computed embedding results in JSON format compressed with Z-standard compression\n- `ip_address`: Client IP address\n- `request_time`: Timestamp of the request\n- `response_time`: Timestamp of the response\n- `total_time`: Total time taken to process the request\n- `embeddings`: Relationship with TextEmbedding\n- `document`: Relationship with Document\n\n### Document Table\n\n- `id`: Primary Key\n- `llm_model_name`: Model name associated with the document\n- `corpus_identifier_string`: An optional string identifier for grouping documents into a specific corpus\n- `document_hash`: Computed Hash of the document\n- `document_embeddings`: Relationship with DocumentEmbedding\n\n### AudioTranscript Table\n\n- `audio_file_hash`: Primary Key\n- `audio_file_name`: Name of the audio file\n- `audio_file_size_mb`: File size in MB\n- `segments_json`: Transcribed segments as JSON\n- `combined_transcript_text`: Combined transcript text\n- `combined_transcript_text_list_of_metadata_dicts`: List of metadata dictionaries for each segment of the combined transcript\n- `info_json`: Transcription info as JSON\n- `ip_address`: Client IP address\n- `request_time`: Timestamp of the request\n- `response_time`: Timestamp of the response\n- `total_time`: Total time taken to process the request\n- `corpus_identifier_string`: An optional string identifier for grouping transcripts into a specific corpus\n\n### Database Relationships\n\n1. 
**TextEmbedding - DocumentEmbedding**:\n   - `TextEmbedding` has a Foreign Key `document_file_hash` that references `DocumentEmbedding`'s `document_file_hash`.\n   - This means multiple text embeddings can belong to a single document embedding, establishing a one-to-many relationship.\n  \n2. **DocumentEmbedding - Document**:\n   - `DocumentEmbedding` has a Foreign Key `document_hash` that references `Document`'s `document_hash`.\n   - This establishes a one-to-many relationship between `Document` and `DocumentEmbedding`.\n\n3. **AudioTranscript**:  \n   - This table doesn't have a direct relationship with other tables based on the given code.\n\n4. **Request\u002FResponse Models**:  \n   - These are not directly related to the database tables but are used for handling API requests and responses.\n   - The following Pydantic models are used for request and response validation:\n     - EmbeddingRequest\n     - SimilarityRequest\n     - SemanticSearchRequest\n     - SemanticSearchResponse\n     - AdvancedSemanticSearchRequest\n     - AdvancedSemanticSearchResponse\n     - EmbeddingResponse\n     - SimilarityResponse\n     - AllStringsResponse\n     - AllDocumentsResponse\n     - TextCompletionRequest\n     - TextCompletionResponse\n     - ImageQuestionResponse\n     - AudioTranscriptResponse\n     - ShowLogsIncrementalModel\n     - AddGrammarRequest\n     - AddGrammarResponse\n\nFor detailed field descriptions and validations, please refer to the `embeddings_data_models.py` file.\n\n## Performance Optimizations\n\nThis section highlights the major performance enhancements integrated into the provided code to ensure swift responses and optimal resource management.\n\n### 1. **Asynchronous Programming**:\n\n- **Benefit**: Handles multiple tasks concurrently, enhancing efficiency for I\u002FO-bound operations like database transactions and network requests.\n- **Implementation**: Utilizes Python's `asyncio` library for asynchronous database operations.\n\n### 2. 
**Database Optimizations**:\n\n- **Write-Ahead Logging (WAL) Mode**: Enables concurrent reads and writes, optimizing for applications with frequent write demands.\n- **Retry Logic with Exponential Backoff**: Manages locked databases by retrying operations with progressive waiting times.\n- **Batch Writes**: Aggregates write operations for more efficient database interactions.\n- **DB Write Queue**: Uses an asynchronous queue to serialize write operations, ensuring consistent and non-conflicting database writes.\n\n### 3. **RAM Disk Utilization**:\n\n- **Benefit**: Speeds up I\u002FO-bound tasks by prioritizing operations in RAM over disk.\n- **Implementation**: Detects and prioritizes a RAM disk (`\u002Fmnt\u002Framdisk`) if available, otherwise defaults to the standard file system.\n\n### 4. **Model Caching**:\n\n- **Benefit**: Reduces overhead by keeping loaded models in memory for subsequent requests.\n- **Implementation**: Uses a global `model_cache` dictionary to store and retrieve models.\n\n### 5. **Parallel Inference**:\n\n- **Benefit**: Enhances processing speed for multiple data units, like document sentences.\n- **Implementation**: Employs `asyncio.gather` for concurrent inferences, regulated by a semaphore (`MAX_CONCURRENT_PARALLEL_INFERENCE_TASKS`).\n\n### 6. **Embedding Caching**:\n\n- **Benefit**: Once embeddings are computed for a particular text, they are stored in the database, eliminating the need for re-computation during subsequent requests.\n- **Implementation**: When a request is made to compute an embedding, the system first checks the database. If the embedding for the given text is found, it is returned immediately, ensuring faster response times.\n\n---\n\n### Dockerized Version\n\nA bash script is included in this repo, `setup_dockerized_app_on_fresh_machine.sh`, that will automatically do everything for you, including installing docker with apt install. 
\n\nTo use it, first make the script executable and then run it like this:\n\n```bash\nchmod +x setup_dockerized_app_on_fresh_machine.sh\nsudo .\u002Fsetup_dockerized_app_on_fresh_machine.sh\n```\n\nIf you prefer a manual setup, then read the following instructions:\n\n#### Prerequisites\n\nEnsure that you have Docker installed on your system. If not, follow these steps to install Docker on Ubuntu:\n\n```bash\nsudo apt-get update\nsudo apt-get install docker.io\nsudo systemctl start docker\nsudo docker --version\nsudo usermod -aG docker $USER\n```\n\nYou may need to log out and log back in or restart your system to apply the new group permissions, or use sudo in the following steps to build and run the container.\n\n#### Setup and Running the Application\n\n1. **Clone the Repository:**\n\n   Clone the Swiss Army Llama repository to your local machine:\n\n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002FDicklesworthstone\u002Fswiss_army_llama\n   cd swiss_army_llama\n   ```\n\n2. **Build the Docker Image:**\n\n   Build the Docker image using the provided Dockerfile:\n\n   ```bash\n   sudo docker build -t llama-embeddings .\n   ```\n\n3. **Run the Docker Container:**\n\n   Run the Docker container, mapping the container's port 8089 to the host's port 8089:\n\n   ```bash\n   sudo docker run -p 8089:8089 llama-embeddings\n   ```\n\n4. **Accessing the Application:**\n\n   The FastAPI application will now be accessible at `http:\u002F\u002Flocalhost:8089` or at the static IP address of your VPS instance if you're running on one (You can get a 10-core, 30gb RAM, 1tb SSD with a static IP running Ubuntu 22.04 at Contabo for around $30\u002Fmonth, which is the cheapest I've found so far).\n\n   You can interact then with the API using tools like `curl` or by accessing the FastAPI documentation at `http:\u002F\u002Flocalhost:8089\u002Fdocs`.\n\n5. 
**Viewing Logs:**\n\n   Logs from the application can be viewed directly in the terminal where you ran the `docker run` command.\n\n#### Stopping and Managing the Container\n\n- To stop the running container, press `Ctrl+C` in the terminal or find the container ID using `docker ps` and run `sudo docker stop \u003Ccontainer_id>`.\n- To remove the built image, use `sudo docker rmi llama-embeddings`.\n\n---\n\n## Startup Procedures\n\nDuring startup, the application performs the following tasks:\n\n1. **Database Initialization**:\n    - The application initializes the SQLite database, setting up tables and executing important PRAGMAs to optimize performance. \n    - Some of the important SQLite PRAGMAs include setting the database to use Write-Ahead Logging (WAL) mode, setting synchronous mode to NORMAL, increasing cache size to 1GB, setting the busy timeout to 2 seconds, and setting the WAL autocheckpoint to 100.\n2. **Initialize Database Writer**:\n    - A dedicated database writer (`DatabaseWriter`) is initialized with a dedicated asynchronous queue to handle the write operations.\n    - A set of hashes is created which represents the operations that are currently being processed or have already been processed. This avoids any duplicate operations in the queue.\n3. **RAM Disk Setup**:\n    - If the `USE_RAMDISK` variable is enabled and the user has the required permissions, the application sets up a RAM Disk.\n    - The application checks if there's already a RAM Disk set up at the specified path, if not, it calculates the optimal size for the RAM Disk and sets it up.\n    - If the RAM Disk is enabled but the user lacks the required permissions, the RAM Disk feature is disabled and the application proceeds without it.\n4. **Model Downloads**:\n    - The application downloads the required models.\n5. **Model Loading**:\n    - Each downloaded model is loaded into memory. If any model file is not found, an error log is recorded.\n6. 
**Build FAISS Indexes**:\n    - The application creates FAISS indexes for efficient similarity search using the embeddings from the database.\n    - Associated texts are stored by model name for further use.\n\nNote: \n- If the RAM Disk feature is enabled but the user lacks the required permissions, the application will disable the RAM Disk feature and proceed without it.\n- For any database operations, if the database is locked, the application will attempt to retry the operation a few times with an exponential backoff and a jitter.\n\n---\n\n## Endpoint Functionality and Workflow Overview\n\nHere's a detailed breakdown of the main endpoints provided by the FastAPI server, explaining their functionality, input parameters, and how they interact with underlying models and systems:\n\n### 1. `\u002Fget_embedding_vector_for_string\u002F` (POST)\n\n#### Purpose\nRetrieve the embedding vector for a given input text string using the specified model.\n\n#### Parameters\n- `text`: The input text for which the embedding vector is to be retrieved.\n- `model_name`: The model used to calculate the embedding (optional, will use the default model if not provided).\n- `token`: Security token (optional).\n- `client_ip`: Client IP address (optional).\n\n#### Workflow\n1. **Retrieve Embedding**: The function retrieves or computes the embedding vector for the provided text using the specified or default model.\n2. **Return Result**: The embedding vector for the input text string is returned in the response.\n\n### 2. `\u002Fcompute_similarity_between_strings\u002F` (POST)\n\n#### Purpose\nCompute the similarity between two given input strings using specified model embeddings and a selected similarity measure.\n\n#### Parameters\n- `text1`: The first input text.\n- `text2`: The second input text.\n- `llm_model_name`: The model used to calculate embeddings (optional).\n- `similarity_measure`: The similarity measure to be used. 
Supported measures include `all`, `spearman_rho`, `kendall_tau`, `approximate_distance_correlation`, `jensen_shannon_dependency_measure`, and `hoeffding_d` (optional, default is `all`).\n\n#### Workflow\n1. **Retrieve Embeddings**: The embeddings for `text1` and `text2` are retrieved or computed using the specified or default model.\n2. **Compute Similarity**: The similarity between the two embeddings is calculated using the specified similarity measure.\n3. **Return Result**: The similarity score, along with the embeddings and input texts, is returned in the response.\n\n### 3. `\u002Fsearch_stored_embeddings_with_query_string_for_semantic_similarity\u002F` (POST)\n\n#### Purpose\nFind the most similar strings in the database to the given input \"query\" text. This endpoint uses a pre-computed FAISS index to quickly search for the closest matching strings.\n\n#### Parameters\n- `query_text`: The input text for which to find the most similar string.\n- `model_name`: The model used to calculate embeddings.\n- `number_of_most_similar_strings_to_return`: (Optional) The number of most similar strings to return, defaults to 10.\n- `token`: Security token (optional).\n\n#### Workflow\n1. **Search FAISS Index**: The FAISS index, built on stored embeddings, is searched to find the most similar embeddings to the `query_text`.\n2. **Return Result**: The most similar strings found in the database, along with the similarity scores, are returned in the response.\n\n### 4. `\u002Fadvanced_search_stored_embeddings_with_query_string_for_semantic_similarity\u002F` (POST)\n\n#### Purpose\nPerforms a two-step advanced semantic search. 
Utilizes FAISS and cosine similarity for initial filtering, followed by additional similarity measures for refined comparison.

#### Parameters
- `query_text`: The input text for which to find the most similar strings.
- `llm_model_name`: The model used to calculate embeddings.
- `similarity_filter_percentage`: (Optional) Fraction of the stored embeddings to retain in the cosine-similarity filtering step; defaults to 0.02 (i.e., the top 2%).
- `number_of_most_similar_strings_to_return`: (Optional) Number of most similar strings to return after applying the second set of similarity measures; defaults to 10.

#### Workflow
1. **Initial Filtering**: Use FAISS and cosine similarity to find a set of similar strings.
2. **Refined Comparison**: Apply additional similarity measures to the filtered set.
3. **Return Result**: Return the most similar strings along with their multiple similarity scores.

#### Example Request
```json
{
  "query_text": "Find me the most similar string!",
  "llm_model_name": "openchat_v3.2_super",
  "similarity_filter_percentage": 0.02,
  "number_of_most_similar_strings_to_return": 5
}
```

### 5. `/get_all_embedding_vectors_for_document/` (POST)

#### Purpose
Extract text embeddings for a document. The library now supports a wide range of file types, including plain text, .doc/.docx, PDF files, images (via Tesseract OCR), and many other types supported by the `textract` library.

#### Parameters
- `file`: The uploaded document file (plain text, .doc/.docx, PDF, etc.).
- `llm_model_name`: (Optional) The model used to calculate embeddings.
- `json_format`: (Optional) The format of the JSON response.
- `send_back_json_or_zip_file`: Whether to return a JSON file or a ZIP file containing the embeddings file (optional; defaults to `zip`).
- `token`: Security token (optional).

### 6. `/compute_transcript_with_whisper_from_audio/` (POST)

#### Purpose
Transcribe an audio file and optionally compute document embeddings for the resulting transcript. This endpoint uses the Whisper model for transcription and a language model for generating embeddings. The transcription and embeddings can then be stored, and a ZIP file containing the embeddings can be made available for download.

#### Parameters
- `file`: The audio file to upload for transcription.
- `compute_embeddings_for_resulting_transcript_document`: Boolean indicating whether document embeddings should be computed (optional; defaults to False).
- `llm_model_name`: The language model used for computing embeddings (optional; defaults to the default model name).
- `req`: HTTP request object for additional request metadata (optional).
- `token`: Security token (optional).
- `client_ip`: Client IP address (optional).

#### Request File and Parameters
You will need to use a multipart/form-data request to upload the audio file. Additional parameters such as `compute_embeddings_for_resulting_transcript_document` and `llm_model_name` can be sent along as form fields.

#### Example Request
```bash
curl -X 'POST' \
  'http://localhost:8089/compute_transcript_with_whisper_from_audio/' \
  -H 'accept: application/json' \
  -H 'Authorization: Bearer YOUR_ACCESS_TOKEN' \
  -F 'file=@your_audio_file.wav' \
  -F 'compute_embeddings_for_resulting_transcript_document=true' \
  -F 'llm_model_name=custom-llm-model'
```

### 7. `/get_text_completions_from_input_prompt/` (POST)

#### Purpose
Generate text completions for a given input prompt using the specified model.

#### Parameters
- `request`: A JSON object containing options such as `input_prompt`, `llm_model_name`, etc.
- `token`: Security token (optional).
- `req`: HTTP request object (optional).
- `client_ip`: Client IP address (optional).

#### Request JSON Format
The JSON object should have the following keys:
- `input_prompt`
- `llm_model_name`
- `temperature`
- `grammar_file_string`
- `number_of_completions_to_generate`
- `number_of_tokens_to_generate`

#### Example Request
```json
{
  "input_prompt": "The Kings of France in the 17th Century:",
  "llm_model_name": "phind-codellama-34b-python-v1",
  "temperature": 0.95,
  "grammar_file_string": "json",
  "number_of_tokens_to_generate": 500,
  "number_of_completions_to_generate": 3
}
```

### 8. `/get_list_of_available_model_names/` (GET)

#### Purpose
Retrieve the list of available model names for generating embeddings.

#### Parameters
- `token`: Security token (optional).

### 9. `/get_all_stored_strings/` (GET)

#### Purpose
Retrieve a list of all stored strings from the database for which embeddings have been computed.

#### Parameters
- `token`: Security token (optional).

### 10. `/get_all_stored_documents/` (GET)

#### Purpose
Retrieve a list of all stored documents from the database for which embeddings have been computed.

#### Parameters
- `token`: Security token (optional).

### 11. `/clear_ramdisk/` (POST)

#### Purpose
Clear the RAM Disk to free up memory.

#### Parameters
- `token`: Security token (optional).

### 12. `/download/{file_name}` (GET)

#### Purpose
Download a ZIP file containing document embeddings that were generated through the `/compute_transcript_with_whisper_from_audio/` endpoint. The URL for this download is supplied in the JSON response of the audio transcription endpoint.

#### Parameters
- `file_name`: The name of the ZIP file that you want to download.

### 13. `/add_new_model/` (POST)

#### Purpose
Submit a new model URL for download and use. The model must be in `.gguf` format and larger than 100 MB to ensure it's a valid model file.

#### Parameters
- `model_url`: The URL of the model weight file, which must end with `.gguf`.
- `token`: Security token (optional).

### Token-Level Embedding Vector Pooling

Pooling methods aggregate token-level embeddings, which are variable in length because sentences and documents contain differing numbers of tokens. By converting these token-level embeddings into a single, fixed-length vector, we ensure that each input text is represented consistently, regardless of its length. This fixed-length vector can then be used in machine learning models that require inputs of a consistent size.

The primary goal of these pooling methods is to retain as much useful information as possible from the original token-level embeddings while ensuring that the transformation is deterministic and does not distort the data. Each method applies a different statistical or mathematical technique to summarize the token embeddings.

#### Explanation of Pooling Methods

1. **SVD (Singular Value Decomposition)**:
   - **How it works**: Concatenates the first two singular vectors obtained from the SVD of the token embeddings matrix.
   - **Rationale**: SVD is a dimensionality-reduction technique that captures the most important features of the data. Using the first two singular vectors provides a compact representation that retains significant information.

2. **SVD_First_Four**:
   - **How it works**: Uses the first four singular vectors obtained from the SVD of the token embeddings matrix.
   - **Rationale**: By using more singular vectors, this method captures more of the variance in the data, providing a richer representation while still reducing dimensionality.

3. **ICA (Independent Component Analysis)**:
    - **How it works**: Applies ICA to the embeddings matrix to find statistically independent components, then flattens the result.
    - **Rationale**: ICA is useful for identifying independent sources in the data, providing a representation that highlights these independent features.

4. **Factor_Analysis**:
    - **How it works**: Applies factor analysis to the embeddings matrix to identify underlying factors, then flattens the result.
    - **Rationale**: Factor analysis models the data in terms of latent factors, providing a summary that captures these underlying influences.

5. **Gaussian_Random_Projection**:
    - **How it works**: Applies a Gaussian random projection to reduce the dimensionality of the embeddings, then flattens the result.
    - **Rationale**: This method provides a fast and efficient way to reduce dimensionality while approximately preserving pairwise distances between points, which is useful for large datasets.

---

Thanks for your interest in my open-source project! I hope you find it useful.
You might also find my commercial web apps useful, and I would really appreciate it if you checked them out:

**[YoutubeTranscriptOptimizer.com](https://youtubetranscriptoptimizer.com)** makes it quick and easy to paste in a YouTube video URL and have it automatically generate not just a highly accurate direct transcription, but also a polished, beautifully formatted written document that can be used independently of the video.

The document sticks to the same material as discussed in the video, but it reads much more like a real piece of writing than a transcript. It also lets you optionally generate quizzes based on the contents of the document, in either multiple-choice or short-answer form; the multiple-choice quizzes get turned into interactive HTML files that can be hosted and easily shared, so you can actually take the quiz and have your answers graded and scored for you.

**[FixMyDocuments.com](https://fixmydocuments.com/)** lets you submit any kind of document (PDFs, including scanned PDFs that require OCR; MS Word and PowerPoint files; images; audio files such as mp3 and m4a) and turn it into a highly optimized version with nice markdown formatting, from which HTML and PDF versions are automatically generated.
Once converted, you can also edit them directly in the site using the built-in markdown editor, where it saves a running revision history and regenerates the PDF\u002FHTML versions.\n\nIn addition to just getting the optimized version of the document, you can also generate many other kinds of \"derived documents\" from the original: interactive multiple-choice quizzes that you can actually take and get graded on; slick looking presentation slides as PDF or HTML (using LaTeX and Reveal.js), an in-depth summary, a concept mind map (using Mermaid diagrams) and outline, custom lesson plans where you can select your target audience, a readability analysis and grade-level versions of your original document (good for simplifying concepts for students), Anki Flashcards that you can import directly into the Anki app or use on the site in a nice interface, and more.\n","# 🇨🇭🎖️🦙 瑞士军刀骆驼\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDicklesworthstone_swiss_army_llama_readme_e98d25991734.webp\" width=\"500\">\n\u003C\u002Fdiv>\n\n## 简介\n\n瑞士军刀骆驼旨在通过使用 FastAPI 暴露便捷的 REST API 端点，来简化和优化本地大语言模型（LLM）的工作流程。这些端点支持多种任务，包括利用 llama_cpp 获取不同 LLM 的文本嵌入和补全；同时还能自动处理大多数常见文档类型的嵌入提取，例如 PDF（甚至需要 OCR 处理的文件）、Word 文档等。此外，它还允许用户上传音频文件，自动使用 Whisper 模型进行转录，清理生成的文本，并计算其嵌入向量。为避免重复计算，这些嵌入会被缓存到 SQLite 数据库中，若之前已计算过则直接从缓存中读取。为了加快加载多个 LLM 的速度，还可以选择使用 RAM 盘，而创建和管理 RAM 盘的过程将由系统自动完成。凭借快速简便的部署流程，您将立即获得一个功能强大的“瑞士军刀”式 LLM 工具集，所有功能均可通过便捷的 Swagger UI 访问，并且只需极少的配置即可轻松集成到您自己的应用中。\n\n此外，还提供了一些实用的额外端点，例如计算提交的文本字符串之间的语义相似度。该服务基于高性能的 Rust 库 `fast_vector_similarity`，支持多种相似度度量方法，包括 Spearman 秩相关系数、Kendall 秩相关系数、近似距离相关性、Jensen-Shannon 相关性度量以及 Hoeffding D 统计量。同时，借助 FAISS 向量检索技术，还支持对所有缓存嵌入进行语义搜索。您可以选择使用 FAISS 内置的余弦相似度，也可以在第一次筛选出最相关的向量后，再进行第二轮更复杂的相似度计算（具体功能请参见高级语义搜索端点）。\n\n另外，我们还支持多种嵌入池化方法，用于将标记级别的嵌入向量组合成固定长度的单个嵌入向量，适用于任意长度的输入文本，包括以下几种：\n   - `mean`：标记嵌入的平均池化。\n   - `mins_maxes`：将每个维度的最小值和最大值拼接起来。\n   - `svd`：将标记嵌入矩阵的奇异值分解得到的前两个奇异向量拼接起来。\n   - 
`svd_first_four`：将标记嵌入矩阵的奇异值分解得到的前四个奇异向量拼接起来。\n   - `ica`：对标记嵌入进行独立成分分析后，展开得到的独立成分。\n   - `factor_analysis`：对标记嵌入进行因子分析后，展开得到的因子。\n   - `gaussian_random_projection`：对标记嵌入进行高斯随机投影后，展开得到的嵌入。\n\n如上所述，现在不仅可以提交纯文本和完全数字化的 PDF 文件，还可以上传 MS Word 文档、图片以及其他由 textract 库支持的文件类型。对于扫描件中的文字，该库会自动使用 Tesseract 进行 OCR 处理。文档中每句话的嵌入向量可以通过 Pandas 的 `to_json()` 函数以记录、表格等多种格式组织起来。结果可以以包含 JSON 文件的 ZIP 压缩包形式返回，也可以直接以 JSON 格式响应。此外，现在还支持上传 MP3 或 WAV 格式的音频文件。该库使用 OpenAI 的 Whisper 模型，并结合 Faster Whisper Python 库进行优化，将音频转录为文本。用户还可以选择将转录后的文本当作普通文档处理，为其每句话计算并存储嵌入向量。最终结果将以可下载 ZIP 文件的 URL 形式返回，其中包含带有嵌入向量数据的 JSON 文件。\n\n最后，我们新增了一个端点，用于根据给定的提示生成多个文本补全结果，并支持指定语法文件以强制生成特定格式的响应，例如 JSON。此外，还有一个非常实用的新功能：实时应用日志查看器，可通过浏览器访问，支持语法高亮显示，并提供下载日志或复制到剪贴板的选项。这使得用户无需直接 SSH 登录服务器即可实时监控日志。\n\n## 截图\n![Swiss Army Llama Swagger UI](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDicklesworthstone_swiss_army_llama_readme_42018885e515.png)\n![Swiss Army Llama 运行中](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDicklesworthstone_swiss_army_llama_readme_a5676429e904.png)\n\n*简要说明：* 如果您只想在一台全新的 Ubuntu 22+ 系统上快速试用（注意：此操作会通过 apt 安装 Docker），可以执行以下命令：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FDicklesworthstone\u002Fswiss_army_llama\ncd swiss_army_llama\nchmod +x setup_dockerized_app_on_fresh_machine.sh\nsudo .\u002Fsetup_dockerized_app_on_fresh_machine.sh\n```\n\n如果您希望在 Python 虚拟环境中原生运行（不使用 Docker，推荐方式），可以使用以下命令：\n\n```bash\nsudo apt-get update\nsudo apt-get install build-essential libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig redis-server libpoppler-cpp-dev pkg-config -y\nsudo systemctl enable redis-server\nsudo systemctl start redis\ngit clone https:\u002F\u002Fgithub.com\u002FDicklesworthstone\u002Fswiss_army_llama\ncd swiss_army_llama\npython3 -m venv venv\nsource venv\u002Fbin\u002Factivate\npython3 -m pip install --upgrade pip\npython3 -m pip install wheel\npython3 -m pip install --upgrade setuptools 
wheel\npip install -r requirements.txt\npython3 swiss_army_llama.py\n```\n\n或者，您也可以直接运行附带的脚本，它会在您的机器上尚未安装 PyEnv 的情况下自动安装 PyEnv，然后安装 Python 3.12 并为您创建虚拟环境。您可以在一台全新的 Ubuntu 机器上仅用一条命令完成所有操作，如下所示：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FDicklesworthstone\u002Fswiss_army_llama && cd swiss_army_llama && chmod +x install_swiss_army_llama.sh && .\u002Finstall_swiss_army_llama.sh && pyenv local 3.12 && source venv\u002Fbin\u002Factivate && python swiss_army_llama.py\n```\n\n随后，在 VPS 上使用 `\u003Cyour_static_ip_address>:8089` 打开浏览器，即可访问 FastAPI 的 Swagger 页面 `http:\u002F\u002Flocalhost:8089`。\n\n如果是在本地机器上运行，则访问 `localhost:8089` 即可——不过，切记不要在自己的机器上以 sudo 权限运行不受信任的代码！建议租用一台价格低廉的 VPS，每月仅需 30 美元即可进行实验。\n\n您可以在 [这里](https:\u002F\u002Fasciinema.org\u002Fa\u002F601603) 观看自动化部署过程的实际演示。\n\n---\n\n## 功能特性\n\n1. **文本嵌入计算**：通过 llama_cpp 利用预训练的 LLama3 及其他大语言模型，为任意提供的文本生成嵌入向量。\n2. **嵌入缓存**：高效地在 SQLite 中存储和检索已计算的嵌入，从而减少重复计算。\n3. **高级相似度度量与检索**：使用作者自研的 Rust 编写的 `fast_vector_similarity` 库，提供高度优化的高级相似度度量方法，如 Spearman 秩相关系数、Kendall 等级相关系数、近似距离相关性、Jensen-Shannon 相依性度量及 Hoeffding D 统计量。同时支持基于 FAISS 向量检索技术的缓存嵌入语义搜索。\n4. **两步高级语义搜索**：API 首先利用 FAISS 和余弦相似度进行快速筛选，随后再应用 Spearman 秩相关系数、Kendall 等级相关系数、近似距离相关性、Jensen-Shannon 相依性度量及 Hoeffding D 等多种相似度度量方法，以实现更为细致的比较。\n5. **文档文件处理**：该库现可接受更广泛的文件类型，包括纯文本、PDF、MS Word 文档及图像等，并能自动进行 OCR 处理。每句话的嵌入向量将以记录、表格等多种格式返回，使用 Pandas 的 to_json() 函数组织数据。\n6. **高级文本预处理**：库内采用更先进的句子分割器，将文本切分为有意义的句子。它能够正确处理缩写、域名或数字中出现的句号，并确保即使在引用的情况下也能完整提取句子。此外，还能有效解决扫描文档中常见的分页问题，例如不自然的换行符和连字符断行。\n7. **音频转录与嵌入**：上传 MP3 或 WAV 格式的音频文件，库会使用 OpenAI 的 Whisper 模型进行转录。用户还可选择为转录文本计算句子嵌入。\n8. **RAM 盘使用**：可选地使用 RAM 盘来存储模型，以加快访问和执行速度。系统会自动创建和管理 RAM 盘。\n9. **健壮的异常处理**：具备全面的异常处理机制，确保系统的稳定性。\n10. **交互式 API 文档**：集成 Swagger UI，提供交互友好且用户友好的体验，能够处理大规模结果集而不崩溃。\n11. **可扩展性和并发性**：基于 FastAPI 框架构建，能够处理并发请求，并支持并行推理，且并发级别可配置。\n12. **灵活的配置选项**：通过环境变量和输入参数提供可配置设置，包括 JSON 或 ZIP 文件等响应格式。\n13. **全面的日志记录**：捕获关键信息并生成详细日志，既不会占用过多存储空间，也不会影响日志的可读性。\n14. **多模型与多度量支持**：兼容多种嵌入模型和相似度度量方法，可根据用户需求灵活定制。\n15. 
**基于指定语法生成多个补全结果**：根据给定的输入提示，获取结构化的 LLM 补全结果。\n16. **浏览器实时日志查看器**：任何有权访问 API 服务器的用户均可方便地查看应用程序日志，从而深入了解其请求的执行过程。\n17. **使用 Redis 进行请求锁定**：借助 Redis 实现多个 Uvicorn 工作进程的并行运行，避免彼此冲突。\n\n## 演示屏幕录像\n[这里](https:\u002F\u002Fasciinema.org\u002Fa\u002F39dZ8vv9nkcNygasUl35wnBPq)是我在 Swagger 页面上与之交互并发出请求时的实时控制台输出。\n\n---\n\n## 系统要求\n\n运行本应用所需的系统要求（以支持 textract 处理的所有文件类型）：\n\n```bash\nsudo apt-get update\nsudo apt-get install libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig -y\n```\n\nPython 依赖项：\n\n```bash\naioredis\naioredlock\naiosqlite\napscheduler\nfaiss-cpu\nfast_vector_similarity\nfastapi\nfaster-whisper\nfilelock\nhttpx\nllama-cpp-python\nmagika\nmutagen\nnvgpu\npandas\npillow\npsutil\npydantic\nPyPDF2\npytest\npython-decouple\npython-multipart\npytz\nredis\nruff\nscikit-learn\nscipy\nsqlalchemy\ntextract-py3\nuvicorn\nuvloop\nzstandard\n```\n\n## 运行应用\n\n您可以通过以下命令运行该应用：\n\n```bash\npython swiss_army_llama.py\n```\n\n服务器将在 `0.0.0.0` 上，由 `SWISS_ARMY_LLAMA_SERVER_LISTEN_PORT` 变量定义的端口启动。\n\n访问 Swagger UI：\n\n```\nhttp:\u002F\u002Flocalhost:\u003CSWISS_ARMY_LLAMA_SERVER_LISTEN_PORT>\n```\n\n## 配置\n\n您可以通过编辑附带的 `.env` 文件轻松配置服务。以下是可用的配置选项列表：\n\n- `USE_SECURITY_TOKEN`：是否使用硬编码的安全令牌。（例如 `1`）\n- `USE_PARALLEL_INFERENCE_QUEUE`：是否启用并行处理。（例如 `1`）\n- `MAX_CONCURRENT_PARALLEL_INFERENCE_TASKS`：并行推理任务的最大数量。（例如 `30`）\n- `DEFAULT_MODEL_NAME`：默认使用的模型名称。（例如 `Llama-3-8B-Instruct-64k`）\n- `LLM_CONTEXT_SIZE_IN_TOKENS`：LLM 的上下文大小（以 token 数量计）。（例如 `512`）\n- `SWISS_ARMY_LLAMA_SERVER_LISTEN_PORT`：服务监听的端口号。（例如 `8089`）\n- `UVICORN_NUMBER_OF_WORKERS`：Uvicorn 的工作进程数量。（例如 `2`）\n- `MINIMUM_STRING_LENGTH_FOR_DOCUMENT_EMBEDDING`：文档嵌入的最小字符串长度。（例如 `15`）\n- `MAX_RETRIES`：数据库锁定时的最大重试次数。（例如 `10`）\n- `DB_WRITE_BATCH_SIZE`：数据库写入批次大小。（例如 `25`）\n- `RETRY_DELAY_BASE_SECONDS`：每次重试的基础延迟时间（以秒计）。（例如 `1`）\n- `JITTER_FACTOR`：重试时的抖动因子。（例如 `0.1`）\n- `USE_RAMDISK`：是否使用 RAM 盘。（例如 `1`）\n- `RAMDISK_PATH`：RAM 盘的路径。（例如 
`\"\u002Fmnt\u002Framdisk\"`）\n- `RAMDISK_SIZE_IN_GB`：RAM 盤的大小（以 GB 计）。（例如 `40`）\n\n## 贡献\n如果您希望为该项目贡献力量，请提交 Pull Request！我非常期待社区的参与，共同将其打造为一个标准库！\n\n## 许可证\n本项目采用 MIT 许可证授权。\n\n## 我在 Google 上找到的一些羊驼刀图片\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDicklesworthstone_swiss_army_llama_readme_c94bc0d470f1.webp\" width=\"500\" style=\"margin-right: 10px;\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDicklesworthstone_swiss_army_llama_readme_bc38cb57de5a.jpg\" width=\"500\">\n\u003C\u002Fp>\n\n---\n\n## 设置与配置\n\n### 内存磁盘配置\n\n要为内存磁盘的设置和卸载启用免密码 sudo，请使用 `sudo visudo` 编辑 `sudoers` 文件。添加以下行，将 `username` 替换为您的实际用户名：\n\n```plaintext\nusername ALL=(ALL) NOPASSWD: \u002Fbin\u002Fmount -t tmpfs -o size=*G tmpfs \u002Fmnt\u002Framdisk\nusername ALL=(ALL) NOPASSWD: \u002Fbin\u002Fumount \u002Fmnt\u002Framdisk\n```\n\n该应用程序提供了设置、清空和管理内存磁盘的功能。内存磁盘用于将模型存储在内存中以实现更快的访问速度。它会计算可用的 RAM 并相应地设置内存磁盘。`setup_ramdisk`、`copy_models_to_ramdisk` 和 `clear_ramdisk` 函数负责管理这些任务。\n\n## API 端点\n\n以下是可用的端点：\n\n- **GET `\u002Fget_list_of_available_model_names\u002F`**：获取可用模型名称。检索可用于生成嵌入的可用模型名称列表。\n- **GET `\u002Fget_all_stored_strings\u002F`**：获取所有字符串。从数据库中检索已计算嵌入的所有存储字符串列表。\n- **GET `\u002Fget_all_stored_documents\u002F`**：获取所有存储文档。从数据库中检索已计算嵌入的所有存储文档列表。\n- **GET `\u002Fshow_logs\u002F`**：默认显示最近 5 分钟的日志。也可以通过参数指定，例如 `\u002Fshow_logs\u002F{minutes}`，以获取最近 N 分钟的日志数据。\n- **POST `\u002Fadd_new_model\u002F`**：通过 URL 添加新模型。提交新模型的下载 URL 并使用。模型必须为 `.gguf` 格式且大于 100 MB，以确保其为有效模型文件（可以直接粘贴 Hugging Face 的 URL）。\n- **POST `\u002Fget_embedding_vector_for_string\u002F`**：获取给定文本字符串的嵌入向量。使用指定的模型为给定输入文本字符串检索嵌入向量。\n- **POST `\u002Fcompute_similarity_between_strings\u002F`**：计算两个字符串之间的相似度。利用 `fast_vector_similarity` 库，使用指定模型的嵌入和选定的相似度度量来计算两个输入字符串之间的相似度。\n- **POST `\u002Fsearch_stored_embeddings_with_query_string_for_semantic_similarity\u002F`**：从数据库中存储的嵌入中获取最相似的字符串。查找数据库中与给定“查询”文本最相似的字符串。\n- **POST 
`\u002Fadvanced_search_stored_embeddings_with_query_string_for_semantic_similarity\u002F`**：执行两步高级语义搜索。首先使用 FAISS 和余弦相似度缩小最相似的字符串范围，然后应用其他相似度度量进行更精细的比较。\n- **POST `\u002Fget_all_embedding_vectors_for_document\u002F`**：获取文档的嵌入。提取文档的文本嵌入。该端点支持纯文本、.doc\u002F.docx（MS Word）、PDF 文件、图像（使用 Tesseract OCR）以及 textract 库支持的许多其他文件类型。\n- **POST `\u002Fcompute_transcript_with_whisper_from_audio\u002F`**：使用 Whisper 和 LLM 转录并嵌入音频。该端点接受一个音频文件，并可选择同时计算文档嵌入。转录和嵌入会被存储，用户可以下载包含嵌入的 ZIP 文件。\n- **POST `\u002Fget_text_completions_from_input_prompt\u002F`**：从指定的 LLM 模型获取多个补全结果，并可指定一个语法文件，以强制响应采用特定格式，例如 JSON。\n- **POST `\u002Fclear_ramdisk\u002F`**：清空内存磁盘端点。如果启用了内存磁盘，则将其清空。\n\n有关详细的请求和响应模式，请参阅根 URL 或本 `README` 文档末尾部分提供的 Swagger UI。\n\n## 异常处理\n\n该应用程序具有强大的异常处理机制，能够应对各种类型的错误，包括数据库错误和一般异常。针对 `SQLAlchemyError` 和一般 `Exception` 定义了自定义异常处理器。\n\n## 日志记录\n\n日志记录配置为 INFO 级别，以便提供详细的调试和监控日志。日志记录器会记录应用程序的状态、错误和活动信息。\n\n日志文件名为 `swiss_army_llama.log`，并实现了日志文件轮转机制以处理日志备份。轮转文件处理器的最大文件大小设置为 10 MB，最多保留 5 个备份文件。\n\n当日志文件达到最大大小时，它会被移动到 `old_logs` 目录，并创建一个新的日志文件。日志条目也会输出到标准输出流。\n\n以下是日志配置的一些详细信息：\n\n- 日志级别：INFO\n- 日志格式：`%(asctime)s - %(levelname)s - %(message)s`\n- 最大日志文件大小：10 MB\n- 备份数量：5\n- 旧日志目录：`old_logs`\n\n此外，SQLAlchemy 引擎的日志级别被设置为 WARNING，以抑制冗长的数据库日志。\n\n## 数据库结构\n\n该应用程序使用 SQLAlchemy ORM 通过 SQLite 数据库进行操作。以下是使用的数据模型，可在 `embeddings_data_models.py` 文件中找到：\n\n### TextEmbedding 表\n\n- `id`：主键\n- `text`：已计算嵌入的文本\n- `text_hash`：使用 SHA3-256 计算的文本哈希值\n- `embedding_pooling_method`：用于池化嵌入的方法\n- `embedding_hash`：计算出的嵌入哈希值\n- `llm_model_name`：用于计算嵌入的模型名称\n- `corpus_identifier_string`：用于将嵌入分组到特定语料库的可选字符串标识符\n- `embedding_json`：计算出的嵌入，以 JSON 格式存储\n- `ip_address`：客户端 IP 地址\n- `request_time`：请求时间戳\n- `response_time`：响应时间戳\n- `total_time`：处理请求所花费的总时间\n- `document_file_hash`：引用 DocumentEmbedding 表的外键\n- `document`：与 DocumentEmbedding 的关系\n\n### DocumentEmbedding 表\n\n- `id`：主键\n- `document_hash`：引用 Documents 表的外键\n- `filename`：文档文件名\n- `mimetype`：文档文件的 MIME 类型\n- `document_file_hash`：文件的哈希值\n- 
`embedding_pooling_method`：用于池化嵌入的方法\n- `llm_model_name`：用于计算嵌入的模型名称\n- `corpus_identifier_string`：用于将文档分组到特定语料库的可选字符串标识符\n- `file_data`：原始文件的二进制数据\n- `sentences`：从文档中提取的句子\n- `document_embedding_results_json_compressed_binary`：使用 Z-standard 压缩算法压缩后的 JSON 格式嵌入结果\n- `ip_address`：客户端 IP 地址\n- `request_time`：请求时间戳\n- `response_time`：响应时间戳\n- `total_time`：处理请求所花费的总时间\n- `embeddings`：与 TextEmbedding 的关系\n- `document`：与 Document 的关系\n\n### 文档表\n\n- `id`: 主键\n- `llm_model_name`: 与文档关联的模型名称\n- `corpus_identifier_string`: 可选的字符串标识符，用于将文档分组到特定语料库中\n- `document_hash`: 文档的计算哈希值\n- `document_embeddings`: 与 DocumentEmbedding 的关系\n\n### 音频转录表\n\n- `audio_file_hash`: 主键\n- `audio_file_name`: 音频文件名\n- `audio_file_size_mb`: 文件大小（MB）\n- `segments_json`: 转录片段的 JSON 格式\n- `combined_transcript_text`: 合并后的转录文本\n- `combined_transcript_text_list_of_metadata_dicts`: 合并转录中每个片段的元数据字典列表\n- `info_json`: 转录信息的 JSON 格式\n- `ip_address`: 客户端 IP 地址\n- `request_time`: 请求的时间戳\n- `response_time`: 响应的时间戳\n- `total_time`: 处理请求所用的总时间\n- `corpus_identifier_string`: 可选的字符串标识符，用于将转录分组到特定语料库中\n\n### 数据库关系\n\n1. **TextEmbedding - DocumentEmbedding**:\n   - `TextEmbedding` 表有一个外键 `document_file_hash`，引用 `DocumentEmbedding` 表的 `document_file_hash`。\n   - 这意味着多个文本嵌入可以属于同一个文档嵌入，从而建立了一对多的关系。\n\n2. **DocumentEmbedding - Document**:\n   - `DocumentEmbedding` 表有一个外键 `document_hash`，引用 `Document` 表的 `document_hash`。\n   - 这建立了 `Document` 和 `DocumentEmbedding` 之间的一对多关系。\n\n3. **AudioTranscript**:\n   - 根据提供的代码，该表与其他表之间没有直接关系。\n\n4. 
**请求\u002F响应模型**:\n   - 这些模型与数据库表没有直接关系，但用于处理 API 请求和响应。\n   - 以下 Pydantic 模型用于请求和响应验证：\n     - EmbeddingRequest\n     - SimilarityRequest\n     - SemanticSearchRequest\n     - SemanticSearchResponse\n     - AdvancedSemanticSearchRequest\n     - AdvancedSemanticSearchResponse\n     - EmbeddingResponse\n     - SimilarityResponse\n     - AllStringsResponse\n     - AllDocumentsResponse\n     - TextCompletionRequest\n     - TextCompletionResponse\n     - ImageQuestionResponse\n     - AudioTranscriptResponse\n     - ShowLogsIncrementalModel\n     - AddGrammarRequest\n     - AddGrammarResponse\n\n有关详细字段说明和验证，请参阅 `embeddings_data_models.py` 文件。\n\n## 性能优化\n\n本节重点介绍了集成到所提供代码中的主要性能增强措施，以确保快速响应和最佳资源管理。\n\n### 1. 异步编程：\n\n- **优势**：能够并发处理多项任务，提升 I\u002FO 密集型操作（如数据库事务和网络请求）的效率。\n- **实现**：使用 Python 的 `asyncio` 库进行异步数据库操作。\n\n### 2. 数据库优化：\n\n- **预写日志（WAL）模式**：支持并发读写，特别适用于频繁写入的应用场景。\n- **指数退避重试逻辑**：通过逐步增加等待时间来重试操作，以应对数据库锁定情况。\n- **批量写入**：将写入操作聚合在一起，提高数据库交互效率。\n- **数据库写入队列**：使用异步队列序列化写入操作，确保数据库写入的一致性和无冲突性。\n\n### 3. RAM 盘利用：\n\n- **优势**：通过优先在内存中执行 I\u002FO 密集型任务，而不是在磁盘上，从而加快处理速度。\n- **实现**：检测并优先使用 RAM 盘（`\u002Fmnt\u002Framdisk`），如果可用；否则默认使用标准文件系统。\n\n### 4. 模型缓存：\n\n- **优势**：通过将加载的模型保留在内存中，减少后续请求的开销。\n- **实现**：使用全局 `model_cache` 字典存储和检索模型。\n\n### 5. 并行推理：\n\n- **优势**：提升对多个数据单元（如文档中的句子）的处理速度。\n- **实现**：采用 `asyncio.gather` 进行并发推理，并通过信号量（`MAX_CONCURRENT_PARALLEL_INFERENCE_TASKS`）进行控制。\n\n### 6. 
嵌入缓存：\n\n- **优势**：一旦为特定文本计算出嵌入，就会将其存储在数据库中，从而避免在后续请求中重复计算。\n- **实现**：当收到计算嵌入的请求时，系统会首先检查数据库。如果找到该文本的嵌入，则立即返回，从而确保更快的响应时间。\n\n---\n\n### 容器化版本\n\n本仓库包含一个 Bash 脚本 `setup_dockerized_app_on_fresh_machine.sh`，它可以自动为您完成所有操作，包括使用 apt install 安装 Docker。\n\n要使用该脚本，请先将其设置为可执行文件，然后按如下方式运行：\n\n```bash\nchmod +x setup_dockerized_app_on_fresh_machine.sh\nsudo .\u002Fsetup_dockerized_app_on_fresh_machine.sh\n```\n\n如果您更倾向于手动安装，请阅读以下说明：\n\n#### 先决条件\n\n请确保您的系统上已安装 Docker。如果没有，请按照以下步骤在 Ubuntu 上安装 Docker：\n\n```bash\nsudo apt-get update\nsudo apt-get install docker.io\nsudo systemctl start docker\nsudo docker --version\nsudo usermod -aG docker $USER\n```\n\n您可能需要注销并重新登录，或重启系统以使新的组权限生效；否则，在后续步骤中构建和运行容器时，需使用 sudo。\n\n#### 应用程序的设置与运行\n\n1. **克隆仓库：**\n\n   将 Swiss Army Llama 仓库克隆到本地机器：\n\n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002FDicklesworthstone\u002Fswiss_army_llama\n   cd swiss_army_llama\n   ```\n\n2. **构建 Docker 镜像：**\n\n   使用提供的 Dockerfile 构建 Docker 镜像：\n\n   ```bash\n   sudo docker build -t llama-embeddings .\n   ```\n\n3. **运行 Docker 容器：**\n\n   运行 Docker 容器，并将容器的 8089 端口映射到主机的 8089 端口：\n\n   ```bash\n   sudo docker run -p 8089:8089 llama-embeddings\n   ```\n\n4. **访问应用程序：**\n\n   此时，FastAPI 应用程序可通过 `http:\u002F\u002Flocalhost:8089` 访问，或者如果您在 VPS 实例上运行，则可通过其静态 IP 地址访问（您可以在 Contabo 以约 30 美元\u002F月的价格租用一台配备 10 核 CPU、30GB 内存、1TB SSD 并带有静态 IP 的 Ubuntu 22.04 服务器，这是我目前找到的最便宜的选择）。\n\n   您可以使用 `curl` 等工具与 API 交互，或通过 `http:\u002F\u002Flocalhost:8089\u002Fdocs` 访问 FastAPI 文档。\n\n5. **查看日志：**\n\n   应用程序的日志可以直接在运行 `docker run` 命令的终端中查看。\n\n#### 停止与管理容器\n\n- 要停止正在运行的容器，可在终端中按 `Ctrl+C`，或使用 `docker ps` 查找容器 ID，然后运行 `sudo docker stop \u003Ccontainer_id>`。\n- 若要删除已构建的镜像，可使用 `sudo docker rmi llama-embeddings`。\n\n---\n\n## 启动流程\n\n在启动过程中，应用程序会执行以下任务：\n\n1. **数据库初始化**：\n    - 应用程序会初始化 SQLite 数据库，创建表并执行重要的 PRAGMA 命令以优化性能。\n    - 一些重要的 SQLite PRAGMA 包括将数据库设置为预写日志（WAL）模式、同步模式设为 NORMAL、缓存大小增加至 1GB、忙超时时间设为 2 秒，以及 WAL 自动检查点设为 100。\n2. 
**初始化数据库写入器**：\n    - 会初始化一个专用的数据库写入器 (`DatabaseWriter`) 和一个异步队列来处理写入操作。\n    - 创建一组哈希值，用于记录当前正在处理或已完成的操作，从而避免队列中出现重复操作。\n3. **RAM 盘设置**：\n    - 如果启用了 `USE_RAMDISK` 变量且用户具有相应权限，应用程序将设置 RAM 盘。\n    - 应用程序会检查指定路径是否已存在 RAM 盘；如果不存在，则计算最佳 RAM 盘大小并进行设置。\n    - 如果启用了 RAM 盘功能但用户缺乏必要权限，RAM 盘功能将被禁用，应用程序将继续运行而不使用 RAM 盘。\n4. **模型下载**：\n    - 应用程序会下载所需的模型。\n5. **模型加载**：\n    - 每个下载的模型都会被加载到内存中。如果发现任何模型文件缺失，则会记录错误日志。\n6. **构建 FAISS 索引**：\n    - 应用程序会基于数据库中的嵌入向量创建 FAISS 索引，以便高效地进行相似度搜索。\n    - 相关文本会按模型名称存储，以供后续使用。\n\n注意：\n- 如果启用了 RAM 盘功能但用户缺乏必要权限，应用程序将禁用 RAM 盘功能并继续运行。\n- 对于任何数据库操作，如果数据库被锁定，应用程序将尝试采用指数退避和抖动机制重试几次。\n\n---\n\n## 端点功能与工作流程概述\n\n以下是 FastAPI 服务器提供的主要端点的详细说明，包括其功能、输入参数以及它们如何与底层模型和系统交互：\n\n### 1. `\u002Fget_embedding_vector_for_string\u002F` (POST)\n\n#### 目的\n使用指定模型获取给定文本字符串的嵌入向量。\n\n#### 参数\n- `text`: 要获取嵌入向量的输入文本。\n- `model_name`: 用于计算嵌入的模型名称（可选；未提供时将使用默认模型）。\n- `token`: 安全令牌（可选）。\n- `client_ip`: 客户端 IP 地址（可选）。\n\n#### 流程\n1. **获取嵌入**：函数会使用指定或默认模型检索或计算所提供文本的嵌入向量。\n2. **返回结果**：响应中将返回输入文本字符串的嵌入向量。\n\n### 2. `\u002Fcompute_similarity_between_strings\u002F` (POST)\n\n#### 目的\n使用指定模型的嵌入向量和选定的相似度度量方法，计算两个输入字符串之间的相似度。\n\n#### 参数\n- `text1`: 第一个输入文本。\n- `text2`: 第二个输入文本。\n- `llm_model_name`: 用于计算嵌入的模型名称（可选）。\n- `similarity_measure`: 要使用的相似度度量方法。支持的方法包括 `all`、`spearman_rho`、`kendall_tau`、`approximate_distance_correlation`、`jensen_shannon_dependency_measure` 和 `hoeffding_d`（可选；默认为 `all`）。\n\n#### 流程\n1. **获取嵌入**：使用指定或默认模型检索或计算 `text1` 和 `text2` 的嵌入向量。\n2. **计算相似度**：根据指定的相似度度量方法计算两个嵌入向量之间的相似度。\n3. **返回结果**：响应中将返回相似度分数，以及嵌入向量和输入文本。\n\n### 3. `\u002Fsearch_stored_embeddings_with_query_string_for_semantic_similarity\u002F` (POST)\n\n#### 目的\n在数据库中查找与给定输入“query”文本最相似的字符串。该端点使用预先计算好的 FAISS 索引，快速搜索最接近的匹配字符串。\n\n#### 参数\n- `query_text`: 用于查找最相似字符串的输入文本。\n- `model_name`: 用于计算嵌入的模型。\n- `number_of_most_similar_strings_to_return`:（可选）返回的最相似字符串数量，默认为 10。\n- `token`: 安全令牌（可选）。\n\n#### 工作流程\n1. **搜索 FAISS 索引**：通过基于存储嵌入构建的 FAISS 索引，查找与 `query_text` 最相似的嵌入。\n2. 
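**调用示例（示意）**：
   下面用标准库构造对该端点的 POST 请求，字段名取自上文参数说明；服务地址与模型名为示意值，实际发送前需确保服务已在 8089 端口运行，因此 `urlopen` 调用被注释掉：

```python
import json
import urllib.request

# 字段名取自上文参数说明；查询文本与模型名为示意值
payload = {
    "query_text": "法国的历代国王",
    "model_name": "openchat_v3.2_super",
    "number_of_most_similar_strings_to_return": 5,
}
req = urllib.request.Request(
    "http://localhost:8089/search_stored_embeddings_with_query_string_for_semantic_similarity/",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:   # 服务启动后取消注释
#     print(json.load(resp))
```

2. 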
**返回结果**：将数据库中找到的最相似字符串及其相似度分数返回到响应中。\n\n### 4. `\u002Fadvanced_search_stored_embeddings_with_query_string_for_semantic_similarity\u002F` (POST)\n\n#### 目的\n执行两步高级语义搜索。首先利用 FAISS 和余弦相似度进行初步筛选，随后再使用其他相似度度量进行精细化比较。\n\n#### 参数\n- `query_text`: 用于查找最相似字符串的输入文本。\n- `llm_model_name`: 用于计算嵌入的模型。\n- `similarity_filter_percentage`:（可选）根据余弦相似度筛选的嵌入百分比；默认为 0.02（即前 2%）。\n- `number_of_most_similar_strings_to_return`:（可选）在第二次相似度度量后返回的最相似字符串数量；默认为 10。\n\n#### 工作流程\n1. **初始筛选**：使用 FAISS 和余弦相似度查找一组相似字符串。\n2. **精细化比较**：对筛选后的集合应用额外的相似度度量。\n3. **返回结果**：返回最相似的字符串及其多个相似度分数。\n\n#### 示例请求\n```json\n{\n  \"query_text\": \"帮我找到最相似的字符串！\",\n  \"llm_model_name\": \"openchat_v3.2_super\",\n  \"similarity_filter_percentage\": 0.02,\n  \"number_of_most_similar_strings_to_return\": 5\n}\n```\n\n### 5. `\u002Fget_all_embedding_vectors_for_document\u002F` (POST)\n\n#### 目的\n提取文档的文本嵌入。该库现在支持多种文件类型，包括纯文本、.doc\u002F.docx、PDF 文件、图像（使用 Tesseract OCR）以及 `textract` 库支持的其他多种类型。\n\n#### 参数\n- `file`: 上传的文档文件（可以是纯文本、.doc\u002F.docx、PDF 等）。\n- `llm_model_name`:（可选）用于计算嵌入的模型。\n- `json_format`:（可选）JSON 响应的格式。\n- `send_back_json_or_zip_file`: 是否返回 JSON 文件或包含嵌入文件的 ZIP 文件（可选，默认为 `zip`）。\n- `token`: 安全令牌（可选）。\n\n### 6. 
`\u002Fcompute_transcript_with_whisper_from_audio\u002F` (POST)\n\n#### 目的\n转录音频文件，并可选地为生成的转录文本文档计算嵌入。该端点使用 Whisper 模型进行转录，并使用语言模型生成嵌入。转录和嵌入随后可以被存储，并提供一个包含嵌入的 ZIP 文件供下载。\n\n#### 参数\n- `file`: 需要上传以进行转录的音频文件。\n- `compute_embeddings_for_resulting_transcript_document`: 布尔值，指示是否应计算文档嵌入（可选，默认为 False）。\n- `llm_model_name`: 用于计算嵌入的语言模型（可选，默认为默认模型名称）。\n- `req`: HTTP 请求对象，用于获取额外的请求元数据（可选）。\n- `token`: 安全令牌（可选）。\n- `client_ip`: 客户端 IP 地址（可选）。\n\n#### 请求文件及参数\n您需要使用 multipart\u002Fform-data 请求上传音频文件。其他参数，如 `compute_embeddings_for_resulting_transcript_document` 和 `llm_model_name`，可以作为表单字段一并发送。\n\n#### 示例请求\n```bash\ncurl -X 'POST' \\\n  'http:\u002F\u002Flocalhost:8000\u002Fcompute_transcript_with_whisper_from_audio\u002F' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Authorization: Bearer YOUR_ACCESS_TOKEN' \\\n  -F 'file=@your_audio_file.wav' \\\n  -F 'compute_embeddings_for_resulting_transcript_document=true' \\\n  -F 'llm_model_name=custom-llm-model'\n```\n\n### 7. `\u002Fget_text_completions_from_input_prompt\u002F` (POST)\n\n#### 目的\n使用指定模型为给定输入提示生成文本补全。\n\n#### 参数\n- `request`: 包含各种选项的 JSON 对象，如 `input_prompt`、`llm_model_name` 等。\n- `token`: 安全令牌（可选）。\n- `req`: HTTP 请求对象（可选）。\n- `client_ip`: 客户端 IP 地址（可选）。\n\n#### 请求 JSON 格式\nJSON 对象应包含以下键：\n- `input_prompt`\n- `llm_model_name`\n- `temperature`\n- `grammar_file_string`\n- `number_of_completions_to_generate`\n- `number_of_tokens_to_generate`\n\n#### 示例请求\n```json\n{\n  \"input_prompt\": \"17世纪的法国国王：\",\n  \"llm_model_name\": \"phind-codellama-34b-python-v1\",\n  \"temperature\": 0.95,\n  \"grammar_file_string\": \"json\",\n  \"number_of_tokens_to_generate\": 500,\n  \"number_of_completions_to_generate\": 3\n}\n```\n\n### 8. `\u002Fget_list_of_available_model_names\u002F` (GET)\n\n#### 目的\n获取可用于生成嵌入的可用模型名称列表。\n\n#### 参数\n- `token`: 安全令牌（可选）。\n\n### 9. `\u002Fget_all_stored_strings\u002F` (GET)\n\n#### 目的\n从数据库中检索所有已存储且已计算嵌入的字符串列表。\n\n#### 参数\n- `token`: 安全令牌（可选）。\n\n### 10. 
`\u002Fget_all_stored_documents\u002F` (GET)\n\n#### 目的\n从数据库中检索所有已存储且已计算嵌入的文档列表。\n\n#### 参数\n- `token`: 安全令牌（可选）。\n\n### 11. `\u002Fclear_ramdisk\u002F` (POST)\n\n#### 目的\n清空 RAM 磁盘以释放内存。\n\n#### 参数\n- `token`: 安全令牌（可选）。\n\n### 12. `\u002Fdownload\u002F{file_name}` (GET)\n\n#### 目的\n下载包含通过 `\u002Fcompute_transcript_with_whisper_from_audio\u002F` 端点生成的文档嵌入的 ZIP 文件。该下载的 URL 将在音频文件转录端点的 JSON 响应中提供。\n\n#### 参数\n- `file_name`: 您想要下载的 ZIP 文件的名称。\n\n### 13. `\u002Fadd_new_model\u002F` (POST)\n\n#### 目的\n提交新的模型 URL 进行下载和使用。该模型必须为 `.gguf` 格式，且大小需大于 100 MB，以确保其为有效的模型文件。\n\n#### 参数\n- `model_url`: 模型权重文件的 URL，必须以 `.gguf` 结尾。\n- `token`: 安全令牌（可选）。\n\n### 词元级嵌入向量池化\n\n池化方法旨在聚合词元级嵌入，而这些嵌入通常由于句子或文档中词元数量的不同而长度不一。通过将这些词元级嵌入转换为一个固定长度的单一向量，我们可以确保无论输入文本的长度如何，都能以一致的方式进行表示。随后，这个固定长度的向量可以用于各种需要固定大小输入的机器学习模型中。\n\n这些池化方法的主要目标是在保证变换具有确定性且不会扭曲数据的前提下，尽可能多地保留原始词元级嵌入中的有用信息。每种方法都通过应用不同的统计或数学技术来总结词元嵌入，从而实现这一目标。\n\n#### 池化方法说明\n\n1. **SVD（奇异值分解）**：\n   - **工作原理**：将从词元嵌入矩阵的奇异值分解中得到的前两个奇异向量拼接在一起。\n   - **原理**：奇异值分解是一种降维技术，能够捕捉数据中最重要特征。使用前两个奇异向量可以提供一种紧凑的表示形式，同时保留大量关键信息。\n\n2. **SVD_First_Four**：\n   - **工作原理**：使用从词元嵌入矩阵的奇异值分解中得到的前四个奇异向量。\n   - **原理**：通过使用更多的奇异向量，该方法能够捕捉到数据中更多的方差，从而在降维的同时提供更丰富的表示。\n\n3. **ICA（独立成分分析）**：\n    - **工作原理**：对嵌入矩阵应用独立成分分析，找出统计上相互独立的成分，然后将其展平。\n    - **原理**：ICA有助于识别数据中的独立源，提供一种突出这些独立特征的表示。\n\n4. **Factor_Analysis（因子分析）**：\n    - **工作原理**：对嵌入矩阵应用因子分析，识别潜在因子，然后将其展平。\n    - **原理**：因子分析以潜在因子的形式建模数据，提供一种能够捕捉这些潜在影响的摘要表示。\n\n5. 
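**（示例）SVD 池化的 NumPy 示意**：
    按方法 1 的描述可以写出如下最小草图。这里假设取右奇异向量并拼接前 `k` 个（左/右的选择及其余细节以项目源码为准），要点是输出长度与词元数量无关：

```python
import numpy as np

def svd_pooling(token_embeddings: np.ndarray, k: int = 2) -> np.ndarray:
    """把 (词元数, 维度) 的变长矩阵压成定长向量：拼接前 k 个右奇异向量。"""
    # full_matrices=False 时 vt 的形状为 (min(n, d), d)
    _, _, vt = np.linalg.svd(token_embeddings, full_matrices=False)
    return np.concatenate([vt[i] for i in range(k)])

tokens_a = np.random.rand(7, 16)     # 7 个词元的假想嵌入
tokens_b = np.random.rand(30, 16)    # 30 个词元的假想嵌入
pooled_a = svd_pooling(tokens_a)
pooled_b = svd_pooling(tokens_b)     # 两者长度一致：2 * 16 = 32
```

5. 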
**Gaussian_Random_Projection**：\n    - **工作原理**：应用高斯随机投影来降低嵌入的维度，然后将其展平。\n    - **原理**：这种方法能够在快速高效地降低维度的同时，保持点与点之间的成对距离不变，非常适合处理大规模数据集。\n  \n---\n\n感谢您对我开源项目的关注！希望对您有所帮助。另外，我的一些商业Web应用也可能对您有帮助，如果您有兴趣的话，欢迎去看看：\n\n**[YoutubeTranscriptOptimizer.com](https:\u002F\u002Fyoutubetranscriptoptimizer.com)** 让您只需简单粘贴一个YouTube视频链接，就能自动生成不仅非常准确的直接转录文本，还能生成一份经过精心润色、格式优美、可独立于视频使用的书面文档。\n\n这份文档的内容基本与视频中讨论的一致，但读起来更像是正式的文章，而不是单纯的转录稿。此外，您还可以选择根据文档内容生成测验，包括选择题和简答题两种形式。其中，选择题会转化为交互式的HTML文件，方便您发布和分享，用户可以直接在线作答，系统还会自动批改并给出分数。\n\n**[FixMyDocuments.com](https:\u002F\u002Ffixmydocuments.com\u002F)** 允许您上传任何类型的文档——PDF文件（包括需要OCR处理的扫描版PDF）、MS Word和PowerPoint文件、图片、音频文件（如mp3、m4a等）——并将其转换为采用优质Markdown格式的高度优化版本，同时自动生成HTML和PDF版本。转换完成后，您还可以直接在网站上使用内置的Markdown编辑器对文档进行编辑，系统会保存每次修改的历史记录，并自动重新生成PDF和HTML版本。\n\n除了获得优化后的文档外，您还可以基于原始文档生成多种“衍生文档”：可供在线作答并自动评分的互动选择题测验；美观的演示文稿幻灯片（以PDF或HTML形式呈现，使用LaTeX和Reveal.js制作）；深度摘要、概念思维导图（利用Mermaid图表绘制）及大纲；可根据目标受众定制的教学计划；阅读难度分析以及针对不同年级水平的简化版本（适合为学生简化复杂概念）；还可生成Anki闪卡，既可直接导入Anki应用，也可在网站提供的友好界面中使用，等等。","# Swiss Army Llama 快速上手指南\n\nSwiss Army Llama 是一个基于 FastAPI 构建的本地大语言模型（LLM）工具集。它提供了便捷的 REST API，支持文本嵌入（Embedding）、文档处理（含 OCR）、音频转录（Whisper）、高级语义相似度计算以及结构化文本生成等功能。所有计算结果可自动缓存至 SQLite，并可通过 Swagger UI 直接交互。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**：推荐 Ubuntu 22.04 或更高版本。\n- **硬件**：建议具备 GPU 以加速 LLM 推理（可选），需足够内存加载模型。\n- **网络**：首次运行需下载模型文件，请确保网络通畅。\n\n### 前置依赖\n在安装 Python 依赖前，需先安装系统级库以支持文档解析、OCR 及音频处理：\n\n```bash\nsudo apt-get update\nsudo apt-get install build-essential libxml2-dev libxslt1-dev antiword unrtf poppler-utils pstotext tesseract-ocr flac ffmpeg lame libmad0 libsox-fmt-mp3 sox libjpeg-dev swig redis-server libpoppler-cpp-dev pkg-config -y\n```\n\n启动 Redis 服务（用于并发请求锁定）：\n\n```bash\nsudo systemctl enable redis-server\nsudo systemctl start redis\n```\n\n## 安装步骤\n\n你可以选择使用 **Python 虚拟环境**（推荐）或 **Docker** 进行部署。以下提供原生 Python 环境的安装流程。\n\n### 方式一：手动安装（推荐）\n\n1. 
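**（可选）确认基础环境**

   这并非官方安装步骤，仅为排查便利：正式安装前可先确认 Python 3 与内置的 `venv` 模块可用（官方一键脚本使用 Python 3.12）：

```bash
# 确认 Python 3 解释器与 venv 模块可用，这是后续创建虚拟环境的前提
python3 --version
python3 -c "import venv; print('venv OK')"
```

1. 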
**克隆项目代码**\n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002FDicklesworthstone\u002Fswiss_army_llama\n   cd swiss_army_llama\n   ```\n\n2. **创建并激活虚拟环境**\n   ```bash\n   python3 -m venv venv\n   source venv\u002Fbin\u002Factivate\n   ```\n\n3. **升级构建工具并安装依赖**\n   *注：国内用户若遇 pip 下载缓慢，可临时添加 `-i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple` 参数。*\n   ```bash\n   python3 -m pip install --upgrade pip wheel setuptools\n   pip install -r requirements.txt\n   ```\n\n### 方式二：一键脚本安装（仅限全新 Ubuntu 机器）\n\n如果你在一台全新的 Ubuntu 服务器上，可以使用官方提供的自动化脚本（会自动安装 PyEnv、Python 3.12 及所有依赖）：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FDicklesworthstone\u002Fswiss_army_llama && cd swiss_army_llama && chmod +x install_swiss_army_llama.sh && .\u002Finstall_swiss_army_llama.sh && pyenv local 3.12 && source venv\u002Fbin\u002Factivate && python swiss_army_llama.py\n```\n\n## 基本使用\n\n### 1. 启动服务\n\n在项目根目录下运行主程序：\n\n```bash\npython swiss_army_llama.py\n```\n\n服务默认监听 `0.0.0.0:8089`（端口可通过 `.env` 文件中的 `SWISS_ARMY_LLAMA_SERVER_LISTEN_PORT` 配置）。\n\n### 2. 访问交互界面\n\n打开浏览器访问 Swagger UI 界面：\n- 本地访问：`http:\u002F\u002Flocalhost:8089`\n- 服务器访问：`http:\u002F\u002F\u003C你的服务器 IP>:8089`\n\n在界面中，你可以直接测试所有可用接口。\n\n### 3. 核心功能示例\n\n通过 Swagger UI 或 HTTP 客户端调用以下典型功能：\n\n#### A. 获取文本嵌入 (Text Embeddings)\n调用嵌入接口，输入文本即可返回向量。系统会自动缓存结果，重复请求秒级响应。\n- **Endpoint**: `\u002Fget_embedding_vector_for_string\u002F`\n- **特性**: 支持多种池化方法（mean, svd, ica 等）。\n\n#### B. 处理文档与 OCR\n上传 PDF、Word 或图片文件，系统自动提取文本（扫描件自动调用 Tesseract OCR），分句并计算嵌入。\n- **Endpoint**: `\u002Fget_all_embedding_vectors_for_document\u002F`\n- **返回**: JSON 数据或包含结果的 ZIP 文件链接。\n\n#### C. 音频转录与嵌入\n上传 MP3 或 WAV 文件，使用 Faster Whisper 模型转录为文本，并可进一步计算句子级嵌入。\n- **Endpoint**: `\u002Fcompute_transcript_with_whisper_from_audio\u002F`\n\n#### D. 高级语义搜索\n利用 FAISS 进行初步检索，再结合 Rust 编写的高性能库计算深层相似度（如 Spearman, Hoeffding D 等）。\n- **Endpoint**: `\u002Fadvanced_search_stored_embeddings_with_query_string_for_semantic_similarity\u002F`\n\n#### E. 结构化文本生成\n指定 Grammar 文件（如 JSON 格式），让 LLM 生成符合特定结构的回复。\n- **Endpoint**: `\u002Fget_text_completions_from_input_prompt\u002F`\n\n### 4. 
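调用 API 的快速示例

以上文"结构化文本生成"接口为例，可用标准库构造一个请求体来验证服务连通性。请求体字段取自前文 API 文档中该端点的说明；模型名与参数均为示意值，实际发送需服务已启动，故 `urlopen` 调用被注释掉：

```python
import json
import urllib.request

# 字段取自前文 API 文档对该端点的说明；模型名与参数为示意值
payload = {
    "input_prompt": "17世纪的法国国王：",
    "llm_model_name": "openchat_v3.2_super",
    "temperature": 0.95,
    "grammar_file_string": "json",
    "number_of_tokens_to_generate": 500,
    "number_of_completions_to_generate": 3,
}
req = urllib.request.Request(
    "http://localhost:8089/get_text_completions_from_input_prompt/",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
# with urllib.request.urlopen(req) as resp:   # 服务启动后取消注释
#     completions = json.load(resp)
```

### 5. 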
配置说明\n编辑项目根目录下的 `.env` 文件可自定义行为，例如：\n- `USE_SECURITY_TOKEN`: 开启安全令牌验证。\n- `USE_PARALLEL_INFERENCE_QUEUE`: 启用并行推理队列。\n- `MAX_CONCURRENT_PARALLEL_INFERENCE_TASKS`: 设置最大并发任务数。\n\n修改配置后重启服务即可生效。","某法律科技团队需要构建一个能同时检索纸质合同扫描件、Word 文档及会议录音内容的智能知识库。\n\n### 没有 swiss_army_llama 时\n- 开发团队需分别集成 Tesseract OCR、Faster Whisper 和 textract 等多个独立库，环境配置繁琐且兼容性冲突频发。\n- 处理混合格式文件时缺乏统一接口，每次新增文件类型（如从 PDF 扩展到音频）都需重写大量数据预处理代码。\n- 重复计算相同文档的向量嵌入导致 GPU 资源浪费，且缺乏自动缓存机制，系统响应速度随数据量增加显著下降。\n- 仅能使用基础的余弦相似度进行检索，无法利用 Spearman 或 Hoeffding D 等高级统计指标优化长尾查询的准确率。\n\n### 使用 swiss_army_llama 后\n- 通过单一 FastAPI 服务即可自动处理扫描件 OCR、音频转录及多格式文档解析，无需手动编排复杂的外部依赖。\n- 统一的 REST 端点支持直接上传任意受支持文件，底层自动调用 Whisper 或 textract 并返回标准化向量，新格式接入零代码改动。\n- 内置 SQLite 缓存机制自动复用已计算的嵌入向量，结合可选的 RAM 磁盘加速模型加载，大幅降低延迟并节省算力。\n- 支持两阶段检索策略：先由 FAISS 快速筛选候选集，再运用 Jensen-Shannon 等高级相似度度量进行重排序，显著提升检索精度。\n\nswiss_army_llama 将原本分散复杂的非结构化数据处理流程整合为一把“瑞士军刀”，让开发者能专注于业务逻辑而非底层工程实现。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FDicklesworthstone_swiss_army_llama_42018885.png","Dicklesworthstone","Jeff Emanuel","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FDicklesworthstone_c96b6d22.jpg","Building in NY",null,"doodlestein","https:\u002F\u002Fwww.jeffreyemanuel.com\u002F","https:\u002F\u002Fgithub.com\u002FDicklesworthstone",[81,85,89],{"name":82,"color":83,"percentage":84},"Python","#3572A5",98,{"name":86,"color":87,"percentage":88},"Shell","#89e051",1.6,{"name":90,"color":91,"percentage":92},"Dockerfile","#384d54",0.4,1050,64,"2026-04-14T17:12:03",4,"Linux","未说明 (支持通过 RAM Disk 加速模型加载，主要依赖 CPU 运行 llama_cpp 和 faster-whisper)","未说明 (建议使用 RAM Disk 以加速多模型加载)",{"notes":101,"python":102,"dependencies":103},"1. 官方文档明确推荐在 Ubuntu 22+ 上运行，并提供了一键安装脚本，未提及 macOS 或 Windows 的原生支持（Windows 用户可能需要使用 Docker 或 WSL）。2. 需要安装大量系统级依赖库以支持文档解析（如 tesseract-ocr, poppler-utils, antiword 等）和音频处理。3. 使用 Redis 进行请求锁定以支持多 Worker 并发。4. 默认端口为 8089。5. 
支持多种嵌入池化方法（如 mean, svd, ica 等）和高级相似度度量。","3.12",[104,105,106,107,108,109,110,111,112,113],"llama-cpp-python","faster-whisper","fastapi","faiss-cpu","textract-py3","redis","aiosqlite","pandas","scikit-learn","uvicorn",[14,36],[116,117,118,119,120,121],"embedding-similarity","embedding-vectors","embeddings","llama2","llamacpp","semantic-search","2026-03-27T02:49:30.150509","2026-04-20T21:12:54.343841",[125,130,135,140,145,149],{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},45758,"为什么启动时提示需要分配 40GB 内存，而我的设备只有 16GB？","该程序默认并不需要使用 40GB 内存。报错是因为启用了可选的 RAM Disk（内存盘）功能。如果您关闭 RAM Disk（在 .env 文件中禁用），程序只会占用加载所选模型所需的内存。对于内存较小的设备（如 16GB Mac），建议选择较小的模型或确保 RAM Disk 处于禁用状态。此外，llama-cpp-python 支持通过设置 CMAKE_ARGS 环境变量来启用硬件加速（如 MacOS 的 Metal、Linux 的 OpenBLAS 等），例如：CMAKE_ARGS=\"-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS\" pip install llama-cpp-python。","https:\u002F\u002Fgithub.com\u002FDicklesworthstone\u002Fswiss_army_llama\u002Fissues\u002F2",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},45759,"安装时遇到 'ImportError: cannot import name field_validator from pydantic' 错误怎么办？","这通常是由于 Pydantic 版本冲突引起的。强烈建议使用 Python 虚拟环境（venv）来隔离依赖。解决步骤如下：\n1. 克隆项目：git clone https:\u002F\u002Fgithub.com\u002FDicklesworthstone\u002Fllama_embeddings_fastapi_service\n2. 进入目录并创建虚拟环境：cd llama_embeddings_fastapi_service && python3 -m venv venv\n3. 激活环境：source venv\u002Fbin\u002Factivate\n4. 升级 pip 并安装依赖：python3 -m pip install --upgrade pip && python3 -m pip install wheel && pip install -r requirements.txt\n5. 
运行服务：python3 llama_2_embeddings_fastapi_server.py","https:\u002F\u002Fgithub.com\u002FDicklesworthstone\u002Fswiss_army_llama\u002Fissues\u002F5",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},45760,"安装 llama-cpp-python 时出现 'No CMAKE_C_COMPILER could be found' 编译错误如何解决？","此错误表明系统缺少必要的 C\u002FC++ 编译器。您需要先安装基础编译工具。\n- 在 Ubuntu\u002FDebian 上，运行：sudo apt-get install build-essential cmake\n- 在 MacOS 上，运行：xcode-select --install\n- 在 Windows 上，需要安装 Visual Studio Build Tools 并确保包含 C++ 组件。\n安装完成后，重新运行 pip install -r requirements.txt 即可。","https:\u002F\u002Fgithub.com\u002FDicklesworthstone\u002Fswiss_army_llama\u002Fissues\u002F9",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},45761,"模型文件已下载但启动时报错 'failed to load model' 或找不到模型怎么办？","这个问题通常是因为底层库 ggml 已弃用并迁移到了新的 gguf 格式，而旧代码仍在寻找 .bin 文件。解决方案是更新代码以支持 .gguf 后缀，或者直接下载 gguf 格式的模型文件（如 yarn-llama-2-13b-64k.Q4_K_M.gguf）并替换原有模型。维护者已更新项目以默认支持带有 128k 上下文的 Yarn gguf 模型，并移除了不再适用的 bge base 模型。请确保拉取最新代码并使用正确的模型格式。","https:\u002F\u002Fgithub.com\u002FDicklesworthstone\u002Fswiss_army_llama\u002Fissues\u002F3",{"id":146,"question_zh":147,"answer_zh":148,"source_url":134},45762,"Llama 嵌入模型是否具有日期感知能力（Date Awareness）？","虽然 Llama 聊天模型能够从相对和绝对日期中推导时间信息，但这种能力不一定能直接转移到基于嵌入的 RAG（检索增强生成）系统中。如果您需要更复杂的语义相似度比较（包括日期感知），建议尝试使用 \u002Fadvanced_search_stored_embeddings_with_query_string_for_semantic_similarity\u002F 端点，该端点使用了 Hoeffding's D 统计量进行更高级的相似度计算。同时请注意，聊天模型的日期推理能力与嵌入模型的向量表示能力是不同的概念。",{"id":150,"question_zh":151,"answer_zh":152,"source_url":129},45763,"如何在不同操作系统上启用 GPU 或硬件加速？","该项目使用 llama-cpp-python，其硬件加速方式取决于操作系统和后端。默认情况下，Linux\u002FWindows 构建为 CPU 版本，MacOS 使用 Metal。要启用特定加速，需在安装前设置 CMAKE_ARGS 环境变量：\n- Linux (OpenBLAS): CMAKE_ARGS=\"-DLLAMA_BLAS=ON -DLLAMA_BLAS_VENDOR=OpenBLAS\" pip install llama-cpp-python\n- MacOS (Metal): 默认自动启用，也可显式设置 CMAKE_ARGS=\"-DLLAMA_METAL=ON\"\n- Windows (CUDA\u002FcuBLAS): $env:CMAKE_ARGS=\"-DLLAMA_CUBLAS=ON\"; pip install llama-cpp-python\n具体支持的后端包括 OpenBLAS, cuBLAS, CLBlast, HIPBLAS 和 
Metal，详情可参考 llama.cpp 官方文档。",[]]