[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tool-aphrodite-engine--aphrodite-engine":3,"similar-aphrodite-engine--aphrodite-engine":217},{"id":4,"github_repo":5,"name":6,"description_en":7,"description_zh":8,"ai_summary_zh":8,"readme_en":9,"readme_zh":10,"quickstart_zh":11,"use_case_zh":12,"hero_image_url":13,"owner_login":6,"owner_name":14,"owner_avatar_url":15,"owner_bio":16,"owner_company":17,"owner_location":17,"owner_email":17,"owner_twitter":17,"owner_website":17,"owner_url":18,"languages":19,"stars":53,"forks":54,"last_commit_at":55,"license":56,"difficulty_score":57,"env_os":58,"env_gpu":59,"env_ram":60,"env_deps":61,"category_tags":69,"github_topics":72,"view_count":31,"oss_zip_url":17,"oss_zip_packed_at":17,"status":83,"created_at":84,"updated_at":85,"faqs":86,"releases":116},645,"aphrodite-engine\u002Faphrodite-engine","aphrodite-engine","Large-scale LLM inference engine","Aphrodite-engine 是一款专注于大规模场景的大语言模型推理引擎，旨在让 HuggingFace 兼容的模型服务更高效、更稳定。面对多用户并发访问时的性能瓶颈，它通过优化显存管理和调度策略，显著提升了推理速度与吞吐量。\n\n其核心亮点在于集成了 vLLM 的 PagedAttention 技术，支持连续批处理与高效的 KV Cache 管理。此外，它还广泛兼容各类量化方案（如 GGUF、AWQ 等），并具备分布式推理、推测解码及多 LoRA 支持能力，能在有限硬件资源下释放更大算力。\n\n这款引擎非常适合开发者、研究人员及企业团队使用。它直接提供 OpenAI 兼容的 API 接口，可无缝对接 SillyTavern 等主流 UI 界面，也支持 Docker 快速部署。无论你是想搭建个人聊天机器人，还是构建高并发的商业 API 服务，Aphrodite-engine 都能提供强劲的后端动力，帮助你在保证速度的同时降低资源成本。","\u003Ch1 align=\"center\">\nBreathing Life into Language\n\u003C\u002Fh1>\n\n\n![aphrodite](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Faphrodite-engine_aphrodite-engine_readme_f04ede7090c7.png)\n\nAphrodite is an inference engine that optimizes the serving of HuggingFace-compatible models at scale. Built on vLLM's Paged Attention technology, it delivers high-performance model inference for multiple concurrent users. 
Developed through a collaboration between [PygmalionAI](https:\u002F\u002Fpygmalion.chat) and [Ruliad](https:\u002F\u002Fruliad.co), Aphrodite serves as the backend engine powering both organizations' chat platforms and API infrastructure.\n\nAphrodite builds upon and integrates the exceptional work from [various projects](#acknowledgements), primarily [vLLM](https:\u002F\u002Fvllm.ai).\n\n## Features\n\n- Continuous Batching\n- Efficient K\u002FV management with [PagedAttention](https:\u002F\u002Fvllm.ai) from vLLM\n- Optimized CUDA kernels for improved inference\n- Quantization support via [AQLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.06118), [AutoRound](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.05516), [AWQ](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.00978), [BitNet](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.11453), [Bitsandbytes](https:\u002F\u002Farxiv.org\u002Fabs\u002F2208.07339), [EETQ](https:\u002F\u002Fgithub.com\u002FNetEase-FuXi\u002FEETQ), [GGUF](https:\u002F\u002Fgithub.com\u002Fggml-org\u002Fllama.cpp), [GPTQ](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.17323), [QuIP#](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.04396), [SqueezeLLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.07629), [Marlin](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.11743), FP2-FP12 [[1]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.14112) [[2]](https:\u002F\u002Fdocs.nvidia.com\u002Fdeeplearning\u002Ftransformer-engine\u002Fuser-guide\u002Fexamples\u002Ffp8_primer.html) [[3]](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Fintroducing-nvfp4-for-efficient-and-accurate-low-precision-inference\u002F), [NVIDIA ModelOpt](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-Model-Optimizer), [TorchAO](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao), [VPTQ](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.17066), [compressed_tensors](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fllm-compressor), 
[MXFP4](https:\u002F\u002Fhuggingface.co\u002Fblog\u002FRakshitAralimatti\u002Flearn-ai-with-me), and more.\n- Distributed inference\n- 8-bit KV Cache for higher context lengths and throughput, in both FP8 E5M2 and E4M3 formats\n- Support for modern samplers such as DRY, XTC, Mirostat, and more\n- Disaggregated inference\n- Speculative decoding\n- Multimodal support\n- Multi-LoRA support\n\n\n## Quickstart\n\nInstall the engine:\n```sh\npip install -U aphrodite-engine\n```\n\n> [!TIP]\n> You will need to install the kernels separately. See the [installation guide](https:\u002F\u002Faphrodite.pygmalion.chat\u002Finstallation\u002Finstallation\u002F) for more details. Running Aphrodite without the kernels will also print the installation instructions.\n\nThen launch a model:\n\n```sh\naphrodite run Qwen\u002FQwen3-0.6B\n```\n\nIf you're not serving at scale, you can append the `--single-user-mode` flag to limit memory usage.\n\nThis will create an [OpenAI](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fapi-reference\u002F)-compatible API server that can be accessed on localhost at port 2242. You can plug the API into a UI that supports OpenAI, such as [SillyTavern](https:\u002F\u002Fgithub.com\u002FSillyTavern\u002FSillyTavern).\n\nPlease refer to the [documentation](https:\u002F\u002Faphrodite.pygmalion.chat) for the full list of arguments and flags you can pass to the engine, or simply run `aphrodite run -h`.\n\nYou can play around with the engine in the demo here:\n\n[![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FAlpinDale\u002Fmisc-scripts\u002Fblob\u002Fmain\u002FAphrodite.ipynb)\n\n### Docker\n\nAdditionally, we provide a Docker image for easy deployment. 
Here's a basic command to get you started:\n\n```sh\ndocker run --runtime nvidia --gpus all \\\n    -v ~\u002F.cache\u002Fhuggingface:\u002Froot\u002F.cache\u002Fhuggingface \\\n    -p 2242:2242 \\\n    --ipc=host \\\n    alpindale\u002Faphrodite-openai:latest \\\n    --model NousResearch\u002FMeta-Llama-3.1-8B-Instruct \\\n    --tensor-parallel-size 8 \\\n    --api-key \"sk-empty\"\n```\n\nTo restrict the container to specific GPUs, add e.g. `--env \"CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7\"` to the command.\n\nThis will pull the Aphrodite Engine image and launch the engine with the Llama-3.1-8B-Instruct model at port 2242.\n\n## Requirements\n\n- Operating System: Linux, Windows (WSL2)\n- Python: 3.9 to 3.12\n\n#### Build Requirements:\n- CUDA >= 12\n\nFor supported devices, see [here](https:\u002F\u002Faphrodite.pygmalion.chat\u002Fpages\u002Fquantization\u002Fsupport-matrix.html). Generally speaking, all semi-modern GPUs are supported, down to Pascal (GTX 10xx, P40, etc.). We also support AMD GPUs, Intel CPUs and GPUs, Google TPU, and AWS Inferentia.\n\n\n\n\n### Notes\n\n1. By design, Aphrodite takes up 90% of your GPU's VRAM. If you're not serving an LLM at scale, you may want to limit the amount of memory it takes up. You can do this in the API example by launching the server with `--gpu-memory-utilization 0.6` (0.6 means 60%), or with `--single-user-mode` to only allocate as much memory as needed for a single sequence.\n\n2. You can view the full list of commands by running `aphrodite run --help`.\n\n## Acknowledgements\nAphrodite Engine would not have been possible without the phenomenal work of other open-source projects. 
A (non-exhaustive) list:\n- [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm)\n- [TensorRT-LLM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM)\n- [xFormers](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fxformers)\n- [Flash Attention](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention)\n- [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp)\n- [AutoAWQ](https:\u002F\u002Fgithub.com\u002Fcasper-hansen\u002FAutoAWQ)\n- [AutoGPTQ](https:\u002F\u002Fgithub.com\u002FPanQiWei\u002FAutoGPTQ)\n- [SqueezeLLM](https:\u002F\u002Fgithub.com\u002FSqueezeAILab\u002FSqueezeLLM\u002F)\n- [Exllamav2](https:\u002F\u002Fgithub.com\u002Fturboderp\u002Fexllamav2)\n- [TabbyAPI](https:\u002F\u002Fgithub.com\u002Ftheroyallab\u002FtabbyAPI)\n- [AQLM](https:\u002F\u002Fgithub.com\u002FVahe1994\u002FAQLM)\n- [KoboldAI](https:\u002F\u002Fgithub.com\u002Fhenk717\u002FKoboldAI)\n- [Text Generation WebUI](https:\u002F\u002Fgithub.com\u002Foobabooga\u002Ftext-generation-webui)\n- [Megatron-LM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM)\n- [Ray](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fray)\n\n### Sponsors\nPast and present, in alphabetical order:\n\n- [Arc Compute](https:\u002F\u002Fwww.arccompute.io\u002F)\n- [Lium](https:\u002F\u002Flium.io)\n- [Prime Intellect](https:\u002F\u002Fwww.primeintellect.ai\u002F)\n- [PygmalionAI](https:\u002F\u002Fpygmalion.chat)\n- [Ruliad AI](https:\u002F\u002Fruliad.ai)\n\n\n## Contributing\nEveryone is welcome to contribute. 
You can support the project by opening Pull Requests for new features, fixes, or general UX improvements.\n","\u003Ch1 align=\"center\">\n赋予语言生命\n\u003C\u002Fh1>\n\n\n![aphrodite](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Faphrodite-engine_aphrodite-engine_readme_f04ede7090c7.png)\n\nAphrodite 是一个推理引擎 (inference engine)，旨在大规模优化兼容 HuggingFace 模型的部署。基于 vLLM 的 Paged Attention（分页注意力）技术构建，它为多用户并发提供高性能模型推理服务。由 [PygmalionAI](https:\u002F\u002Fpygmalion.chat) 和 [Ruliad](https:\u002F\u002Fruliad.co) 合作开发，Aphrodite 作为后端引擎，为两家组织的聊天平台和 API (应用程序接口) 基础设施提供支持。\n\nAphrodite 建立在并整合了 [各种项目](#acknowledgements) 的卓越工作之上，主要是 [vLLM](https:\u002F\u002Fvllm.ai)。\n\n## 功能特性\n\n- 连续批处理 (Continuous Batching)\n- 利用来自 vLLM 的 [PagedAttention](https:\u002F\u002Fvllm.ai) 进行高效的键值对 (K\u002FV) 管理\n- 优化的 CUDA 内核以提升推理性能\n- 支持通过 [AQLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.06118), [AutoRound](https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.05516), [AWQ](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.00978), [BitNet](https:\u002F\u002Farxiv.org\u002Fabs\u002F2310.11453), [Bitsandbytes](https:\u002F\u002Farxiv.org\u002Fabs\u002F2208.07339), [EETQ](https:\u002F\u002Fgithub.com\u002FNetEase-FuXi\u002FEETQ), [GGUF](https:\u002F\u002Fgithub.com\u002Fggml-org\u002Fllama.cpp), [GPTQ](https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.17323), [QuIP#](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.04396), [SqueezeLLM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.07629), [Marlin](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.11743), FP2-FP12 [[1]](https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.14112) [[2]](https:\u002F\u002Fdocs.nvidia.com\u002Fdeeplearning\u002Ftransformer-engine\u002Fuser-guide\u002Fexamples\u002Ffp8_primer.html) [[3]](https:\u002F\u002Fdeveloper.nvidia.com\u002Fblog\u002Fintroducing-nvfp4-for-efficient-and-accurate-low-precision-inference\u002F), [NVIDIA ModelOpt](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-Model-Optimizer), 
[TorchAO](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fao), [VPTQ](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.17066), [compressed_tensors](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fllm-compressor), [MXFP4](https:\u002F\u002Fhuggingface.co\u002Fblog\u002FRakshitAralimatti\u002Flearn-ai-with-me) 等多种方案提供量化 (Quantization) 支持。\n- 分布式推理 (Distributed inference)\n- 8-bit KV Cache（键值缓存），支持更高的上下文长度和吞吐量，包括 FP8 E5M2 和 E4M3 格式\n- 支持现代采样器 (samplers)，如 DRY, XTC, Mirostat 等\n- 解耦推理 (Disaggregated inference)\n- 推测解码 (Speculative decoding)\n- 多模态 (Multimodal) 支持\n- 多 LoRA 支持\n\n\n## 快速开始\n\n安装引擎：\n```sh\npip install -U aphrodite-engine\n```\n\n> [!TIP]\n> 您需要单独安装内核 (kernels)。请参阅 [安装指南](https:\u002F\u002Faphrodite.pygmalion.chat\u002Finstallation\u002Finstallation\u002F) 了解更多详情。即使在不安装内核的情况下运行 Aphrodite，也会提示您安装说明。\n\n然后启动一个模型：\n\n```sh\naphrodite run Qwen\u002FQwen3-0.6B\n```\n\n如果您不是进行大规模部署，可以附加 `--single-user-mode` 标志来限制内存使用。\n\n这将创建一个与 [OpenAI](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fapi-reference\u002F) 兼容的 API 服务器，可通过 localhost 的 2242 端口访问。您可以将 API 接入支持 OpenAI 的 UI，例如 [SillyTavern](https:\u002F\u002Fgithub.com\u002FSillyTavern\u002FSillyTavern)。\n\n请查阅 [文档](https:\u002F\u002Faphrodite.pygmalion.chat) 获取可传递给引擎的参数和标志完整列表，或直接运行 `aphrodite run -h` 查看参数列表。\n\n您可以在这里尝试引擎演示：\n\n[![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FAlpinDale\u002Fmisc-scripts\u002Fblob\u002Fmain\u002FAphrodite.ipynb)\n\n### Docker\n\n此外，我们提供了一个 Docker 镜像以便轻松部署。这是一个基本命令供您开始：\n\n```sh\ndocker run --runtime nvidia --gpus all \\\n    -v ~\u002F.cache\u002Fhuggingface:\u002Froot\u002F.cache\u002Fhuggingface \\\n    -p 2242:2242 \\\n    --ipc=host \\\n    alpindale\u002Faphrodite-openai:latest \\\n    --model NousResearch\u002FMeta-Llama-3.1-8B-Instruct \\\n    --tensor-parallel-size 8 \\\n    --api-key 
\"sk-empty\"\n```\n\n这将拉取 Aphrodite Engine 镜像，并在 2242 端口启动带有 Llama-3.1-8B-Instruct 模型的引擎。\n\n## 系统要求\n\n- 操作系统：Linux, Windows (WSL2)\n- Python：3.9 至 3.12\n\n#### 构建要求：\n- CUDA >= 12\n\n关于支持的设备，请参见 [此处](https:\u002F\u002Faphrodite.pygmalion.chat\u002Fpages\u002Fquantization\u002Fsupport-matrix.html)。一般来说，所有半现代 GPU 都受支持——低至 Pascal (GTX 10xx, P40 等)。我们也支持 AMD GPU、Intel CPU 和 GPU、Google TPU 以及 AWS Inferentia。\n\n\n\n\n### 注意事项\n\n1. 设计上，Aphrodite 会占用您 GPU 显存 (VRAM) 的 90%。如果您不是大规模部署大语言模型 (LLM)，您可能希望限制其占用的内存量。您可以在 API 示例中通过启动服务器时添加 `--gpu-memory-utilization 0.6`（0.6 表示 60%），或使用 `--single-user-mode` 仅分配单个序列所需的内存来实现这一点。\n\n2. 您可以通过运行 `aphrodite run --help` 查看所有命令列表。\n\n## 致谢\nAphrodite Engine 的实现离不开其他开源项目的杰出工作。以下是一份（非详尽）列表：\n- [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm)\n- [TensorRT-LLM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FTensorRT-LLM)\n- [xFormers](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fxformers)\n- [Flash Attention](https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention)\n- [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp)\n- [AutoAWQ](https:\u002F\u002Fgithub.com\u002Fcasper-hansen\u002FAutoAWQ)\n- [AutoGPTQ](https:\u002F\u002Fgithub.com\u002FPanQiWei\u002FAutoGPTQ)\n- [SqueezeLLM](https:\u002F\u002Fgithub.com\u002FSqueezeAILab\u002FSqueezeLLM\u002F)\n- [Exllamav2](https:\u002F\u002Fgithub.com\u002Fturboderp\u002Fexllamav2)\n- [TabbyAPI](https:\u002F\u002Fgithub.com\u002Ftheroyallab\u002FtabbyAPI)\n- [AQLM](https:\u002F\u002Fgithub.com\u002FVahe1994\u002FAQLM)\n- [KoboldAI](https:\u002F\u002Fgithub.com\u002Fhenk717\u002FKoboldAI)\n- [Text Generation WebUI](https:\u002F\u002Fgithub.com\u002Foobabooga\u002Ftext-generation-webui)\n- [Megatron-LM](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FMegatron-LM)\n- [Ray](https:\u002F\u002Fgithub.com\u002Fray-project\u002Fray)\n\n### 赞助者\n过去和现在的支持者，按字母顺序排列：\n\n- [Arc Compute](https:\u002F\u002Fwww.arccompute.io\u002F)\n- 
[Lium](https:\u002F\u002Flium.io)\n- [Prime Intellect](https:\u002F\u002Fwww.primeintellect.ai\u002F)\n- [PygmalionAI](https:\u002F\u002Fpygmalion.chat)\n- [Ruliad AI](https:\u002F\u002Fruliad.ai)\n\n\n## 贡献\n欢迎每个人贡献。您可以通过提交新功能、修复或一般用户体验改进的 Pull Requests 来支持该项目。","# Aphrodite Engine 快速上手指南\n\nAphrodite 是一个高性能的推理引擎，专为大规模部署兼容 HuggingFace 的模型而优化。它基于 vLLM 的 PagedAttention 技术构建，支持多用户并发、多种量化格式及分布式推理，可作为 Chat 平台或 API 的基础后端。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n- **操作系统**: Linux 或 Windows (需使用 WSL2)\n- **Python 版本**: 3.9 至 3.12\n- **CUDA 版本**: >= 12\n- **硬件支持**: \n  - 主要支持 NVIDIA GPU (从 Pascal\u002FGTX 10xx 系列起)\n  - 同时也支持 AMD GPU、Intel CPU\u002FGPU、Google TPU 及 AWS Inferentia\n\n> **注意**: Aphrodite 默认会占用 GPU 显存的 90%。如果您不是进行大规模服务，建议通过参数限制显存占用。\n\n## 安装步骤\n\n### 方式一：直接安装 (推荐)\n\n使用 pip 安装核心引擎：\n\n```sh\npip install -U aphrodite-engine\n```\n\n> [!TIP]\n> **重要提示**: 您需要单独安装算子内核 (kernels)。如果不安装内核，运行时会提示相关安装说明。具体请参考官方安装文档。\n\n### 方式二：Docker 部署\n\n如果您希望快速启动且避免环境冲突，可以使用提供的 Docker 镜像：\n\n```sh\ndocker run --runtime nvidia --gpus all \\\n    -v ~\u002F.cache\u002Fhuggingface:\u002Froot\u002F.cache\u002Fhuggingface \\\n    -p 2242:2242 \\\n    --ipc=host \\\n    alpindale\u002Faphrodite-openai:latest \\\n    --model NousResearch\u002FMeta-Llama-3.1-8B-Instruct \\\n    --tensor-parallel-size 8 \\\n    --api-key \"sk-empty\"\n```\n\n此命令将拉取镜像并启动 Llama-3.1-8B-Instruct 模型服务，监听端口为 2242。\n\n## 基本使用\n\n安装完成后，您可以直接在命令行启动模型服务。\n\n### 启动模型\n\n运行以下命令加载模型（示例使用 Qwen 小模型）：\n\n```sh\naphrodite run Qwen\u002FQwen3-0.6B\n```\n\n### 单用户模式\n\n如果您仅在本地测试或非大规模服务场景下使用，建议添加 `--single-user-mode` 标志以限制内存占用：\n\n```sh\naphrodite run Qwen\u002FQwen3-0.6B --single-user-mode\n```\n\n### 显存控制\n\n如需更精细地控制显存使用比例，可使用 `--gpu-memory-utilization` 参数（例如设置为 0.6 表示占用 60% 显存）：\n\n```sh\naphrodite run Qwen\u002FQwen3-0.6B --gpu-memory-utilization 0.6\n```\n\n### API 接入\n\n启动成功后，Aphrodite 会在本地创建一个兼容 [OpenAI](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fapi-reference\u002F) 的 API 服务器，默认监听端口为 **2242**。\n\n您可以将 API 
地址配置到支持 OpenAI 协议的 UI 工具中（如 SillyTavern），或直接调用 API 接口进行测试。\n\n如需查看完整的启动参数和标志列表，可运行：\n\n```sh\naphrodite run -h\n```","某电商客服团队正在构建智能问答系统，旨在大促期间高效处理海量用户咨询。他们选择了开源大模型进行私有化部署，但面临性能瓶颈。\n\n### 没有 aphrodite-engine 时\n- 传统推理框架在并发高峰时显存碎片严重，导致 OOM 崩溃频繁，服务不可用。\n- 首字生成延迟高达数秒，用户体验极差，直接影响客户满意度和转化率。\n- 硬件资源浪费严重，单张显卡无法承载预期负载，需堆砌大量服务器增加成本。\n- 集成第三方应用困难，API 格式不统一导致前端适配工作量巨大且易出错。\n\n### 使用 aphrodite-engine 后\n- aphrodite-engine 的 PagedAttention 机制彻底消除显存碎片，单卡吞吐量提升 3 倍以上。\n- 连续批处理技术大幅降低首字延迟，响应速度提升至毫秒级水平，流畅度显著改善。\n- 支持量化与分布式推理，在同等硬件下支撑更多并发用户且整体运营成本大幅降低。\n- 原生兼容 OpenAI API 标准，无需修改代码即可无缝接入 SillyTavern 等现有对话界面。\n\naphrodite-engine 通过极致优化推理引擎，实现了低成本、高并发的企业级大模型服务落地。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Faphrodite-engine_aphrodite-engine_ecb97d82.png","Aphrodite Engine","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Faphrodite-engine_e3dc5c68.png","The Aphrodite Engine Project",null,"https:\u002F\u002Fgithub.com\u002Faphrodite-engine",[20,24,28,32,36,40,43,47,50],{"name":21,"color":22,"percentage":23},"C++","#f34b7d",65.9,{"name":25,"color":26,"percentage":27},"Python","#3572A5",29.5,{"name":29,"color":30,"percentage":31},"Cuda","#3A4E3A",4,{"name":33,"color":34,"percentage":35},"C","#555555",0.4,{"name":37,"color":38,"percentage":39},"CMake","#DA3434",0.1,{"name":41,"color":42,"percentage":39},"Shell","#89e051",{"name":44,"color":45,"percentage":46},"Dockerfile","#384d54",0,{"name":48,"color":49,"percentage":46},"PowerShell","#012456",{"name":51,"color":52,"percentage":46},"Jinja","#a52a22",1684,188,"2026-04-04T12:25:23","AGPL-3.0",3,"Linux, Windows","需要 GPU，CUDA >= 12，支持 NVIDIA (Pascal\u002FGTX 10xx+)、AMD、Intel、TPU，显存占用约 90%","未说明",{"notes":62,"python":63,"dependencies":64},"默认占用 90% 显存，可通过 --gpu-memory-utilization 或 --single-user-mode 调整；需单独安装 CUDA 算子内核；支持多种量化格式；提供 Docker 镜像","3.9 - 
3.12",[6,65,66,67,68],"vllm","ray","torch","transformers",[70,71],"插件","开发框架",[73,74,75,76,77,78,79,80,81,82],"api-rest","inference-engine","machine-learning","cuda","inferentia","rocm","intel","lora","speculative-decoding","tpu","ready","2026-03-27T02:49:30.150509","2026-04-06T06:45:37.411778",[87,92,97,101,106,111],{"id":88,"question_zh":89,"answer_zh":90,"source_url":91},2662,"安装 Aphrodite Engine 耗时过长怎么办？","由于项目包含大量内核以保持高性能，无法直接移除。建议等待切换到 nightly wheels（每个 commit 构建），届时可直接使用预编译包而无需自行构建。目前仍需自行构建，耗时较长。","https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fissues\u002F593",{"id":93,"question_zh":94,"answer_zh":95,"source_url":96},2663,"Docker 镜像启动报错，提示找不到入口脚本路径？","维护者已修复 entrypoint 脚本中的路径问题。请拉取最新的 Docker 镜像以应用修复，旧镜像中可能存在路径指向错误的情况。","https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fissues\u002F310",{"id":98,"question_zh":99,"answer_zh":100,"source_url":96},2664,"Docker 启动时如何避免 Numba 缓存相关问题？","在启动 Docker 容器时，请设置环境变量 `NUMBA_CACHE_DIR=\u002Ftmp\u002Fnumba_cache`。维护者正在构建包含此修复的新镜像，也可手动设置该变量。",{"id":102,"question_zh":103,"answer_zh":104,"source_url":105},2665,"运行时报错 \"Device Side Assertion... 
index out of bounds\" 如何处理？","这通常是由分词器（tokenizer）不匹配导致的。对于基于 Mistral 的模型，请在 Ooba 设置中检查 tokenizer 选项，将其设置为 'mistral' 或 'mixtral'。例如 PiVoT-SOLAR-10.7B 模型需配合 mistral tokenizer 使用。","https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fissues\u002F199",{"id":107,"question_zh":108,"answer_zh":109,"source_url":110},2666,"模型转换时出现 \"GGMLQuantizationType\" 无效值错误？","确保本地代码版本与 pip 安装包版本一致。如果你安装了 v0.5.2 的 pip 包，克隆仓库后应执行 `git checkout v0.5.2` 再运行转换脚本，不要混用不同版本的工具链。","https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fissues\u002F448",{"id":112,"question_zh":113,"answer_zh":114,"source_url":115},2667,"使用 Gemma 模型有哪些已知限制或警告？","Aphrodite 当前暂不完全支持 Gemma 2 的滑动窗口注意力机制（sliding window attention），系统会自动禁用该功能并将最大长度限制为 4096。同时注意 bitsandbytes 量化尚未完全优化，推理速度可能较慢。","https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fissues\u002F623",[117,122,127,132,137,142,147,152,157,162,167,172,177,182,187,192,197,202,207,212],{"id":118,"version":119,"summary_zh":120,"released_at":121},102163,"v0.6.2","## What's Changed\r\n* feat: FP8 quantization support for AMD ROCm by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F729\r\n* feat: add experts_int8 support by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F730\r\n* chore: move update_flash_attn_metadata to attn backend by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F731\r\n* chore: register lora functions as torch ops by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F732\r\n* feat: dynamo support for ScalarType by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F733\r\n* fix: types in AQLM and GGUF for dynamo support by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F736\r\n* fix: 
`custom_ar` check by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F737\r\n* fix: clear engine ref in RPC server by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F738\r\n* fix: use nvml to get consistent device names by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F739\r\n* feat: add Exaone model support by @shing100 in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F743\r\n* fix: minor bug fixes & clean-ups by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F744\r\n* chore: refactor `MultiModalConfig` initialization and profiling by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F745\r\n* chore: various TPU fixes and optimizations by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F746\r\n* fix: metrics endpoint with RPC server by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F747\r\n* chore: refactor llama3 rope by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F748\r\n* feat: add XTC Sampling by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F740\r\n* ci: fix dep install using pnpm by @ahme-dev in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F749\r\n* ci: fix docs deployment by @ahme-dev in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F750\r\n* chore: re-enable custom token bans by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F751\r\n* feat: bring back dynatemp by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F754\r\n* 
feat: quant_llm support by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F755\r\n* fix: add pandas to requirements by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F756\r\n* docs: update readme and quant docs by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F757\r\n* ci: bump version to 0.6.2 by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F758\r\n\r\n## New Contributors\r\n* @shing100 made their first contribution in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F743\r\n* @ahme-dev made their first contribution in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F749\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fcompare\u002Fv0.6.1.post1...v0.6.2","2024-09-22T01:51:34",{"id":123,"version":124,"summary_zh":125,"released_at":126},102164,"v0.6.1.post1","## What's Changed\n* chore: register custom torch ops for flash-attn and flashinfer by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F724\n* feat: launch API server with uvloop by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F725\n* chore: fix return statement in `Detokenizer.decode_sequence_inplace` by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F727\n* Fix tensor parallelism, libcudart path for some versions of pytorch by @miku448 in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F726\n* ci: bump to 0.6.1.post1 by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F728\n\n\n**Full Changelog**: 
https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fcompare\u002Fv0.6.1...v0.6.1.post1","2024-09-13T08:09:40",{"id":128,"version":129,"summary_zh":130,"released_at":131},102152,"v0.10.0","## What's Changed\r\n* feat: qwen3-next tool parser by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1512\r\n* [Build] feat: add support for incremental cmake builds by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1515\r\n* chore: cleanup aphrodite FA directory by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1516\r\n* docs: update documentation on adding support for new models by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1517\r\n* fix: multi-node serving with ray by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1518\r\n* chore: migrate whisper to TensorSchema by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1519\r\n* feat: add logging for model parameter count by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1525\r\n* [Attention] feat: add support for Context Parallelism by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1521\r\n* [Model] feat: support BailingMoe V2 by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1527\r\n* [API] chore: separate Kobold API code to its own serving class by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1529\r\n* Revert \"[API] chore: separate Kobold API code to its own serving class\" by @AlpinDale in 
https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1530\r\n* fix: error propagation in chat completions by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1532\r\n* [API] chore: separate Kobold API code to its own serving class by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1531\r\n* [API] chore: remove dead code from the old kobold api module by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1533\r\n* [API] fix: anthropic messages API by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1534\r\n* [build] fix: relax xformers dependency version by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1536\r\n* [PP] fix: Qwen3-Next with Pipeline Parallelism by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1537\r\n* Update readme by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1538\r\n* [Kernel] chore: add tuned kernel configs for BailingMoEV2 by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1542\r\n* [config] fix: set the correct max_model_len with YaRN scaling by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1543\r\n* [API] feat: add lightweight tokenizer-only API server by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1545\r\n* release: v0.10 by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1549\r\n* [build] bump flashinfer to 0.5.0 by @AlpinDale in 
https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1551\r\n* [API] feat: add model management endpoints for loading and unloading models by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1553\r\n* [core] feat: enable dynamic KV cache allocation by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1552\r\n* fix: quantization import for kimi-linear KDA by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1555\r\n* [API] feat: add multi-model support by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1554\r\n* fix: Kimi-Linear with AWQ quants by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1556\r\n* ci: make gemini PR reviews less verbose by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1557\r\n* fix: avoid GPU-CPU sync in MTP by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1558\r\n* [kernel] fix: use the same H200 config for both H200 and H200 NVL by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1559\r\n* [API] fix: task log when multi-model is not enabled by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1560\r\n* fix: ensure model_registry is not empty before accessing models in OpenAIServing by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1561\r\n* [ci] chore: update pre-commit scripts by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1562\r\n* [ci] chore: make all pre-commit checks pass by @AlpinDale in 
https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1563\r\n* [core]: update `cu_num_accepted_tokens` for all `req_index` by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1564\r\n* [API] feat: enable DP-aware routing in OAI requests by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1565\r\n* [logger] fix: don't record sleep mode logs when not in dev mode by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1566\r\n* [dist","2025-11-08T13:51:05",{"id":133,"version":134,"summary_zh":135,"released_at":136},102153,"v0.9.1","## What's Changed\n* feat: implement Motif model by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1439\n* [V1] feat: support fp8 kv on ampere through flashinfer by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1440\n* fix: enable Kobold API by default by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1441\n* feat: add DeepConf sampling by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1442\n* build: add option to disable flash attn compilation by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1443\n* [PP] feat: optimize PP performance through throttling by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1447\n* chore: sync to upstream 4071c76 by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1448\n* feat: support encoder data parallel for MiniCPM-V by @AlpinDale in https:\u002F\u002Fgithub.com\u002Faphrodite-engine\u002Faphrodite-engine\u002Fpull\u002F1449\n* feat: enable 
LoRA support for DeepSeek models by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1450
* fix: add `reorder_batch` to `AttentionMetadataBuilder` by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1451
* chore: move out `freezing_value`/`cuda_event` init outside try/finally block by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1452
* chore: better type hint for rearrange return value in eplbstate by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1453
* chore: faster LoRA startup by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1454
* fix: `truncate_prompt_tokens` type hint by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1455
* feat: allow passing `multi_modal_uuids` as multimodal identifiers by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1456
* fix: vocab_size check by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1457
* feat: support KV events from connectors by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1458
* fix: AutoGPTQ Qwen3-MoE models by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1459
* fix: avoid redundant copy for enc-only models by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1460
* fix: loading GPTQ 3-bit models by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1461
* chore: move fast prefill logic to a separate method by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1462
* fix: update constructors and type hints for multimodal input handling by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1463
* chore: update import for torch inductor configuration in CPU model runner by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1464
* chore: migrate Phi4 inputs to TensorSchema by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1465
* feat: IO processor plugin for pooling models by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1466
* fix: add support for `<tool_call>` format in streaming mode for xlam by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1467
* fix: allow FP16 inference on Turing and below by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1468
* build: upgrade DeepGEMM by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1469
* feat: Gemma3n audio endpoint support by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1470
* feat: support KeyeVL-1.5 model by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1471
* chore: minor code simplification for spec decode by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1472
* fix: IO processor plugin fixes by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1473
* feat: support DP for Kimi-VL ViT by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1474
* chore: move logprob to a separate module by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1475
* [kernel] feat: support FP32 for mamba by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1476
* fix: blip2 inference by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1477
* chore: remove runtime checks for pooling params by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1478
* chore: migrate Ovis to TensorSchema by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1479
* feat: add online FP8 support for XPU by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1480
* chore: migrate interns to TensorSchema by @AlpinDale in https://github.com/aphr

*Released: 2025-09-10*

# v0.9.0

It has been a long time. There have been many, many changes between this release and v0.6.7.
I'll try to summarize the most important ones, but I'll likely miss quite a lot.

### New Models
- [AIMv2](https://huggingface.co/apple/aimv2-large-patch14-224)
- [Arcee](https://huggingface.co/arcee-ai/AFM-4.5B)
- [Aria](https://huggingface.co/rhymes-ai/Aria)
- [Aya Vision](https://huggingface.co/CohereLabs/aya-vision-8b)
- [BaiLing](https://huggingface.co/inclusionAI/Ling-plus)
- [Bamba](https://huggingface.co/ibm-ai-platform/Bamba-9B-v2)
- [BertModel](https://huggingface.co/BAAI/bge-base-en-v1.5) (encoder-only embedding)
- [Command A Vision](https://huggingface.co/CohereLabs/command-a-vision-07-2025)
- [CLIP Text Model](https://huggingface.co/openai/clip-vit-large-patch14)
- [DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-R1)
- [dots.llm1](https://huggingface.co/rednote-hilab/dots.llm1.inst)
- [Ernie-4.5](https://huggingface.co/baidu/ERNIE-4.5-0.3B-Base-PT)
- [Ernie-4.5 MoE](https://huggingface.co/baidu/ERNIE-4.5-300B-A47B-PT)
- [EXAONE-4](https://huggingface.co/LGAI-EXAONE/EXAONE-4.0-32B)
- [Falcon-H1](https://huggingface.co/tiiuae/Falcon-H1-34B-Instruct)
- [Florence-2](https://huggingface.co/microsoft/Florence-2-large)
- [Gemma-3](https://huggingface.co/google/gemma-3-27b-it)
- [Gemma-3n](https://huggingface.co/google/gemma-3n-E4B-it)
- [GLM-4](https://huggingface.co/zai-org/glm-4-9b-chat-1m)
- [GLM-4V](https://huggingface.co/zai-org/glm-4v-9b)
- [GLM-4.1V](https://huggingface.co/zai-org/GLM-4.1V-9B-Thinking)
- [GLM-4.5](https://huggingface.co/zai-org/GLM-4.5)
- [GPT-Oss](https://huggingface.co/openai/gpt-oss-120b)
- [Granite Speech](https://huggingface.co/ibm-granite/granite-speech-3.3-8b)
- [Granite MoE Hybrid](https://huggingface.co/ibm-granite/granite-4.0-tiny-preview)
- [Granite MoE Shared](https://huggingface.co/ibm-research/moe-7b-1b-active-shared-experts)
- [GritLM](https://huggingface.co/GritLM/GritLM-7B)
- [Grok-1](https://huggingface.co/hpcai-tech/grok-1)
- [H2O-VL](https://huggingface.co/h2oai/h2ovl-mississippi-2b)
- [Hunyuan V1](https://huggingface.co/tencent/Hunyuan-A13B-Instruct)
- [HyperCLOVA X SEED](https://huggingface.co/naver-hyperclovax/HyperCLOVAX-SEED-Think-14B)
- [Idefics3](https://huggingface.co/HuggingFaceM4/Idefics3-8B-Llama3)
- [Mono-InternVL](https://huggingface.co/OpenGVLab/Mono-InternVL-2B)
- [Intern-S1](https://huggingface.co/internlm/Intern-S1)
- [JinaVL Reranker](https://huggingface.co/jinaai/jina-reranker-m0)
- [Keye](https://huggingface.co/Kwai-Keye/Keye-VL-8B-Preview)
- [Kimi-VL](https://huggingface.co/moonshotai/Kimi-VL-A3B-Thinking-2506)
- [Llama-4](https://huggingface.co/meta-llama/Llama-4-Maverick-17B-128E-Instruct)
- [Mamba-2](https://huggingface.co/state-spaces/mamba-2.8b-hf)
- [MiMo](https://huggingface.co/XiaomiMiMo/MiMo-7B-SFT)
- [MiniCPM-O](https://huggingface.co/openbmb/MiniCPM-o-2_6)
- [MiniMax-M1](https://huggingface.co/MiniMaxAI/MiniMax-M1-80k)
- [MiniMax-VL](https://huggingface.co/MiniMaxAI/MiniMax-VL-01)
- [Mistral-3](https://huggingface.co/mistralai/Mistral-Small-3.2-24B-Instruct-2506)
- [ModernBERT](https://huggingface.co/answerdotai/ModernBERT-base)
- [Nemotron-H](https://huggingface.co/nvidia/Nemotron-H-8B-Reasoning-128K-FP8)
- [Nemotron-NAS (Super)](https://huggingface.co/nvidia/Llama-3_3-Nemotron-Super-49B-v1_5)
- [Nemotron-VL](https://huggingface.co/nvidia/Llama-3.1-Nemotron-Nano-VL-8B-V1)
- [OLMo-2](https://huggingface.co/allenai/OLMo-2-0325-32B-Instruct)
- [Ovis](https://huggingface.co/AIDC-AI/Ovis1.5-Llama3-8B)
- [Ovis-2](https://huggingface.co/AIDC-AI/Ovis2-16B)
- [Phi-4-MM](https://huggingface.co/microsoft/Phi-4-multimodal-instruct)
- [Phi-4 Flash](https://huggingface.co/microsoft/Phi-4-mini-flash-reasoning)
- [Plamo-2](https://huggingface.co/pfnet/plamo-2-1b)
- [Prithvi GeoSpatial Model](https://huggingface.co/ibm-nasa-geospatial/Prithvi-EO-2.0-300M)
- [Qwen2.5 Omni Thinker](https://huggingface.co/Qwen/Qwen2.5-Omni-7B)
- [Qwen2.5 VL](https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct)
- [Qwen2 Audio](https://huggingface.co/Qwen/Qwen2-Audio-7B)
- [Qwen3](https://huggingface.co/Qwen/Qwen3-4B-Instruct-2507)
- [Qwen3 MoE](https://huggingface.co/Qwen/Qwen3-Coder-480B-A35B-Instruct)
- [XLM Roberta](https://huggingface.co/intfloat/multilingual-e5-large)
- [Skywork-R1V](https://huggingface.co/Skywork/Skywork-R1V-38B)
- [Smol-VLM](https://huggingface.co/HuggingFaceTB/SmolVLM-Instruct)
- [Step-3](https://huggingface.co/stepfun-ai/step3)
- [Tarsier](https://huggingface.co/omni-research/Tarsier-7b)
- [TeleChat-2](https://huggingface.co/Tele-AI/TeleChat2-3B)
- [Tele-FLM](https://huggingface.co/CofeAI/Tele-FLM)
- [Voxtral](https://huggingface.co/mistralai/Voxtral-Mini-3B-2507)
- [Whisper](https://huggingface.co/openai/whisper-large-v3)
- [Zamba2](https://huggingface.co/Zyphra/Zamba2-7B-Instruct)

#### Transformers backend
You can now load any unsupported model using

*Released: 2025-08-24*

# v0.6.7

## What's Changed
* Core: add output streaming support to multi-step + async by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1112
* tests: update scheduler tests by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1113
* (1/N) XQA: integrate the XQA CUDA kernels within Aphrodite by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1115
* chore: support loading weights by ID within models by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1116
* chore: expose phi3_v num_crops as an mm_processor_kwargs by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1117
* fix: unsafe all-reduce sync by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1118
* kernels: split marlin kernels for faster compile, fix MoE, temporarily remove HQQ by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1119
* LLM: enable batched inference for llm.chat() API by @AlpinDale in
https://github.com/aphrodite-engine/aphrodite-engine/pull/1120
* Quantization: re-enable Marlin serialization for AWQ quants by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1121
* fix: torch.compile dynamo fix by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1122
* chore: bump bitsandbytes version to latest; enable cuda graphs for 4bit bnb by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1123
* (1/N) Triton Backend: integrate Triton layernorm kernels by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1125
* (2/N) Triton Backend: integrate Triton activation kernels by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1126
* chore: remove trailing whitespaces by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1128
* chore: support prompt_logprobs with speculative decoding by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1129
* feat: add Priority-based Scheduling by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1130
* API: use heartbeats instead of health checks by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1131
* kernel: fix custom all-reduce kernel compilation on Pascal GPUs by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1132
* fix: propagate trust_remote_code in InternVL and MiniCPM-V by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1133
* fix: load fully-connected layer bias for EAGLE models by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1134
* API: propagate usage accounting to FastAPI middleware layer by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1135
* fix: ray 2.9.x does not expose available_resources_per_node by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1136
* fix: multi-step scheduling with InternVL by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1137
* chore: support FP8 MoE for compressed-tensors by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1138
* model: add support for Mllama (Llama 3.2) models by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1139
* fix: quantization for Mllama models by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1140
* fix: include encoder prompt len to non-stream api usage response by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1141
* fix: downgrade logger.warning for BOS fallback to print_warning_once by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1142
* fix: only set tool_choice to auto if at least one tool is provided by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1143
* API: add tool calling support for Llama 3.1 and 3.2 by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1144
* fix: batched inference with fuyu by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1145
* TPU: support Trillium by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1146
* torch.compile: use empty tensor instead of None for profiling by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1147
* kernel: Integrate asymmetric quantization for INT8 activations by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1148
* core: add support for chunked prefill + multi-step scheduling by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1149
* distributed: add env var to force custom all-reduce by skipping p2p check by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1150
* chore: add priority scheduling to async engine by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1

*Released: 2025-03-07*

# v0.6.2.post1

## What's Changed
* fix: kobold api for horde by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/763
* Fix for a crash from token bans by @Pyroserenus in https://github.com/PygmalionAI/aphrodite-engine/pull/764
* Modified throughput benchmark to allow --max-num-seqs by @Pyroserenus in https://github.com/PygmalionAI/aphrodite-engine/pull/770
* Simplify construction of sampling_metadata by @50h100a in https://github.com/PygmalionAI/aphrodite-engine/pull/766
* Add OLMoE by @fizzAI in https://github.com/PygmalionAI/aphrodite-engine/pull/772
* feat: ministral support by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/776
* Make amd usable by @Naomiusearch in
https://github.com/PygmalionAI/aphrodite-engine/pull/775
* docker: apply AMD patch in the dockerfile by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/777
* fix: demote skip_special_tokens assertion to logger error by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/778
* ci: bump version to 0.6.2.post1 by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/779

## New Contributors
* @fizzAI made their first contribution in https://github.com/PygmalionAI/aphrodite-engine/pull/772

**Full Changelog**: https://github.com/PygmalionAI/aphrodite-engine/compare/v0.6.2...v0.6.2.post1

*Released: 2024-10-16*

# v0.6.6

## What's Changed
* distributed: support pipeline parallelism for internvl and internlm2 by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/965
* tpu: add support for async postprocessing by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/968
* fix: prometheus.yaml path in monitoring example by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/969
* tpu: support single and multi-host TPUs on GKE and RayServe by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/970
* vlm: add tensor parallel support for vision transformer models by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/971
* tests: update internvl test for #971 by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/972
* vlm: increase the default `max_num_batched_tokens` for multimodal models by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/973
* core: fix chunked prefill not being enabled by default for long contexts by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/974
* tpu: fix TPU type api by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/975
* fix: modelscope for VLMs by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/976
* fix: crash when cancelling a request with multi-step by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/977
* models: add support for IBM Granite (PowerLM) models by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/978
* tpu: align worker index with node boundary by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/979
* fix: InternLM2 model with Tensor Parallel by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/980
* core: slightly improve chunked prefill performance by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/981
* vlm: fallback to SDPA for ViT models on CPU backend by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/982
* core: improve async postproc + multi-step performance by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/983
* fix: raise exception when accessing logger for disable_log_stats=True case by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/984
* chore: rename `task_handler` to `worker` by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/985
* tpu: fix outputs by correcting the next_token_ids shape by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/986
* quants: add GPTQ and FBGEMM to AphroditeParameters by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/987
* benchmarks: add `--async-engine` arg to throughput benchmark by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/988
* tpu: use XLA rank for persistent cache path by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/989
* vlm: support multiple audios per prompt for Ultravox by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/990
* vlm: fix siglip layernorm and paligemma weight loading by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/991
* vlm: enable multimodal inputs for the LLM class by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/992
* api: implement OpenAI-compatible tools API for Hermes/Mistral models by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/993
* neuron: add 8bit quantization for Neuron by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/994
* models: add support for QwenVL by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/995
* fix: gptq_marlin exception on older GPUs by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/996
* chore: use `ray[adag]` dep instead of cuda by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/997
* quants: improve awq_triton throughput by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/998
* fix: hermes tool call chat template by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/999
* core: fix async postprocessor in case of preemption by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1000
* vlm: add multi-input support for LLaVA and InternVL models by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1002
* tools: fix tool calls to more strictly follow OpenAI format by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1003
* fix: LoRA support for Cohere and Jamba models by @AlpinDale in https://github.com/aphrodite-engine/aphrodite-engine/pull/1004
* spec decode: move ops.advane_step to flash attention backend by @AlpinDale in https://github.com/aphrodite-engine/aphro

*Released: 2025-01-27*

# v0.6.5

## What's Changed
* xpu: refactor XPU worker & executor by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/861
* build: add jinja2 to requirements file by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/862
* attention: add `AttentionState` abstraction by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/863
* xpu: disable punica kernels for XPU by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/864
* executor: pipe `worker_class_fn` arg in executor by @AlpinDale in
https://github.com/PygmalionAI/aphrodite-engine/pull/865
* server: log the process occupying our port by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/866
* feat: AWQ quantization for InternVL by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/867
* Rewrite DRY sampler to be a lot faster by @50h100a in https://github.com/PygmalionAI/aphrodite-engine/pull/868
* fix: ROCm build by @Naomiusearch in https://github.com/PygmalionAI/aphrodite-engine/pull/817
* fix: temp_last warning being repeated for every output token by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/869
* feat: add support for chunked prefill + prefix caching by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/871
* async: avoid premature exit in the async generator by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/872
* cpu: fix `mm_limits` initialization by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/873
* spec decoding: set the draft model ctxlen to target model by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/874
* sampler: pad dry sequence breakers tensor by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/875
* fix: `add_generation_template` -> `add_generation_prompt` in llm by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/877
* Update README.md by @NoahBPeterson in https://github.com/PygmalionAI/aphrodite-engine/pull/876
* api: fix crashes under very high loads by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/878
* build: pass `PYTHONPATH` from setup.py to cmake by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/879
* async: disable multi-step scheduling for sync engine by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/880
* api: better startup failure UX by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/881
* chore: consolidate environment variables within one file by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/882
* core: fix spec decode metrics and envs circular import by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/889
* feat: add support for audio models by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/891
* distributed: fix issue for when nodes have multiple network interfaces by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/892
* rocm: fix compile issues with rocm 6.2 by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/893
* build: fix invalid path for envs.py in setup by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/894
* kernel: use `cub::BlockReduce` instead of custom impl by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/895
* fix: Phi 3.5 Vision model loading by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/896
* api: add client timeouts for the ZeroMQ server by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/897
* feat: add torch.compile for GemmaRMSNorm by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/898
* spec decode: add support for EAGLE by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/899
* fix: `ShardedStateLoader` with fp8 quant by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/900
* kernel: do not compile machete for cuda 11 and below by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/901
* chore: add AphroditeParameter support for FP8 quant by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/902
* spec decode: fix logprobs when using speculative decoding by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/904
* api: error suppression cleanup + timeout suppression on aborts by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/905
* ray: better error when placement group topology is incorrect by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/906
* xpu: refactor the model runner for tensor parallelism by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/910
* fix: empty prompt crashing the server by @AlpinDale in https://github.com/PygmalionAI/aphrodite-engine/pull/912
* quantization: update marlin to use `AphroditeParameters` by @AlpinD

*Released: 2024-12-22*

# v0.6.4.post1

## What's Changed
* add linux arm64/aarch64/GH200 installation tips by @qpwo in https://github.com/PygmalionAI/aphrodite-engine/pull/851
* DRY Fix: Add output_tokens to sampler by @selalipop in
https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F849\n* sampler: fix DRY concurrency issue by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F852\n* sampler: add range parameter for DRY by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F855\n* sampler: optimize DRY performance using z-algorithm by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F856\n* sampler: allow parsing sampler order using strings by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F858\n\n## New Contributors\n* @qpwo made their first contribution in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F851\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fcompare\u002Fv0.6.4...v0.6.4.post1","2024-12-03T01:51:42",{"id":168,"version":169,"summary_zh":170,"released_at":171},102159,"v0.6.4","## What's Changed\r\n* frontend: enable kobold api by default by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F803\r\n* feat: add serviceinfo endpoint by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F807\r\n* feat: update to serviceinfo v0.2 by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F808\r\n* Mask dynatemp using min\u002Fmax, rather than exp by @50h100a in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F813\r\n* fix: temperature issues by @50h100a in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F814\r\n* fix: --max-seq-len-to-capture arg by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F818\r\n* [IMPORTANT] updating test units by 
@AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F769\r\n* fix: tokenization api test by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F821\r\n* feat: add chat method for LLM class by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F822\r\n* feat: support chunked prefill with LoRA by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F823\r\n* SPMD optimizations by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F824\r\n* fix: sampler test with new transformers version by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F826\r\n* feat: add cuda sampling kernels for top_k and top_p by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F828\r\n* feat: add metrics for prefix cache hit rate by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F829\r\n* fix: unbound tokenizer error by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F830\r\n* feat: multi-step scheduling by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F831\r\n* feat: Add DRY (Do not Repeat Yourself) sampling by @selalipop in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F827\r\n* feat: add no_repeat_ngram sampler by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F832\r\n* feat: add skew sampling by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F834\r\n* fix: hidden states handling in batch expansion for spec decoding by @AlpinDale in 
https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F839\r\n* chore: refactor executor classes for easier inheritance by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F840\r\n* fix: latency and serving benchmarks by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F841\r\n* feat: Machete Kernels for Hopper GPUs by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F842\r\n* feat: add sampler_priorty by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F837\r\n* fix: disable awq_marlin override for awq models by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F843\r\n* chore: bump mistral_common to 1.5.0 by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F844\r\n* ci: bump version to 0.6.4 by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F845\r\n\r\n## New Contributors\r\n* @dependabot made their first contribution in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F796\r\n* @selalipop made their first contribution in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F827\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fcompare\u002Fv0.6.3...v0.6.4","2024-11-27T07:31:13",{"id":173,"version":174,"summary_zh":175,"released_at":176},102160,"v0.6.3.post1","## What's Changed\r\n* build(deps): bump rollup from 4.21.0 to 4.24.3 in \u002Fdocs by @dependabot in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F796\r\n* fix: compilation of gptq_marlin_gemm object by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F800\r\n* 
ci: bump to 0.6.3.post1 by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F801\r\n\r\n## New Contributors\r\n* @dependabot made their first contribution in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F796\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fcompare\u002Fv0.6.3...v0.6.3.post1","2024-11-02T19:11:43",{"id":178,"version":179,"summary_zh":180,"released_at":181},102161,"v0.6.3","## What's Changed\r\n* Stream models rather than load them completely into RAM. by @50h100a in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F785\r\n* feat: windows support by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F790\r\n* fix: windows wheel url by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F794\r\n* fix: kobold lite embedded UI on windows by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F797\r\n* feat: add HQQ quantization support by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F795\r\n* frontend: minor logging improvements by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F787\r\n* ci: bump version to 0.6.3 by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F799\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fcompare\u002Fv0.6.2.post1...v0.6.3","2024-11-02T13:21:02",{"id":183,"version":184,"summary_zh":185,"released_at":186},102165,"v0.6.1","# Aphrodite Engine - v0.6.1\r\n\r\n## What's Changed\r\n* ci: exclude cu118 from build and add py_limited_api by @AlpinDale in 
https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F639\r\n* fix: better async request cancellation by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F641\r\n* fix: gracefully handle missing chat template by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F642\r\n* chore: deduplicate nvlink check to cuda platform by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F643\r\n* fix: hardcoded float16 in embedding mode check by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F645\r\n* quadratic sampling: separate diff from logits to filter out NaNs. by @50h100a in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F644\r\n* fix: RSLoRA support by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F647\r\n* feat: introduce `BaseAphroditeParameter` by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F646\r\n* fix: move zeromq rpc frontend to IPC instead of TCP by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F652\r\n* fix: input processor in internvl2 by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F653\r\n* fix: multiprocessing timeout by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F654\r\n* fix: GPTQ\u002FAWQ on Colab by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F655\r\n* fix: make `merge_async_iterators.is_cancelled()` optional by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F656\r\n* fix: flashinfer outputs by @AlpinDale in 
https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F657\r\n* fix: max_num_batched_tokens should not be limited for lora by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F658\r\n* fix: lora with pipeline parallel by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F659\r\n* fix: kill api server when pinging dead engine by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F660\r\n* fix: `get_num_blocks_touched` logic by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F661\r\n* chore: update the env.py script and the bug report template by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F662\r\n* feat: add INT8 W8A16 quant for TPU by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F663\r\n* feat: allow serving encoder-decoder models in the API server by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F664\r\n* fix: deps with TPU dockerfile by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F665\r\n* optimization: reduce end-to-end overhead from python obj allocation by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F666\r\n* fix: minor adjustments to scheduler and block manager by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F667\r\n* feat: enable using fp8 kv and prefix caching with chunked prefill by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F668\r\n* fix: mlpspeculator with padded vocab by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F669\r\n* feat: option 
to apply temperature scaling last by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F670\r\n* chore: decouple `should_modify_greedy_probs_inplace` by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F671\r\n* chore: better stream termination in async engine by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F672\r\n* chore: mamba cache single buffer by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F673\r\n* feat: mamba model support by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F674\r\n* fix: reinit procedure in `ModelInputForGPUBuilder` by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F675\r\n* feat: embeddings support for batched OAI endpoint by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F676\r\n* fix: fp8 checkpoints with fused linear modules by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F677\r\n* feat: add numpy implementation of `compute_slot_mapping` by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F678\r\n* fix: chunked prefill with v2 block manager by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F679\r\n* fix: phi3v batch inference with different aspect ratio images by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F680\r\n* chore: use mark_dynamic to reduce TPU compile times by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F681\r\n* chore: bump lmfe to v0.10.6 and include triton for tpu and xpu dockefiles by @AlpinDale in 
https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F682\r\n* refactor: base worker input refactor for multi-step by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F683\r\n* build: add","2024-09-12T03:48:13",{"id":188,"version":189,"summary_zh":190,"released_at":191},102166,"v0.6.0.post1","## What's Changed\n* feat: add siglip encoder for llava family by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F626\n* readme: fix model name typo by @Trapper4888 in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F627\n* feat: multi-image input for minicpmv by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F628\n* feat: Add support for GPU device selection in SpecDecodeBaseSampler by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F629\n* feat: per-tensor token epilogue kernels by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F630\n* chore: optimize evictor v2 performance by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F631\n* feat: initial encoder-decoder support with BART model by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F633\n* fix: default api port and attention selector by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F634\n* fix: clean up incorrect log in worker by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F636\n* bump to v0.6.0.post1 by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F635\n\n## New Contributors\n* @Trapper4888 made their first contribution in 
https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F627\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fcompare\u002Fv0.6.0...v0.6.0.post1","2024-09-06T05:08:24",{"id":193,"version":194,"summary_zh":195,"released_at":196},102167,"v0.6.0","# v0.6.0 - \"Kept you waiting, huh?\" Edition\r\n\r\n![fixed-actual1](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fde43de9b-95e8-45a0-a6ea-a09df0d4207d)\r\n\r\n\r\n## What's Changed\r\n* Fix quants installation on ROCM by @Naomiusearch in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F469\r\n* chore: add contribution guidelines + Code of Conduct by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F507\r\n* Remove `$` from the shell code blocks in README by @matthusby in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F538\r\n* [0.6.0] Release Candidate by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F481\r\n\r\n## New Contributors\r\n* @matthusby made their first contribution in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F538\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fcompare\u002Fv0.5.3...v0.6.0","2024-09-03T07:11:20",{"id":198,"version":199,"summary_zh":200,"released_at":201},102168,"v0.5.3","## What's Changed\r\nA new release, one that took too long again. We have some cool new features, however.\r\n\r\n- **ExllamaV2 tensor parallel**: You can now run ExllamaV2 quantized models on multiple GPUs. 
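The tensor parallelism behind the ExllamaV2 multi-GPU support splits each weight matrix across devices. A minimal pure-Python sketch of the column-parallel idea, for illustration only; the function names here are hypothetical, and Aphrodite's real implementation shards torch tensors across ranks rather than slicing lists:

```python
def matmul(x, w):
    # Plain nested-list matmul: x is m×k, w is k×n.
    return [[sum(x[i][t] * w[t][j] for t in range(len(w)))
             for j in range(len(w[0]))] for i in range(len(x))]

def column_shard(w, parts):
    # Split w's output columns into `parts` contiguous slices (one per "GPU").
    step = len(w[0]) // parts
    return [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]

def column_parallel_matmul(x, w, parts=2):
    shards = column_shard(w, parts)            # each device holds one slice
    partials = [matmul(x, s) for s in shards]  # independent partial matmuls
    # Concatenating the partial outputs recovers the full result.
    return [sum((p[i] for p in partials), []) for i in range(len(x))]
```

Because the shards are independent until the final concatenation, each device only stores and multiplies its own slice of the weights.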
This should be the fastest multi-gpu experience with exllamav2 models.\r\n- **Support for Command-R+**\r\n- **Support for DBRX**\r\n- **Support for Llama-3**\r\n- **Support for Qwen 2 MoE**\r\n- **`min_tokens` sampling param**: You can now set a minimum amount of tokens to generate.\r\n- **Fused MoE for AWQ and GPTQ quants**: AWQ and GPTQ kernels have been updated with optimized fused MoE code. They should be significantly faster now.\r\n- **CMake build system**: Slightly faster, much cleaner builds.\r\n- **CPU support**: You can now run aphrodite on CPU only systems! Needs an AVX512-compatible CPU for now.\r\n- **Speculative Decoding**: Speculative Decoding is finally here! You can either use a draft model, or use prompt lookup decoding with an ngram model (built-in).\r\n- **Chunked Prefill**: Before this, Aphrodite would process prompts in chunks equal to the model's context length. Now, you can enable this option (via `--enable-chunked-prefill`) to process in chunks of 768 by default, massively increasing the amount of context you can fit. Does not currently work with context shift or FP8 KV cache.\r\n- **Context Shift reworked**: Context shift finally works now. Enable it with `--context-shift` and Aphrodite will cache processed prompts and re-use them.\r\n- **FP8 E4M3 KV Cache**: This is for ROCm only. Support will be extended to NVIDIA soon. E4M3 has higher quality compared to E5M2, but doesn't lead to any throughput increase.\r\n- **Auto-truncation in API**: The API server can now optionally left-truncate your prompts. Simply pass `truncate_prompt_tokens=1024` to truncate any prompt larger than 1024 tokens.\r\n- **Support for Llava vision models**: Currently 1.5 is supported. 
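The `min_tokens` and `truncate_prompt_tokens` options are plain request parameters. A hedged sketch of a completions payload using both; the surrounding field names (`model`, `prompt`, `max_tokens`) follow the usual OpenAI-compatible convention and are assumptions here, as are the particular values:

```python
import json

def build_completion_request(prompt: str) -> dict:
    # Only `min_tokens` and `truncate_prompt_tokens` come from these release
    # notes; the rest of the payload shape is the standard OpenAI convention.
    return {
        "model": "meta-llama/Meta-Llama-3-8B",  # any model the server loaded
        "prompt": prompt,
        "max_tokens": 256,
        "min_tokens": 16,                # generate at least 16 tokens
        "truncate_prompt_tokens": 1024,  # left-truncate prompts over 1024 tokens
    }

payload = json.dumps(build_completion_request("Once upon a time"))
```

The JSON body would then be POSTed to the server's completions route as usual.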
With the next release, we should have 1.6 along with a proper GPT4-V compatible API.\r\n- **LM Format Enforcer**: You can now use LMFE for guided generations.\r\n- **EETQ Quantization**: EETQ support has been added - a SOTA 8bit quantization method.\r\n- **Arbitrary GGUF model support**: We were limited to only Llama models for GGUF, now any GGUF is supported. You will need to convert the model beforehand for them, however.\r\n- **Aphrodite CLI app**: You no longer have to type `python -m aphrodite...`. Simply type `aphrodite run meta-llama\u002FMeta-Llama-3-8B` to get started. Pass extra flags as normal.\r\n- **Sharded GGUF support**: You can now load sharded GGUF models. Pre-conversion needed.\r\n- **NVIDIA P100\u002FGP100 support**: Support has been restored.\r\n\r\nThanks to all the new contributors!\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fcompare\u002Fv0.5.2...v0.5.3","2024-05-11T22:34:52",{"id":203,"version":204,"summary_zh":205,"released_at":206},102169,"v0.5.2","## What's Changed\r\n\r\nA few fixes and new additions:\r\n\r\n- **Support for CohereAI's command-r model**: Currently, GGUF is unsupported. You can load the base model with `--load-in-4bit` or `--load-in-smooth` if you have an RTX 20xx series (or sm_75).\r\n- Fix an issue where some GPU blocks were missing. This should give a significant boost to how much context you can use.\r\n- Fix logprobs when -inf with some models.\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fcompare\u002Fv0.5.1...v0.5.2","2024-03-16T22:50:54",{"id":208,"version":209,"summary_zh":210,"released_at":211},102170,"v0.5.1","## What's Changed\r\n* feat(openai): Apply chat template for GGUF loader by @drummerv in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F312\r\n* Calculate total memory usage. 
by @sgsdxzy in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F316\r\n* chore: add new iMatrix quants by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F320\r\n* fix: optimize AQLM dequantization by @AlpinDale in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F325\r\n\r\n## New Contributors\r\n* @drummerv made their first contribution in https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fpull\u002F312\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FPygmalionAI\u002Faphrodite-engine\u002Fcompare\u002Fv0.5.0...v0.5.1","2024-03-15T02:55:39",{"id":213,"version":214,"summary_zh":215,"released_at":216},102171,"v0.5.0","# Aphrodite Engine, Release v0.5.0: It's Quantin' Time Edition\r\n\r\nIt's been over a month since our last release. Below is re-written using Opus from my crude hand-written release notes.\r\n\r\n## New Features\r\n\r\n- **Exllamav2 Quantization**: Exllamav2 quantization has been added, although it's currently limited to a single GPU due to kernel constraints.\r\n\r\n- **On-the-Fly Quantization**: With the help of `bitsandbytes` and `smoothquant+`, we now support on-the-fly quantization of FP16 models. Use `--load-in-4bit` for lightning-fast 4-bit quantization with `smoothquant+`, `--load-in-smooth` for 8-bit quantization using `smoothquant+`, and `--load-in-8bit` for 8-bit quantization using the `bitsandbytes` library (note: this option is quite slow). `--load-in-4bit` needs Ampere GPUs and above, the other two need Turing and above.\r\n\r\n- **Marlin Quantization**: Marlin quantization support has arrived, promising improved speeds at high batch sizes. Convert your GPTQ models to Marlin, but keep in mind that they must be 4-bit, with a group_size of -1 or 128, and act_order set to False.\r\n\r\n- **AQLM Quantization**: We now support the state-of-the-art 2-bit quantization scheme, AQLM. 
Please note that both quantization and inference are extremely slow with this method. Quantizing llama-2 70b on 8x A100s reportedly takes 12 days, and on a single 3090 it takes 70 seconds to reach the prompt processing phase. Use this option with caution, as the wait process may cause the engine to timeout (set to 60 seconds).\r\n\r\n- **INT8 KV Cache Quantization**: In addition to fp8_e5m2, we now support INT8 KV Cache. Unlike FP8, it doesn't speed up the throughput (it stays the same), but should offer higher quality, due to the calibration process. Uses the `smoothquant` algorithm for the quantization.\r\n\r\n- **Implicit GGUF Model Conversion**: Simply point the `--model` flag to your GGUF file, and it will work out of the box. Be aware that this process requires a considerable amount of RAM to load the model, convert tensors to a PyTorch state_dict, and then load them. Plan accordingly or convert first if you're short on RAM.\r\n\r\n- **LoRA support in the API**: The API now supports loading and inferencing LoRAs! Please refer to the wiki for detailed instructions.\r\n\r\n- **New Model Support**: We've added support for a wide range of models, including OPT, Baichuan, Bloom, ChatGLM, Falcon, Gemma, GPT2, GPT Bigcode, InternLM2, MPT, OLMo, Qwen, Qwen2, and StableLM.\r\n\r\n- **Fused Mixtral MoE**: Mixtral models (FP16 only) now utilize tensor parallelism with fused kernels, replacing the previous expert parallelism approach. 
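For context on the fused top-k MoE work: per token, the router picks the k highest-scoring experts and mixes their outputs with softmax weights over just those k logits. A pure-Python stand-in for what the custom CUDA kernels accelerate, in the common Mixtral-style formulation; shown for illustration, not as the engine's actual code:

```python
import math

def topk_routing(router_logits: list[float], k: int) -> tuple[list[int], list[float]]:
    # Indices of the k largest logits: the experts this token is routed to.
    experts = sorted(range(len(router_logits)),
                     key=lambda i: router_logits[i], reverse=True)[:k]
    # Softmax over just those k logits gives the mixing weights.
    top = max(router_logits[i] for i in experts)
    exps = [math.exp(router_logits[i] - top) for i in experts]
    total = sum(exps)
    return experts, [e / total for e in exps]
```

The fused kernels do this selection and normalization in one pass on the GPU instead of a `torch.topk` call followed by separate softmax and gather steps.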
Quantized Mixtrals still have this limitation, but we plan to address it by the next release.\r\n\r\n- **Fused Top-K Kernels for MoE**: This improvement benefits Mixtral and DeepSeek-MoE models by accelerating the top-k operation using custom CUDA kernels instead of `torch.topk`.\r\n\r\n- **Enhanced OpenAI Endpoint**: The OpenAI endpoint has been refactored, introducing JSON and Regex schemas, as well as a detokenization endpoint.\r\n\r\n- **LoRA Support for Mixtral Models**: You can now use LoRA with Mixtral models.\r\n\r\n- **Fine-Grained Seeds**: Introduce randomness to your requests with per-request seeds.\r\n\r\n- **Context Shift**: We have a naive context shifting mechanism. While it's not as effective as we'd like, it's available for experimentation purposes. Enable it using the `--context-shift` flag.\r\n\r\n- **Cubic Sampling**: Building upon quadratic sampling's smoothing_factor, we now support smoothing_curve.\r\n\r\n- **Navi AMD GPU Support**: GPUs like the 7900 XTX are now supported, although still experimental and requiring significant compilation efforts due to xformers.\r\n\r\n- **Kobold API Deprecation**: The Kobold API has been deprecated and merged into the OpenAI API. Launch the OpenAI API using the `--launch-kobold-api` flag. 
Please note that Kobold routes are not protected with the API key.\r\n\r\n- **LoRA Support for Quantized Models**: We've added LoRA support for GPTQ and AWQ quantized models.\r\n\r\n- **Logging Experience Overhaul**: We've revamped the logging experience using a custom `loguru` class, inspired by tabbyAPI's recent changes.\r\n\r\n- **Informative Logging Metrics**: Logging has been enhanced to display model memory usage and reduce display bloat, among other improvements.\r\n\r\n- **Ray Worker Health Check**: The engine now performs health checks on Ray workers, promptly reporting any silent failures or timeouts.\r\n\r\n## Bug Fixes\r\n\r\n- Resolved an issue where `smoothing_factor` would break at high batch sizes.\r\n- Fixed a bug with LoRA vocab embeddings.\r\n- Addressed the missing CUDA suffixes in the version number (e.g., `0.5.0+cu118`). The suffix is now appended when using a CUDA version other than 12.1.\r\n- Dynatemp has been split into min\u002Fmax from range. The Kobold endpoint still accepts a range as input.\r\n- Fixed worker initialization in WSL.\r\n- Removed the accidental inclusion of FP8 kernels in the ROCm build process.\r\n- The EOS token is now remov","2024-03-11T16:37:01",[218,228,238,246,254,266],{"id":219,"name":220,"github_repo":221,"description_zh":222,"stars":223,"difficulty_score":57,"last_commit_at":224,"category_tags":225,"status":83},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 
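The dynatemp split into min/max (from a single range) can be pictured as symmetric bounds around the base temperature. A minimal sketch of one plausible conversion, clamped at zero; the helper name and the exact mapping are assumptions for illustration, and the engine's actual handling (including the Kobold endpoint's range input) may differ:

```python
def dynatemp_range_to_min_max(temperature: float, dynatemp_range: float) -> tuple[float, float]:
    # Assumed symmetric bounds: [temperature - range, temperature + range],
    # clamped so the minimum temperature never goes negative.
    dynatemp_min = max(0.0, temperature - dynatemp_range)
    dynatemp_max = temperature + dynatemp_range
    return dynatemp_min, dynatemp_max
```

Explicit min/max bounds avoid the ambiguity of a single range value when the base temperature sits close to zero.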