[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tool-flashinfer-ai--flashinfer":3,"similar-flashinfer-ai--flashinfer":211},{"id":4,"github_repo":5,"name":6,"description_en":7,"description_zh":8,"ai_summary_zh":9,"readme_en":10,"readme_zh":11,"quickstart_zh":12,"use_case_zh":13,"hero_image_url":14,"owner_login":15,"owner_name":16,"owner_avatar_url":17,"owner_bio":18,"owner_company":19,"owner_location":19,"owner_email":19,"owner_twitter":19,"owner_website":20,"owner_url":21,"languages":22,"stars":50,"forks":51,"last_commit_at":52,"license":53,"difficulty_score":54,"env_os":55,"env_gpu":56,"env_ram":57,"env_deps":58,"category_tags":63,"github_topics":66,"view_count":54,"oss_zip_url":19,"oss_zip_packed_at":19,"status":77,"created_at":78,"updated_at":79,"faqs":80,"releases":110},1098,"flashinfer-ai\u002Fflashinfer","flashinfer","FlashInfer: Kernel Library for LLM Serving","FlashInfer 是一个专注于大语言模型（LLM）推理的高性能GPU内核库，提供统一的API接口支持注意力、矩阵乘法（GEMM）和混合专家（MoE）等关键操作。它通过优化内核设计和多后端适配（如FlashAttention-2\u002F3、cuDNN、CUTLASS等），显著提升不同GPU架构下的推理效率。针对动态批处理、低精度计算（FP8\u002FFP4）和内存管理等场景，FlashInfer 提供了高效解决方案，帮助用户在保持模型精度的同时降低资源消耗。\n\n该工具解决了传统LLM服务中常见的性能瓶颈问题，例如在大规模推理任务中因内存不足导致的延迟、多硬件平台兼容性差以及高精度计算带来的算力浪费。其支持的SM75及以上架构覆盖主流GPU，结合CUDAGraph和torch.compile技术，可满足生产环境中对低延迟和高吞吐的需求。\n\n适合需要优化大模型推理性能的开发者、研究人员及企业用户。对于开发者而言，FlashInfer 提供了灵活的后端选择和高效的内核调优能力；研究人员可借此探索不同算法在硬件上的表现；而企业用户则能通过其低资源消耗特性实现更经济的模型部署。其独特的技术亮点包括：支持多种","FlashInfer 是一个专注于大语言模型（LLM）推理的高性能GPU内核库，提供统一的API接口支持注意力、矩阵乘法（GEMM）和混合专家（MoE）等关键操作。它通过优化内核设计和多后端适配（如FlashAttention-2\u002F3、cuDNN、CUTLASS等），显著提升不同GPU架构下的推理效率。针对动态批处理、低精度计算（FP8\u002FFP4）和内存管理等场景，FlashInfer 提供了高效解决方案，帮助用户在保持模型精度的同时降低资源消耗。\n\n该工具解决了传统LLM服务中常见的性能瓶颈问题，例如在大规模推理任务中因内存不足导致的延迟、多硬件平台兼容性差以及高精度计算带来的算力浪费。其支持的SM75及以上架构覆盖主流GPU，结合CUDAGraph和torch.compile技术，可满足生产环境中对低延迟和高吞吐的需求。\n\n适合需要优化大模型推理性能的开发者、研究人员及企业用户。对于开发者而言，FlashInfer 提供了灵活的后端选择和高效的内核调优能力；研究人员可借此探索不同算法在硬件上的表现；而企业用户则能通过其低资源消耗特性实现更经济的模型部署。其独特的技术亮点包括：支持多种注意力机制的高效实现、针对MoE的量化优化、以及跨节点的分布式通信能力，为大模型服务提供了全面的技术支撑。","\u003Cp align=\"center\">\n  \u003Cpicture>\n    \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fweb-data\u002Fblob\u002Fmain\u002Flogo\u002FFlashInfer-black-background.png?raw=true\">\n    \u003Cimg alt=\"FlashInfer\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fflashinfer-ai_flashinfer_readme_60e70166cc9b.png\" width=55%>\n  \u003C\u002Fpicture>\n\u003C\u002Fp>\n\u003Ch1 align=\"center\">\nHigh-Performance GPU Kernels for Inference\n\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n| \u003Ca href=\"https:\u002F\u002Fdocs.flashinfer.ai\">\u003Cb>Documentation\u003C\u002Fb>\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fflashinfer-ai\u002Fflashinfer\u002Freleases\u002Flatest\">\u003Cb>Latest Release\u003C\u002Fb>\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fflashinfer.ai\">\u003Cb>Blog\u003C\u002Fb>\u003C\u002Fa> | \u003Ca href=\"https:\u002F\u002Fjoin.slack.com\u002Ft\u002Fflashinfer\u002Fshared_invite\u002Fzt-379wct3hc-D5jR~1ZKQcU00WHsXhgvtA\">\u003Cb>Slack\u003C\u002Fb>\u003C\u002Fa> |  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Forgs\u002Fflashinfer-ai\u002Fdiscussions\">\u003Cb>Discussion Forum\u003C\u002Fb>\u003C\u002Fa> |\n\u003C\u002Fp>\n\n[![Build 

[![Build Status](https://oss.gittoolsai.com/images/flashinfer-ai_flashinfer_readme_1ae479a4126b.png)](https://ci.tlcpack.ai/job/flashinfer-ci/job/main/)
[![Documentation](https://github.com/flashinfer-ai/flashinfer/actions/workflows/build-doc.yml/badge.svg)](https://github.com/flashinfer-ai/flashinfer/actions/workflows/build-doc.yml)

**FlashInfer** is a library and kernel generator for inference that delivers state-of-the-art performance across diverse GPU architectures. It provides unified APIs for attention, GEMM, and MoE operations with multiple backend implementations including FlashAttention-2/3, cuDNN, CUTLASS, and TensorRT-LLM.

## Why FlashInfer?

- **State-of-the-art Performance**: Optimized kernels for prefill, decode, and mixed batching scenarios
- **Multiple Backends**: Automatically selects the best backend for your hardware and workload
- **Modern Architecture Support**: Runs on SM75 (Turing) and later, through Blackwell
- **Low-Precision Compute**: FP8 and FP4 quantization for attention, GEMM, and MoE operations
- **Production-Ready**: CUDAGraph and torch.compile compatible for low-latency serving

## Core Features

### Attention Kernels
- **Paged and Ragged KV-Cache**: Efficient memory management for dynamic batch serving (sketched below)
- **Decode, Prefill, and Append**: Optimized kernels for all attention phases
- **MLA Attention**: Native support for DeepSeek's Multi-Latent Attention
- **Cascade Attention**: Memory-efficient hierarchical KV-Cache for shared prefixes
- **Sparse Attention**: Block-sparse and variable block-sparse patterns
- **POD-Attention**: Fused prefill+decode for mixed batching
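
These attention wrappers follow a two-phase plan/run pattern: scheduling metadata is computed once per batch, then the kernel runs against the paged cache. Below is a minimal sketch of batch decoding with `BatchDecodeWithPagedKVCacheWrapper`; the page-table values, tensor shapes, and `plan` keyword arguments are illustrative assumptions, so consult the API reference for your installed version.

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, page_size = 32, 4, 128, 16
batch_size, max_num_pages = 2, 8

# Workspace buffer for the wrapper's internal scheduling state
workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")

# Page table: request i owns pages kv_indices[kv_indptr[i]:kv_indptr[i+1]]
kv_indptr = torch.tensor([0, 5, 8], dtype=torch.int32, device="cuda")
kv_indices = torch.arange(8, dtype=torch.int32, device="cuda")
kv_last_page_len = torch.tensor([9, 3], dtype=torch.int32, device="cuda")  # fill of each last page

# Plan once per batch; NOTE: exact plan() kwargs vary across FlashInfer versions
wrapper.plan(
    kv_indptr, kv_indices, kv_last_page_len,
    num_qo_heads, num_kv_heads, head_dim, page_size,
    data_type=torch.float16,
)

q = torch.randn(batch_size, num_qo_heads, head_dim, device="cuda", dtype=torch.float16)
# Paged KV-cache, NHD layout: [max_num_pages, 2 (K/V), page_size, num_kv_heads, head_dim]
kv_cache = torch.randn(max_num_pages, 2, page_size, num_kv_heads, head_dim,
                       device="cuda", dtype=torch.float16)

out = wrapper.run(q, kv_cache)  # [batch_size, num_qo_heads, head_dim]
```

The plan/run split is also what makes these wrappers friendly to CUDAGraph capture: planning stays on the host each step, while the replayed graph contains only the kernel launch.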

### GEMM & Linear Operations
- **BF16 GEMM**: BF16 matrix multiplication for SM10.0+ GPUs
- **FP8 GEMM**: Per-tensor and groupwise scaling
- **FP4 GEMM**: NVFP4 and MXFP4 matrix multiplication for Blackwell GPUs
- **Grouped GEMM**: Efficient batched matrix operations for LoRA and multi-expert routing

### Mixture of Experts (MoE)
- **Fused MoE Kernels**
- **Multiple Routing Methods**: DeepSeek-V3, Llama-4, and standard top-k routing
- **Quantized MoE**: FP8 and FP4 expert weights with block-wise scaling

### Sampling & Decoding
- **Sorting-Free Sampling**: Efficient Top-K, Top-P, and Min-P without sorting
- **Speculative Decoding**: Chain speculative sampling support

### Communication
- **AllReduce**: Custom implementations
- **Multi-Node NVLink**: MNNVL support for multi-node inference
- **NVSHMEM Integration**: For distributed memory operations

### Other Operators
- **RoPE**: LLaMA-style rotary position embeddings (including LLaMA 3.1)
- **Normalization**: RMSNorm, LayerNorm, Gemma-style fused operations
- **Activations**: SiLU, GELU with fused gating

## GPU Support

| Architecture | Compute Capability | Example GPUs |
|--------------|--------------------|--------------|
| Turing | SM 7.5 | T4, RTX 20 series |
| Ampere | SM 8.0, 8.6 | A100, A10, RTX 30 series |
| Ada Lovelace | SM 8.9 | L4, L40, RTX 40 series |
| Hopper | SM 9.0 | H100, H200 |
| Blackwell | SM 10.0, 10.3 | B200, B300 |
| Blackwell | SM 12.0, 12.1 | RTX 50 series, DGX Spark, Jetson Thor |

> **Note:** Not all features are supported across all compute capabilities.

## News

Latest: [![GitHub Release](https://img.shields.io/github/v/release/flashinfer-ai/flashinfer)](https://github.com/flashinfer-ai/flashinfer/releases/latest)

Notable updates:
- [2025-10-08] Blackwell support added in [v0.4.0](https://github.com/flashinfer-ai/flashinfer/releases/tag/v0.4.0)
- [2025-03-10] [Blog post](https://flashinfer.ai/2025/03/10/sampling.html) on sorting-free GPU kernels for LLM sampling, explaining the design of FlashInfer's sampling kernels

## Getting Started

### Installation

**Quickstart:**

```bash
pip install flashinfer-python
```

**Package Options:**

- **flashinfer-python**: Core package that compiles/downloads kernels on first use
- **flashinfer-cubin**: Pre-compiled kernel binaries for all supported GPU architectures
- **flashinfer-jit-cache**: Pre-built kernel cache for specific CUDA versions

**For faster initialization and offline usage**, install the optional packages to have most kernels pre-compiled:

```bash
pip install flashinfer-python flashinfer-cubin
# JIT cache (replace cu129 with your CUDA version)
pip install flashinfer-jit-cache --index-url https://flashinfer.ai/whl/cu129
```

### Verify Installation

```bash
flashinfer show-config
```

### Basic Usage

```python
import torch
import flashinfer

# Single-request decode attention
q = torch.randn(32, 128, device="cuda", dtype=torch.float16)  # [num_qo_heads, head_dim]
k = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16)  # [kv_len, num_kv_heads, head_dim]
v = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16)

output = flashinfer.single_decode_with_kv_cache(q, k, v)
```

See the [documentation](https://docs.flashinfer.ai/) for a comprehensive API reference and tutorials.
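
For prompt processing, the single-request prefill entry point mirrors the decode call above. A minimal sketch follows; the shapes are assumed analogous to the decode example, with an added query-length dimension, and `causal=True` applies the standard autoregressive mask:

```python
import torch
import flashinfer

qo_len, kv_len, num_heads, head_dim = 128, 2048, 32, 128

q = torch.randn(qo_len, num_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(kv_len, num_heads, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(kv_len, num_heads, head_dim, device="cuda", dtype=torch.float16)

# Causal prefill attention over the whole prompt; one output row per query position
out = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)
print(out.shape)  # torch.Size([128, 32, 128])
```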

### Install from Source

```bash
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer
python -m pip install -v .
```

**For development**, install in editable mode:

```bash
python -m pip install --no-build-isolation -e . -v
```

> **Note:** When using `--no-build-isolation`, pip does not automatically install build dependencies. FlashInfer requires `setuptools>=77`. If you encounter an error like `AttributeError: module 'setuptools.build_meta' has no attribute 'prepare_metadata_for_build_editable'`, upgrade pip and setuptools first:
> ```bash
> python -m pip install --upgrade pip setuptools
> ```

Build optional packages:

```bash
# flashinfer-cubin
cd flashinfer-cubin
python -m build --no-isolation --wheel
python -m pip install dist/*.whl
```

```bash
# flashinfer-jit-cache (customize for your target GPUs)
export FLASHINFER_CUDA_ARCH_LIST="7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f"
cd flashinfer-jit-cache
python -m build --no-isolation --wheel
python -m pip install dist/*.whl
```

For more details, see the [Install from Source documentation](https://docs.flashinfer.ai/installation.html#install-from-source).

### Nightly Builds

```bash
pip install -U --pre flashinfer-python --index-url https://flashinfer.ai/whl/nightly/ --no-deps
pip install flashinfer-python  # Install dependencies from PyPI
pip install -U --pre flashinfer-cubin --index-url https://flashinfer.ai/whl/nightly/
# JIT cache (replace cu129 with your CUDA version)
pip install -U --pre flashinfer-jit-cache --index-url https://flashinfer.ai/whl/nightly/cu129
```

### CLI Tools

FlashInfer provides several CLI commands for configuration, module management, and development:

```bash
# Verify installation and view configuration
flashinfer show-config

# List and inspect modules
flashinfer list-modules
flashinfer module-status

# Manage artifacts and cache
flashinfer download-cubin
flashinfer clear-cache

# For developers: generate compile_commands.json for IDE integration
flashinfer export-compile-commands [output_path]
```

For complete documentation, see the [CLI reference](https://docs.flashinfer.ai/cli.html).

## API Logging

FlashInfer provides comprehensive API logging for debugging. Enable it using environment variables:

```bash
# Enable logging (levels: 0=off (default), 1=basic, 3=detailed, 5=statistics)
export FLASHINFER_LOGLEVEL=3

# Set log destination (stdout (default), stderr, or file path)
export FLASHINFER_LOGDEST=stdout
```

For detailed information about logging levels, configuration, and advanced features, see [Logging](https://docs.flashinfer.ai/logging.html) in our documentation.
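
Assuming these variables are read when FlashInfer initializes (as is typical for environment-driven logging), they can also be set from Python ahead of the import. A small sketch using only the documented variable names:

```python
import os

# Set before importing flashinfer so the logger picks the values up
# (assumption: the variables are read at import time, not re-read later)
os.environ["FLASHINFER_LOGLEVEL"] = "3"              # detailed logging
os.environ["FLASHINFER_LOGDEST"] = "flashinfer.log"  # log to a file instead of stdout

import flashinfer  # noqa: E402  (imported deliberately after env setup)
```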

## Custom Attention Variants

Users can customize their own attention variants with additional parameters. For more details, refer to our [JIT examples](https://github.com/flashinfer-ai/flashinfer/blob/main/tests/utils/test_jit_example.py).

## CUDA Support

**Supported CUDA Versions:** 12.6, 12.8, 13.0, 13.1

> **Note:** FlashInfer strives to follow PyTorch's supported CUDA versions plus the latest CUDA release.

## Adoption

FlashInfer powers inference in:

- [SGLang](https://github.com/sgl-project/sglang)
- [vLLM](https://github.com/vllm-project/vllm)
- [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM)
- [TGI (Text Generation Inference)](https://github.com/huggingface/text-generation-inference)
- [MLC-LLM](https://github.com/mlc-ai/mlc-llm)
- [LightLLM](https://github.com/ModelTC/lightllm)
- [lorax](https://github.com/predibase/lorax)
- [ScaleLLM](https://github.com/vectorch-ai/ScaleLLM)

## Acknowledgement

FlashInfer is inspired by [FlashAttention](https://github.com/dao-AILab/flash-attention/), [vLLM](https://github.com/vllm-project/vllm), [stream-K](https://arxiv.org/abs/2301.03598), [CUTLASS](https://github.com/nvidia/cutlass), and [AITemplate](https://github.com/facebookincubator/AITemplate).

## Citation

If you find FlashInfer helpful in your project or research, please consider citing our [paper](https://arxiv.org/abs/2501.01005):

```bibtex
@article{ye2025flashinfer,
    title = {FlashInfer: Efficient and Customizable Attention Engine for LLM Inference Serving},
    author = {
      Ye, Zihao and
      Chen, Lequn and
      Lai, Ruihang and
      Lin, Wuwei and
      Zhang, Yineng and
      Wang, Stephanie and
      Chen, Tianqi and
      Kasikci, Baris and
      Grover, Vinod and
      Krishnamurthy, Arvind and
      Ceze, Luis
    },
    journal = {arXiv preprint arXiv:2501.01005},
    year = {2025},
    url = {https://arxiv.org/abs/2501.01005}
}
```

---

# FlashInfer Quickstart Guide

## Environment Preparation

### System Requirements
- **GPU architecture**: Turing (SM75) or newer (e.g. RTX 30/40 series, A100/H100/B200)
- **CUDA version**: 12.6, 12.8, 13.0, or 13.1 (must match your PyTorch build)
- **Operating system**: Linux (Ubuntu 20.04+ recommended)

### Prerequisites
```bash
# Install base dependencies
sudo apt-get update && sudo apt-get install -y git python3-pip
# Upgrade pip and setuptools (required for source installs)
python -m pip install --upgrade pip setuptools
# Install PyTorch (must match your CUDA version)
pip install torch --index-url https://download.pytorch.org/whl/cu129
```

---

## Installation

### Recommended Install (with precompiled kernels)
```bash
# Core package + precompiled kernels (users in China may prefer the Tsinghua mirror)
pip install flashinfer-python flashinfer-cubin -i https://pypi.tuna.tsinghua.edu.cn/simple

# JIT cache (replace cu129 with your actual CUDA version)
pip install flashinfer-jit-cache --index-url https://flashinfer.ai/whl/cu129
```

### Install from Source (requires unrestricted network access)
```bash
# Clone the repository
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
cd flashinfer

# Build and install (users in China may want a mirror for build dependencies)
python -m pip install -v .
# Editable (development) install
python -m pip install --no-build-isolation -e . -v
```

### Verify the Installation
```bash
flashinfer show-config
# The output should include GPU architecture information and the CUDA version
```

---

## Basic Usage

### Single-Request Decode Attention Example
```python
import torch
import flashinfer

# Create test tensors
q = torch.randn(32, 128, device="cuda", dtype=torch.float16)  # [num_qo_heads, head_dim]
k = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16)  # [kv_len, num_kv_heads, head_dim]
v = torch.randn(2048, 32, 128, device="cuda", dtype=torch.float16)

# Run decode attention
output = flashinfer.single_decode_with_kv_cache(q, k, v)
print(output.shape)  # torch.Size([32, 128])
```
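
As a sanity check on what `single_decode_with_kv_cache` computes, the reference is ordinary scaled dot-product attention with a single query token. The sketch below compares FlashInfer's output against a plain-PyTorch implementation; equal query/KV head counts avoid grouped-query broadcasting, and the fp16 tolerances are an assumption:

```python
import math
import torch
import flashinfer

num_heads, head_dim, kv_len = 32, 128, 2048
q = torch.randn(num_heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn(kv_len, num_heads, head_dim, device="cuda", dtype=torch.float16)
v = torch.randn(kv_len, num_heads, head_dim, device="cuda", dtype=torch.float16)

out = flashinfer.single_decode_with_kv_cache(q, k, v)

# Reference: softmax(q k^T / sqrt(d)) v per head, computed in fp32 for stability
qf = q.float()                    # [H, D]
kf = k.float().permute(1, 0, 2)   # [H, L, D]
vf = v.float().permute(1, 0, 2)   # [H, L, D]
scores = torch.einsum("hd,hld->hl", qf, kf) / math.sqrt(head_dim)
ref = torch.einsum("hl,hld->hd", torch.softmax(scores, dim=-1), vf)

# Loose fp16 tolerance (assumed, not a documented guarantee)
torch.testing.assert_close(out.float(), ref, rtol=2e-2, atol=2e-2)
print("decode output matches the PyTorch reference")
```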

> **Tips:**
> 1. Users in China can append `-i https://pypi.tuna.tsinghua.edu.cn/simple` to pip commands to install via the Tsinghua mirror
> 2. The CUDA version must exactly match the one your PyTorch build was compiled against
> 3. See the [official documentation](https://docs.flashinfer.ai/) for more advanced usage

---

# Use Case: Real-Time Q&A Serving

An AI customer-service team is deploying a real-time question-answering system based on the LLaMA-3-8B model, handling a dynamic stream of user requests through Triton inference serving. The system must support long-context conversations and bursty traffic simultaneously, under strict latency and throughput requirements.

### Without flashinfer

- The default decoding path in HuggingFace Transformers averaged 230 ms of latency per request, and long-text generation drove GPU memory usage up to 18 GB
- Manually managed KV caches fragmented memory, and batched processing frequently triggered OOM errors
- The A100's FP8 compute could not be exploited effectively; accuracy loss and performance gains were hard to balance
- Multi-expert routing logic had to be implemented by hand, and expert-switching overhead consumed 15% of total compute time

### With flashinfer

- The built-in POD-Attention kernel fuses prefill and decode, cutting end-to-end latency to 95 ms and memory usage to 9.2 GB
- The paged KV-cache raised memory utilization by 40%; dynamic batch sizes scale to 256 without triggering OOM
- FP8 quantization lifted matrix-multiplication throughput 2.3x, and block-wise scaling kept model accuracy essentially unchanged
- Integrating the DeepSeek-V3 routing algorithm cut expert-switching overhead to 3%, and Grouped GEMM enabled zero-copy loading of LoRA adapters

Through fine-grained memory optimization and mixed-precision compute, FlashInfer raised LLM serving throughput 2.8x while cutting single-GPU deployment cost to 58% of the previous solution.

---

# Project Metadata

- **Repository**: [flashinfer-ai/flashinfer](https://github.com/flashinfer-ai/flashinfer) · **Owner**: FlashInfer ([flashinfer-ai](https://github.com/flashinfer-ai), [flashinfer.ai](https://flashinfer.ai/))
- **Stars**: 5,306 · **Forks**: 866 · **License**: Apache-2.0 · **Last commit**: 2026-04-05
- **Languages**: Python 46.2%, Cuda 28.5%, C++ 24.1%, Jinja 0.7%, Shell 0.4%, C 0.1%, NASL 0.0%
- **OS**: Linux, macOS · **GPU**: NVIDIA GPU with 8 GB+ VRAM, CUDA 12.6+ · **RAM**: not specified
- **Python**: 3.8+ · **Dependencies**: torch>=2.0
- **Environment notes**: Requires the NVIDIA driver and CUDA toolkit; installing flashinfer-cubin is recommended for precompiled kernel support. The first run downloads roughly 5 GB of kernel files and needs a network connection.
- **Categories**: development frameworks, language models
- **GitHub topics**: gpu, large-language-models, cuda, pytorch, llm-inference, jit, attention, nvidia, distributed-inference, moe

---

# FAQ

**Q: Loading the gemma-2-27b model fails with an "Out of workspace memory" error even though the GPU still has plenty of free memory. How do I fix it?**

A: Update vLLM to the latest main branch; the issue was fixed in commit 0badabb846315980380635baf1ad5dd0aba59137. For example: `pip install git+https://github.com/vllm-project/vllm` ([source](https://github.com/flashinfer-ai/flashinfer/issues/362))

**Q: flashinfer's rmsnorm shows a performance regression. How can I optimize it?**

A: Testing confirmed this was a measurement-methodology issue; with CUDA graphs enabled, performance matches version 0.1.6. Enable CUDA graphs and pick up the latest code; the related improvement landed in PR #969. ([source](https://github.com/flashinfer-ai/flashinfer/issues/960))

**Q: Different Python virtual environments cause JIT cache conflicts and errors. How do I resolve this?**

A: The custom-op registration code in flashinfer needs to be changed so the `torch.library.custom_op` wrapper is replaced with a compatible implementation. As a workaround, isolate the cache directories of the different environments, or upgrade to a version that supports per-environment isolation. ([source](https://github.com/flashinfer-ai/flashinfer/issues/1906))

**Q: The GDN prefill kernel produces NaN outputs. How do I debug this?**

A: Check interface consistency: prefill uses `chunk_gated_delta_rule` while decode uses a different interface, so the two should be aligned. Refer to FLA's `fused_recurrent_gated_delta_rule` implementation when making the change. ([source](https://github.com/flashinfer-ai/flashinfer/issues/2490))

**Q: POD Attention performs significantly worse than BatchPrefillWithPagedKVCache. How can I optimize it?**

A: Prefer the BatchPrefill path. For POD Attention, consult Edenzzzz's branch for batching optimizations, or adjust the shared-memory configuration to avoid illegal memory accesses. ([source](https://github.com/flashinfer-ai/flashinfer/issues/1022))

**Q: How do I use flashinfer's custom-op registration correctly and avoid an AttributeError?**

A: Modify the `register_custom_op` implementation to remove the wrapper around `torch.library.custom_op`, or make sure environments with different Python versions use separate JIT cache directories. ([source](https://github.com/flashinfer-ai/flashinfer/issues/1906))

---

# Releases

- **nightly-v0.6.7-20260405** (2026-04-05): Automated nightly build for version 0.6.7 (dev20260405)
- **nightly-v0.6.7-20260404** (2026-04-04): Automated nightly build for version 0.6.7 (dev20260404)
- **v0.6.7.post2** (2026-04-04): Full changelog: https://github.com/flashinfer-ai/flashinfer/compare/v0.6.7.post1...v0.6.7.post2
- **v0.6.7.post1** (2026-04-03): Full changelog: https://github.com/flashinfer-ai/flashinfer/compare/v0.6.7...v0.6.7.post1
- **nightly-v0.6.7-20260402** (2026-04-02): Automated nightly build for version 0.6.7 (dev20260402)
- **nightly-v0.6.7-20260401** (2026-04-01): Automated nightly build for version 0.6.7 (dev20260401)
- **nightly-v0.6.7-20260331** (2026-03-31): Automated nightly build for version 0.6.7 (dev20260331)
- **nightly-v0.6.7-20260328** (2026-03-28): Automated nightly build for version 0.6.7 (dev20260328)
- **nightly-v0.6.7-20260326** (2026-03-26): Automated nightly build for version 0.6.7 (dev20260326)

### v0.6.7 (2026-03-25)

What's Changed:

* perf(gdn): optimize MTP kernel with ILP rows and SMEM v caching by @ameynaik-hub in https://github.com/flashinfer-ai/flashinfer/pull/2618
* Feat/gdn decode pooled by @xutizhou in https://github.com/flashinfer-ai/flashinfer/pull/2521
* fix(jit): GEMM kernels produce NaN under concurrency — missing GDC flags cause PDL synchronization barriers to compile as no-ops by @voipmonitor in https://github.com/flashinfer-ai/flashinfer/pull/2716
* Support NVFP4 KV cache decode on SM120 by @Tom-Zheng in https://github.com/flashinfer-ai/flashinfer/pull/2520
* feat: Add TRTLLM fmha_v2 library for SM90 attention with Skip-Softmax by @jimmyzho in https://github.com/flashinfer-ai/flashinfer/pull/2446
* bump version to 0.6.6 by @aleozlx in https://github.com/flashinfer-ai/flashinfer/pull/2724
* [benchmark] Add All Reduce benchmark by @jiahanc in https://github.com/flashinfer-ai/flashinfer/pull/2696
* Revert "fix(jit): GEMM kernels produce NaN under concurrency — missing GDC flags cause PDL synchronization barriers to compile as no-ops" by @aleozlx in https://github.com/flashinfer-ai/flashinfer/pull/2737
* refactor: refactoring cuda code to cute-dsl (part 1) by @yzh119 in https://github.com/flashinfer-ai/flashinfer/pull/2428
* Added missing padding by @nvjullin in https://github.com/flashinfer-ai/flashinfer/pull/2726
* docker: add CUDA 13.1 Dockerfiles with cuda-tile by @yongwww in https://github.com/flashinfer-ai/flashinfer/pull/2774
* [BugFix] guard against uint32 underflow in multi-CTA TopK chunk calculation by @LopezCastroRoberto in https://github.com/flashinfer-ai/flashinfer/pull/2592
* fix: guard CUTLASS FMHA against SM12x and fix fmha_v2 SM121a check by @blake-snc in https://github.com/flashinfer-ai/flashinfer/pull/2560
* fix: fix illegal memory access for NaN input in sampling kernels by @zack041 in https://github.com/flashinfer-ai/flashinfer/pull/2456
* Add cuda-tile to package dependencies by @yzh119 in https://github.com/flashinfer-ai/flashinfer/pull/2758
* tests: skip sliding window + fp8 to prevent hang in fmha_v2 unit tests by @jimmyzho in https://github.com/flashinfer-ai/flashinfer/pull/2781
* feat: Add autotuner config caching, thread safety, and documentation by @bkryu in https://github.com/flashinfer-ai/flashinfer/pull/2554
* fix: block PR merge when CI is skipped due to pending authorization by @yongwww in https://github.com/flashinfer-ai/flashinfer/pull/2761
* [feat] Add air top-p algorithm by @qsang-nv in https://github.com/flashinfer-ai/flashinfer/pull/2752
* [chore] Add jiahanc to moe related code owner by @jiahanc in https://github.com/flashinfer-ai/flashinfer/pull/2748
* fix: Fix cute dsl moe failure with nvidia-cutlass-dsl >= 4.4.0 by @nv-yunzheq in https://github.com/flashinfer-ai/flashinfer/pull/2735
* [Spark unit test debugging] Fix for tests/attention/test_trtllm_gen_mla.py by @kahyunnam in https://github.com/flashinfer-ai/flashinfer/pull/2750
* [Spark unit test debugging] Fix for tests/gemm/test_groupwise_scaled_gemm_fp8.py by @kahyunnam in https://github.com/flashinfer-ai/flashinfer/pull/2751
* [feat] Add 2048 experts and 32 Top K by @jiahanc in https://github.com/flashinfer-ai/flashinfer/pull/2744
* perf: Performance tune cute dsl RMSNorm variants by @bkryu in https://github.com/flashinfer-ai/flashinfer/pull/2777
* feat: Add FP4 KV cache quant/dequant kernels by @samuellees in https://github.com/flashinfer-ai/flashinfer/pull/2757
* Add cute-dsl backends to mxfp[8,4]_quantization for future refactor by @bkryu in https://github.com/flashinfer-ai/flashinfer/pull/2443
* feat: FP32 dtype output for BF16 matmuls (CUTLASS & cuDNN) by @raayandhar in https://github.com/flashinfer-ai/flashinfer/pull/2644
* Create separate cuDNN handle per GPU by @dhiraj113 in https://github.com/flashinfer-ai/flashinfer/pull/2688
* CuteDSL MoE fix redundant output buffer zeroing by @leejnau in https://github.com/flashinfer-ai/flashinfer/pull/2811
* Add NVFP4 KV cache quantization support for SM100 by @sychen52 in https://github.com/flashinfer-ai/flashinfer/pull/2702
* [fix] Bugfix 1367: fix VariableBlockSparseAttention buffer overflow by dynamically resizing kv_lens_buffer by @qsang-nv in https://github.com/flashinfer-ai/flashinfer/pull/2802
* fix: Workaround org teams perm issue for approval purposes by @aleozlx in https://github.com/flashinfer-ai/flashinfer/pull/2816
* Implement override shape support for cuDNN GEMM operations by @yanqinz2 in https://github.com/flashinfer-ai/flashinfer/pull/2790
* feat: Add support for TRTLLM MXFP8 non-gated MoE with ReLU2 by @danisereb in https://github.com/flashinfer-ai/flashinfer/pull/2707
* Upgrade cutlass 4.2.1 -> 4.4.2 by @kahyunnam in https://github.com/flashinfer-ai/flashinfer/pull/2798
* chore: cute dsl nvfp4 moe clean up by @nv-yunzheq in https://github.com/flashinfer-ai/flashinfer/pull/2775
* fix: Add SM120 (RTX Blackwell desktop) support for NVFP4 MoE kernels by @brandonmmusic-max in https://github.com/flashinfer-ai/flashinfer/pull/2725
* Protect agains…

### Earlier nightly builds

- **nightly-v0.6.7-20260324** (2026-03-24): Automated nightly build for version 0.6.7 (dev20260324)
- **nightly-v0.6.6-20260323** (2026-03-23): Automated nightly build for version 0.6.6 (dev20260323)
- **nightly-v0.6.6-20260322** (2026-03-22): Automated nightly build for version 0.6.6 (dev20260322)
- **nightly-v0.6.6-20260321** (2026-03-21): Automated nightly build for version 0.6.6 (dev20260321)
- **nightly-v0.6.6-20260320** (2026-03-20): Automated nightly build for version 0.6.6 (dev20260320)
- **nightly-v0.6.6-20260319** (2026-03-19): Automated nightly build for version 0.6.6 (dev20260319)
- **nightly-v0.6.6-20260318** (2026-03-18): Automated nightly build for version 0.6.6 (dev20260318)
- **nightly-v0.6.6-20260317** (2026-03-17): Automated nightly build for version 0.6.6 (dev20260317)
- **nightly-v0.6.6-20260316** (2026-03-16): Automated nightly build for version 0.6.6 (dev20260316)
- **nightly-v0.6.6-20260315** (2026-03-15): Automated nightly build for version 0.6.6 (dev20260315)

---

# Similar Projects

### stable-diffusion-webui (AUTOMATIC1111/stable-diffusion-webui) — 162,132 stars

Tags: development frameworks, images, agent

stable-diffusion-webui is a web UI built on Gradio that lets users run and use the powerful Stable Diffusion image-generation model locally with ease. It addresses the pain points of the original model — command-line-only operation, a steep learning curve, and scattered functionality — by consolidating the complex AI drawing workflow into an intuitive graphical platform.
Casual creators who want to get started quickly, designers who need fine-grained control over image details, and developers or researchers who want to explore the model's potential can all benefit. Its core strength is an exceptionally rich feature set: beyond basic modes such as text-to-image, image-to-image, inpainting, and outpainting, it introduces advanced features like attention adjustment, prompt matrices, negative prompts, and "highres fix". It also bundles face-restoration tools such as GFPGAN and CodeFormer, supports multiple neural upscalers, and can be extended without limit through its plugin system. Even on devices with limited VRAM, stable-diffusion-webui offers optimization options, putting high-quality AI art within reach.

### everything-claude-code (affaan-m/everything-claude-code) — 140,436 stars

Tags: development frameworks, agent, language models

everything-claude-code is a high-performance optimization system built for AI coding assistants such as Claude Code, Codex, and Cursor. More than a set of config files, it is a complete, battle-tested framework that tackles the core pain points AI agents face in real development: inefficiency, lost memory, security risks, and the lack of continuous learning. With modular skills, intuition boosting, persistent memory, and built-in security scanning, it markedly improves agent performance on complex tasks and helps developers build more stable, smarter production-grade agents. Its research-first development philosophy and token-consumption optimizations make responses faster and cheaper while defending against potential attack vectors. It is aimed at software developers, AI researchers, and technical teams that want deeply customized AI workflows, whether building large codebases or using AI for security audits and automated testing. An open-source project that won an Anthropic hackathon award, it combines multi-language support with a rich set of practical hooks, letting the AI genuinely grow into…

### ComfyUI (Comfy-Org/ComfyUI) — 107,662 stars

Tags: development frameworks, images, agent

ComfyUI is a powerful, highly modular visual AI engine built for designing and executing complex Stable Diffusion image-generation pipelines. In place of traditional code-writing, it uses an intuitive node-graph interface: users build personalized generation pipelines by wiring together functional modules. This design neatly resolves the complexity and inflexibility of configuring advanced AI-drawing workflows. Users without a programming background can freely combine models, tune parameters, and preview results in real time, handling everything from basic text-to-image up to multi-stage high-resolution refinement. ComfyUI is broadly compatible, supporting Windows, macOS, and Linux across NVIDIA, AMD, Intel, and Apple Silicon hardware, and was among the first to support cutting-edge models such as SDXL, Flux, and SD3. Researchers and developers exploring algorithmic potential, as well as designers and serious AI-art enthusiasts chasing maximum creative freedom, will find strong support here. Its modular architecture lets the community keep extending it, making it one of today's most flexible open-source diffusion-model tools with the richest ecosystem, helping users turn ideas into results efficiently.

### NextChat (ChatGPTNextWeb/NextChat) — 87,618 stars

Tags: development frameworks, language models

NextChat is a light and fast AI assistant that delivers a smooth, cross-platform large-model experience. It solves the pain of losing conversation continuity when switching devices and of juggling many AI models without unified management. For everyday work, study, or creative brainstorming, users can connect seamlessly from web, iOS, Android, Windows, macOS, or Linux. It fits ordinary users, students, and professionals, as well as teams that need private deployments; for developers it offers convenient self-hosting with one-click deployment to platforms such as Vercel or Zeabur. NextChat's core strength is broad model compatibility, with native support for mainstream models including Claude, DeepSeek, GPT-4, and Gemini Pro, so users can switch between AI capabilities within a single interface. It was also an early adopter of MCP (Model Context Protocol), strengthening context handling. For enterprises, a professional edition adds brand customization, fine-grained permission control, internal knowledge-base integration, and security auditing to meet high standards for data privacy and tailored management.

### ML-For-Beginners (microsoft/ML-For-Beginners) — 84,991 stars

Tags: images, data tools, video, plugins, agent, other, language models, development frameworks, audio

ML-For-Beginners is Microsoft's structured introductory machine-learning curriculum, designed to help users with no background master classic machine learning. The course maps out a 12-week path with 26 concise lessons and 52 quizzes, covering everything from fundamentals to practical applications, and so addresses the beginner's pain of facing a vast body of knowledge without structured guidance. Developers looking to switch fields, researchers who need to fill in algorithmic background, and curious AI enthusiasts can all benefit. Beyond clear theory, the course emphasizes hands-on practice so learners build solid skills step by step. A distinctive strength is its multilingual reach: automated tooling provides editions in more than 50 languages, including Simplified Chinese, dramatically lowering the barrier for learners worldwide. The project is developed in the open with an active community and continuous updates, keeping the material current and accurate. If you want a clear, friendly, professional path into machine learning, ML-For-Beginners is an ideal starting point.

### ragflow (infiniflow/ragflow) — 77,062 stars

Tags: agent, images, development frameworks, language models, other

RAGFlow is a leading open-source retrieval-augmented generation (RAG) engine that builds a more accurate, reliable context layer for large language models. It combines state-of-the-art RAG techniques with agent capabilities, efficiently extracting knowledge from all kinds of documents and letting models reason and execute tasks on top of that knowledge. Hallucination and stale knowledge are common pain points in LLM applications. By deeply parsing complex document structure (tables, charts, and mixed layouts), RAGFlow significantly improves retrieval accuracy, reducing fabricated answers and keeping responses both grounded and current. Its built-in agent mechanism goes further, letting the system not just answer questions but autonomously plan steps to solve complex problems. The tool suits developers, enterprise technical teams, and AI researchers, whether standing up a private knowledge-base Q&A system quickly or exploring how large models can land in vertical domains. RAGFlow offers a visual workflow-orchestration UI and flexible APIs, lowering the barrier for users without an algorithms background while satisfying developers' need for deep customization. Released under the Apache 2.0 license, it is becoming an important bridge between general-purpose large models and domain-specific knowledge.