[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-vectara--hallucination-leaderboard":3,"tool-vectara--hallucination-leaderboard":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",156804,2,"2026-04-15T11:34:33",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 
都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器提供高效、易于解析的输入格式。",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":77,"owner_email":77,"owner_twitter":73,"owner_website":78,"owner_url":79,"languages":80,"stars":85,"forks":84,"last_commit_at":86,"license":87,"difficulty_score":88,"env_os":89,"env_gpu":90,"env_ram":90,"env_deps":91,"category_tags":94,"github_topics":95,"view_count":32,"oss_zip_url":77,"oss_zip_packed_at":77,"status":17,"created_at":99,"updated_at":100,"faqs":101,"releases":102},7806,"vectara\u002Fhallucination-leaderboard","hallucination-leaderboard","Leaderboard Comparing LLM Performance at Producing Hallucinations when Summarizing Short Documents","hallucination-leaderboard 是一个专注于评估大语言模型（LLM）在文档摘要任务中“幻觉”表现的公开排行榜。所谓“幻觉”，指的是模型在生成内容时编造事实或偏离原文的现象，这是当前 AI 应用落地的一大痛点。该榜单通过 Vectara 研发的专用评估模型 HHEM，量化测试了各大主流模型在总结短文时的胡说八道频率，并据此计算出幻觉率、事实一致性比率等关键指标。\n\n这一工具主要解决了开发者与研究者在选型时缺乏客观、统一的事实准确性评估标准的难题。以往大家往往只关注模型的流畅度或推理能力，而忽视了其输出内容的真实可靠性。hallucination-leaderboard 提供了直观的数据对比，帮助用户识别哪些模型更值得信赖，从而降低因 AI 胡编乱造带来的业务风险。\n\n它非常适合 AI 研究人员、大模型应用开发者以及需要高准确度信息处理的企业技术团队使用。无论是构建智能客服、新闻摘要系统还是法律文档分析工具，都能从中获得宝贵的参考数据。其独特之处在于采用了自动化的专业评估模型 HHEM 进行大规模测试，并承诺随模型迭代定期更新榜单，确保数据的时效性与权威性。对于追求高质量输出的普通用户而言，这也是一个了解不同模型“诚实度”的有趣窗口。","hallucination-leaderboard 是一个专注于评估大语言模型（LLM）在文档摘要任务中“幻觉”表现的公开排行榜。所谓“幻觉”，指的是模型在生成内容时编造事实或偏离原文的现象，这是当前 AI 应用落地的一大痛点。该榜单通过 Vectara 研发的专用评估模型 HHEM，量化测试了各大主流模型在总结短文时的胡说八道频率，并据此计算出幻觉率、事实一致性比率等关键指标。\n\n这一工具主要解决了开发者与研究者在选型时缺乏客观、统一的事实准确性评估标准的难题。以往大家往往只关注模型的流畅度或推理能力，而忽视了其输出内容的真实可靠性。hallucination-leaderboard 提供了直观的数据对比，帮助用户识别哪些模型更值得信赖，从而降低因 AI 胡编乱造带来的业务风险。\n\n它非常适合 AI 研究人员、大模型应用开发者以及需要高准确度信息处理的企业技术团队使用。无论是构建智能客服、新闻摘要系统还是法律文档分析工具，都能从中获得宝贵的参考数据。其独特之处在于采用了自动化的专业评估模型 HHEM 进行大规模测试，并承诺随模型迭代定期更新榜单，确保数据的时效性与权威性。对于追求高质量输出的普通用户而言，这也是一个了解不同模型“诚实度”的有趣窗口。","# Hallucination Leaderboard\n\nPublic LLM leaderboard computed using Vectara's Hallucination Evaluation Model, also known as HHEM. This evaluates how often an LLM introduces hallucinations when summarizing a document. We plan to update this regularly as our model and the LLMs get updated over time.\n\nFeel free to check out the [interactive hallucination leaderboard](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fvectara\u002Fleaderboard) on Hugging Face. \n\nIf you are interested in previous versions of this leaderboard:\n1. 
The first version, based on HHEM-1.0, is available [here](https:\u002F\u002Fgithub.com\u002Fvectara\u002Fhallucination-leaderboard\u002Ftree\u002Fhhem-1.0-final)\n2. The most recent version, based on the previous dataset, is available [here](https:\u002F\u002Fgithub.com\u002Fvectara\u002Fhallucination-leaderboard\u002Ftree\u002Fhhem-2.3-old-dataset)\n\n\u003Ctable style=\"border-collapse: collapse;\">\n  \u003Ctr>\n    \u003Ctd style=\"text-align: center; vertical-align: middle; border: none;\">\n      \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fvectara_hallucination-leaderboard_readme_bb7efaedaefe.png\" width=\"50\" height=\"50\">\n    \u003C\u002Ftd>\n    \u003Ctd style=\"text-align: left; vertical-align: middle; border: none;\">\n      In loving memory of \u003Ca href=\"https:\u002F\u002Fwww.ivinsfuneralhome.com\u002Fobituaries\u002FSimon-Mark-Hughes?obId=30000023\">Simon Mark Hughes\u003C\u002Fa>...\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\u003C!-- LEADERBOARD_START -->\nLast updated on March 20, 2026\n\n![Plot: hallucination rates of various LLMs](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fvectara_hallucination-leaderboard_readme_a6baf24d4e13.png)\n\n|Model|Hallucination Rate|Factual Consistency Rate|Answer Rate|Average Summary Length (Words)|\n|----|----:|----:|----:|----:|\n|antgroup\u002Ffinix_s1_32b|1.8 %|98.2 %|99.5 %|172.4|\n|openai\u002Fgpt-5.4-nano-2026-03-17|3.1 %|96.9 %|100.0 %|144.4|\n|google\u002Fgemini-2.5-flash-lite|3.3 %|96.7 %|99.5 %|95.7|\n|microsoft\u002FPhi-4|3.7 %|96.3 %|80.7 %|120.9|\n|meta-llama\u002FLlama-3.3-70B-Instruct-Turbo|4.1 %|95.9 %|99.5 %|64.6|\n|snowflake\u002Fsnowflake-arctic-instruct|4.3 %|95.7 %|62.7 %|81.4|\n|google\u002Fgemma-3-12b-it|4.4 %|95.6 %|97.4 %|89.7|\n|mistralai\u002Fmistral-large-2411|4.5 %|95.5 %|99.9 %|85.0|\n|qwen\u002Fqwen3-8b|4.8 %|95.2 %|99.9 %|83.6|\n|amazon\u002Fnova-pro-v1:0|5.1 %|94.9 %|99.3 %|66.2|\n|amazon\u002Fnova-2-lite-v1:0|5.1 %|94.9 %|99.6 %|94.1|\n|mistralai\u002Fmistral-small-2501|5.1 %|94.9 %|97.9 %|98.8|\n|ibm-granite\u002Fgranite-4.0-h-small|5.2 %|94.8 %|100.0 %|107.4|\n|ai21labs\u002Fjamba-mini-2|5.3 %|94.7 %|99.6 %|109.4|\n|deepseek-ai\u002FDeepSeek-V3.2-Exp|5.3 %|94.7 %|96.6 %|64.6|\n|qwen\u002Fqwen3-14b|5.4 %|94.6 %|99.9 %|111.1|\n|amazon\u002Fnova-micro-v1:0|5.5 %|94.5 %|100.0 %|100.0|\n|deepseek-ai\u002FDeepSeek-V3.1|5.5 %|94.5 %|94.5 %|63.7|\n|openai\u002Fgpt-5.4-mini-2026-03-17|5.5 %|94.5 %|100.0 %|54.7|\n|openai\u002Fgpt-4.1-2025-04-14|5.6 %|94.4 %|99.9 %|91.7|\n|qwen\u002Fqwen3-4b|5.7 %|94.3 %|99.9 %|104.7|\n|xai-org\u002Fgrok-3|5.8 %|94.2 %|93.0 %|95.9|\n|qwen\u002Fqwen3-32b|5.9 %|94.1 %|99.9 %|115.8|\n|amazon\u002Fnova-lite-v1:0|6.1 %|93.9 %|99.9 %|91.8|\n|deepseek-ai\u002FDeepSeek-V3|6.1 %|93.9 %|97.5 %|81.7|\n|deepseek-ai\u002FDeepSeek-V3.2|6.3 %|93.7 %|92.6 %|62.0|\n|google\u002Fgemma-3-4b-it|6.4 %|93.6 %|67.3 %|77.4|\n|CohereLabs\u002Fcommand-r-plus-08-2024|6.9 %|93.1 %|95.0 %|91.5|\n|arcee-ai\u002Ftrinity-large-preview|6.9 %|93.1 %|99.0 %|117.3|\n|openai\u002Fgpt-5.4-2026-03-05|7.0 %|93.0 %|99.9 %|81.7|\n|google\u002Fgemini-2.5-pro|7.0 %|93.0 %|99.1 %|106.4|\n|mistralai\u002Fministral-3b-2410|7.3 %|92.7 %|99.9 %|167.9|\n|google\u002Fgemma-3-27b-it|7.4 %|92.6 %|98.8 %|96.4|\n|mistralai\u002Fministral-8b-2410|7.4 %|92.6 %|99.9 %|196.0|\n|meta-llama\u002FLlama-4-Scout-17B-16E-Instruct|7.7 %|92.3 %|99.0 %|137.3|\n|google\u002Fgemini-2.5-flash|7.8 %|92.2 %|99.0 %|101.5|\n|meta-llama\u002FLlama-4-Maverick-17B-128E-Instruct-FP8|8.2 %|91.8 %|100.0 
%|106.0|\n|google\u002Fgemini-3.1-flash-lite-preview|8.2 %|91.8 %|99.6 %|62.6|\n|openai\u002Fgpt-5.4-pro-2026-03-05|8.3 %|91.7 %|100.0 %|148.5|\n|openai\u002Fgpt-5.2-low-2025-12-11|8.4 %|91.6 %|100.0 %|126.5|\n|MiniMaxAI\u002Fminimax-m2p5|9.1 %|90.9 %|98.2 %|137.2|\n|CohereLabs\u002Fcommand-a-03-2025|9.3 %|90.7 %|97.6 %|101.7|\n|zai-org\u002FGLM-4.5-AIR-FP8|9.3 %|90.7 %|98.1 %|70.6|\n|qwen\u002Fqwen3-235b-a22b|9.3 %|90.7 %|94.9 %|105.6|\n|qwen\u002Fqwen3-next-80b-a3b-thinking|9.3 %|90.7 %|94.4 %|70.9|\n|zai-org\u002FGLM-4.7-flash|9.3 %|90.7 %|91.6 %|71.8|\n|CohereLabs\u002Fc4ai-aya-expanse-8b|9.5 %|90.5 %|77.5 %|88.2|\n|zai-org\u002FGLM-4.6|9.5 %|90.5 %|94.5 %|77.2|\n|nvidia\u002FNemotron-3-Nano-30B-A3B|9.6 %|90.4 %|99.6 %|104.2|\n|openai\u002Fgpt-4o-2024-08-06|9.6 %|90.4 %|93.8 %|86.6|\n|ai21labs\u002Fjamba-large-1.7-2025-07|9.7 %|90.3 %|98.9 %|124.8|\n|anthropic\u002Fclaude-haiku-4-5-20251001|9.8 %|90.2 %|99.5 %|115.1|\n|zai-org\u002Fglm-5|10.1 %|89.9 %|99.7 %|74.4|\n|anthropic\u002Fclaude-sonnet-4-20250514|10.3 %|89.7 %|98.6 %|145.8|\n|google\u002Fgemini-3.1-pro-preview|10.4 %|89.6 %|99.4 %|107.7|\n|qwen\u002Fqwen3.5-flash-2026-02-23|10.5 %|89.5 %|99.8 %|95.0|\n|qwen\u002Fqwen3.5-35b-a3b|10.5 %|89.5 %|99.8 %|94.9|\n|openai\u002Fgpt-5-nano-2025-08-07|10.5 %|89.5 %|100.0 %|105.7|\n|anthropic\u002Fclaude-sonnet-4-6|10.6 %|89.4 %|99.9 %|114.7|\n|ibm-granite\u002Fgranite-3.3-8b-instruct|10.6 %|89.4 %|100.0 %|131.4|\n|qwen\u002Fqwen3.5-plus-2026-02-15|10.7 %|89.3 %|99.8 %|92.1|\n|openai\u002Fgpt-5.2-high-2025-12-11|10.8 %|89.2 %|100.0 %|186.3|\n|anthropic\u002Fclaude-opus-4-5-20251101|10.9 %|89.1 %|98.7 %|114.5|\n|CohereLabs\u002Fc4ai-aya-expanse-32b|10.9 %|89.1 %|99.8 %|112.7|\n|openai\u002Fgpt-5.1-low-2025-11-13|10.9 %|89.1 %|100.0 %|165.5|\n|qwen\u002Fqwen3.5-122b-a10b|11.2 %|88.8 %|99.8 %|86.4|\n|deepseek-ai\u002FDeepSeek-R1|11.3 %|88.7 %|97.0 %|93.5|\n|zai-org\u002Fglm-4p7|11.7 %|88.3 %|99.8 %|70.6|\n|anthropic\u002Fclaude-opus-4-1-20250805|11.8 %|88.2 %|92.4 %|129.1|\n|MiniMaxAI\u002Fminimax-m2p1|11.8 %|88.2 %|98.5 %|106.9|\n|anthropic\u002Fclaude-opus-4-20250514|12.0 %|88.0 %|91.0 %|123.2|\n|anthropic\u002Fclaude-sonnet-4-5-20250929|12.0 %|88.0 %|95.6 %|127.8|\n|qwen\u002Fqwen3.5-27b|12.1 %|87.9 %|99.8 %|94.4|\n|openai\u002Fgpt-5.1-high-2025-11-13|12.1 %|87.9 %|100.0 %|254.4|\n|anthropic\u002Fclaude-opus-4-6|12.2 %|87.8 %|99.8 %|137.6|\n|inceptionlabs\u002Fmercury-2|12.3 %|87.7 %|100.0 %|149.1|\n|openai\u002Fgpt-5-mini-2025-08-07|12.9 %|87.1 %|99.9 %|169.7|\n|google\u002Fgemini-3-flash-preview|13.5 %|86.5 %|99.8 %|90.2|\n|google\u002Fgemini-3-pro-preview|13.6 %|86.4 %|99.4 %|101.9|\n|moonshotai\u002FKimi-K2.5|14.2 %|85.8 %|92.2 %|112.0|\n|openai\u002Fgpt-oss-120b|14.2 %|85.8 %|99.9 %|135.2|\n|mistralai\u002Fmistral-3-large-2512|14.5 %|85.5 %|98.8 %|112.7|\n|ai21labs\u002Fjamba-mini-1.7-2025-07|14.7 %|85.3 %|99.1 %|136.4|\n|openai\u002Fgpt-5-minimal-2025-08-07|14.7 %|85.3 %|99.9 %|109.7|\n|openai\u002Fgpt-5-high-2025-08-07|15.1 %|84.9 %|99.9 %|162.7|\n|xai-org\u002Fgrok-4-1-fast-non-reasoning|17.8 %|82.2 %|98.5 %|87.5|\n|moonshotai\u002FKimi-K2-Instruct-0905|17.9 %|82.1 %|98.6 %|59.2|\n|openai\u002Fo4-mini-low-2025-04-16|18.6 %|81.4 %|98.7 %|130.9|\n|openai\u002Fo4-mini-high-2025-04-16|18.6 %|81.4 %|99.2 %|127.7|\n|xai-org\u002Fgrok-4-1-fast-reasoning|19.2 %|80.8 %|99.7 %|99.5|\n|mistralai\u002Fministral-3-14b-2512|19.4 %|80.6 %|99.6 %|135.8|\n|xai-org\u002Fgrok-4-fast-non-reasoning|19.7 %|80.3 %|99.2 %|141.9|\n|xai-org\u002Fgrok-4-fast-reasoning|20.2 %|79.8 %|99.5 
%|173.9|\n|mistralai\u002Fministral-3-8b-2512|21.7 %|78.3 %|99.1 %|139.4|\n|mistralai\u002Fmistral-medium-2508|22.7 %|77.3 %|99.7 %|142.9|\n|openai\u002Fo3-pro|23.3 %|76.7 %|100.0 %|127.4|\n|microsoft\u002FPhi-4-mini-instruct|23.5 %|76.5 %|92.5 %|420.2|\n|mistralai\u002Fministral-3-3b-2512|24.2 %|75.8 %|74.3 %|119.4|\n\n\u003C!-- LEADERBOARD_END -->\n\n## Model\nThis leaderboard uses HHEM-2.3, Vectara's commercial hallucination evaluation model, to compute the LLM rankings. You can find an open-source variant of that model, HHEM-2.1-Open, on [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fvectara\u002Fhallucination_evaluation_model) and [Kaggle](https:\u002F\u002Fwww.kaggle.com\u002Fmodels\u002Fvectara\u002Fhallucination_evaluation_model).\n\n## Dataset\nThe dataset used for this leaderboard is carefully curated as follows:\n* Not publicly available, to avoid overfitting by any LLM\n* Contains over 7700 articles from a variety of sources including: news, technology, science, medicine, legal, sports, business and education.\n* Contains articles of both low and high complexity, ranging from as short as 50 words to as long as 24K words.\n\n## Prior Research\nMuch prior work in this area has been done. For some of the top papers in this area (factual consistency in summarization) please see here:\n\n* [SUMMAC: Re-Visiting NLI-based Models for Inconsistency Detection in Summarization](https:\u002F\u002Faclanthology.org\u002F2022.tacl-1.10.pdf)\n* [TRUE: Re-evaluating Factual Consistency Evaluation](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04991.pdf)\n* [TrueTeacher: Learning Factual Consistency Evaluation with Large Language Models](https:\u002F\u002Fbrowse.arxiv.org\u002Fpdf\u002F2305.11171v1.pdf)\n* [ALIGNSCORE: Evaluating Factual Consistency with A Unified Alignment Function](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.16739.pdf)\n* [MiniCheck: Efficient Fact-Checking of LLMs on Grounding Documents](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.10774)\n* [TOFUEVAL: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.13249)\n* [RAGTruth: A Hallucination Corpus for Developing Trustworthy Retrieval-Augmented Language Models](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.585.pdf)\n* [FaithBench: A Diverse Hallucination Benchmark for Summarization by Modern LLMs](https:\u002F\u002Faclanthology.org\u002F2025.naacl-short.38.pdf)\n\nFor a very comprehensive list, please see here - https:\u002F\u002Fgithub.com\u002FEdinburghNLP\u002Fawesome-hallucination-detection. The methods described in the following section use protocols established in those papers, amongst many others.\n\n## Methodology\nFor a detailed explanation of the work that went into this model, please refer to our blog posts:\n* [Cut the Bull…. Detecting Hallucinations in Large Language Models](https:\u002F\u002Fvectara.com\u002Fblog\u002Fcut-the-bull-detecting-hallucinations-in-large-language-models\u002F).\n* [Introducing the Next Generation of Vectara's Hallucination Leaderboard](https:\u002F\u002Fvectara.com\u002Fblog\u002FTBD)\n\nTo build this leaderboard, we fed the full set of documents in the dataset to each of the LLMs and asked them to summarize each document, using only the facts presented in the document. We then computed the overall factual consistency rate (no hallucinations) and hallucination rate (100 minus the factual consistency rate) for each model. The rate at which each model refuses to respond to the prompt is detailed in the 'Answer Rate' column. 
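\n\nConcretely, the aggregation behind those columns is simple once per-summary judgments exist. The Python sketch below is illustrative only: the 0.5 consistency threshold and the score format are assumptions made for the example, not Vectara's exact production pipeline.\n\n```python\n# Illustrative aggregation of the leaderboard columns from per-summary\n# HHEM scores. Assumes each score lies in [0, 1] and that >= 0.5 counts\n# as 'factually consistent'; both are assumptions, not the production\n# thresholds used for the table above.\nfrom statistics import mean\n\ndef aggregate(scores, refusals, summary_word_counts):\n    # scores: one HHEM score per answered summary (higher = more consistent)\n    # refusals: number of documents the model declined to summarize\n    consistent = [s >= 0.5 for s in scores]\n    factual_consistency_rate = 100.0 * mean(consistent)\n    answer_rate = 100.0 * len(scores) \u002F (len(scores) + refusals)\n    return {\n        'hallucination_rate': round(100.0 - factual_consistency_rate, 1),\n        'factual_consistency_rate': round(factual_consistency_rate, 1),\n        'answer_rate': round(answer_rate, 1),\n        'average_summary_length': round(mean(summary_word_counts), 1),\n    }\n```\n\n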
None of the content sent to the models contained illicit or 'not safe for work' content, but the presence of trigger words was enough to trip some of the content filters. We used a **temperature of 0** when calling the LLMs, except where that was impossible or not available.\n\nWe evaluate summarization factual consistency rate instead of overall factual accuracy because it allows us to compare the model's response to the provided information. In other words, is the provided summary 'factually consistent' with the source document? Determining hallucinations is impossible to do for any ad hoc question, as it's not known precisely what data every LLM was trained on. In addition, having a model that can determine whether any response was hallucinated without a reference source requires solving the hallucination problem and presumably training a model as large as or larger than the LLMs being evaluated. So we instead chose to look at the hallucination rate within the summarization task, as this is a good analogue for how truthful the models are overall. \n\nIn addition, LLMs are increasingly used in RAG (Retrieval Augmented Generation) and agentic pipelines to answer user queries, where the model is deployed as a summarizer of the search results, so this leaderboard is also a good indicator of the accuracy of the models when used in RAG or agentic systems.\n\n## Prompt Used\n> Your task is to provide a concise and factual summary for the given passage.\n> Rules\n> 1. Summarize using only the information in the given passage. Do not infer. Do not use your internal knowledge.\n> 2. Do not provide a preamble or explanation, output only the summary.\n> 3. Summaries should never exceed 20 percent of the original text's length.\n> 4. Maintain the tone of the passage.\n> If you are unable to summarize the text due to missing, unreadable, irrelevant or insufficient content, respond only with:\n> \"I am unable to summarize this text.\"\n> Here is the passage:\n> &lt;PASSAGE&gt;\n\n\nWhen calling the API, the &lt;PASSAGE&gt; variable was then replaced with the source document.\n\n## API Integration Details\nBelow is a detailed overview of the models integrated and their specific endpoints:\n\n### Anthropic Models\n- **Claude Sonnet 4, Claude Opus 4**: Invoked using `claude-sonnet-4-20250514` and `claude-opus-4-20250514`.\n- **Claude Opus 4.1**: Invoked using `claude-opus-4-1-20250805`.\n- **Claude Sonnet 4.5, Claude Haiku 4.5**: Invoked using `claude-sonnet-4-5-20250929` and `claude-haiku-4-5-20251001`.\nDetails on each model can be found on their [website](https:\u002F\u002Fdocs.anthropic.com\u002Fclaude\u002Fdocs\u002Fmodels-overview).\n\n### Cohere Models\n- **Cohere Command R**: Employed using the model `command-r-08-2024` and the `\u002Fchat` endpoint.\n- **Cohere Command R Plus**: Employed using the model `command-r-plus-08-2024` and the `\u002Fchat` endpoint.\n- **Aya Expanse 8B, 32B**: Accessed using models `c4ai-aya-expanse-8b` and `c4ai-aya-expanse-32b`.\n- **Cohere Command A**: Employed using the model `command-a-03-2025` and the `\u002Fchat` endpoint.\nFor more information about Cohere's models, refer to their [website](https:\u002F\u002Fdocs.cohere.com\u002Fdocs\u002Fmodels).\n\n### DeepSeek Models\n- **DeepSeek V3**: Accessed via Hugging Face inference provider.\n- **DeepSeek V3.1**: Accessed via Hugging Face inference provider.\n- **DeepSeek V3.2-Exp**: Accessed via Hugging Face inference provider.\n- **DeepSeek R1**: Accessed via Hugging Face inference 
provider.\n\n### Google Closed-Source Models via Vertex AI\n- **Gemini 2.5 Pro, Gemini 2.5 Flash and Gemini 2.5 Flash Lite**: Accessed using models `gemini-2.5-pro`, `gemini-2.5-flash` and `gemini-2.5-flash-lite` on Vertex AI. \n\nFor an in-depth understanding of each model's version and lifecycle, especially those offered by Google, please refer to [Model Versions and Lifecycles](https:\u002F\u002Fcloud.google.com\u002Fvertex-ai\u002Fdocs\u002Fgenerative-ai\u002Flearn\u002Fmodel-versioning) on Vertex AI.\n\n### IBM Models\n- **Granite-3.3-Instruct 8B**: The model is accessed via the Replicate API.\n- **Granite-4.0-h-small**: The model is accessed via the Replicate API.\n\n\n### Llama Models\n- **Llama 3.3 70B Instruct Turbo**: Accessed via Together AI.\n- **Llama 4 Maverick 17B 128E Instruct FP8**: Accessed via Together AI.\n- **Llama 4 Scout 17B 16E Instruct**: Accessed via Together AI.\n\n\n### Microsoft Models\n- **Microsoft Phi-4\u002FPhi-4-Mini**: The [phi-4](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Fphi-4) and [phi-4-mini](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FPhi-4-mini-instruct) are accessed via Azure.\n\n\n### Mistral AI Models\n- **Mistral Ministral 3B**: Accessed via Mistral AI's API using the model `ministral-3b-2410`.\n- **Mistral Ministral 8B**: Accessed via Mistral AI's API using the model `ministral-8b-2410`.\n- **Mistral Large**: Accessed via Mistral AI's API using the model `mistral-large-2411`.\n- **Mistral Medium**: Accessed via Mistral AI's API using the model `mistral-medium-2508`.\n- **Mistral Small**: Accessed via Mistral AI's API using the model `mistral-small-2501`.\n\n### Moonshot AI Models\n- **Kimi-K2-Instruct-0905**: Accessed via the Moonshot AI API.\n\n### OpenAI Models\n- **GPT-4.1 2025-04-14**: Accessed via the OpenAI API.\n- **GPT-4o 2024-08-06**: Accessed via the OpenAI API.\n- **GPT-5-High 2025-08-07**: Accessed via the OpenAI API.\n- **GPT-5-Mini 2025-08-07**: Accessed via the OpenAI API.\n- **GPT-5-Minimal 2025-08-07**: Accessed via the OpenAI API.\n- **GPT-5-Nano 2025-08-07**: Accessed via the OpenAI API.\n- **GPT-OSS-120B**: Accessed via the Together AI API.\n- **o3-Pro**: Accessed via the OpenAI API.\n- **o4-Mini-High 2025-04-16**: Accessed via the OpenAI API.\n- **o4-Mini-Low 2025-04-16**: Accessed via the OpenAI API.\n\n### Qwen Models\n- **Qwen3-4b, Qwen3-8b, Qwen3-14b, Qwen3-32b**: Accessed through the DashScope API.\n- **Qwen3-Next-80b-a3b-thinking**: Accessed through the DashScope API.\n\n### Snowflake Models\n- **Snowflake-Arctic-Instruct**: The model is accessed via the Replicate API.\n\n### xAI Models\n- **Grok-3**: Accessed via xAI's API.\n- **Grok-4-Fast-Reasoning**: Accessed via xAI's API.\n- **Grok-4-Fast-Non-Reasoning**: Accessed via xAI's API.\n\n### Zhipu AI Models\n- **GLM-4.5-AIR-FP8**: Accessed via Together AI.\n- **GLM-4.6**: Accessed via DeepInfra.\n\n## Frequently Asked Questions\n* **Question** Why are you using a model to evaluate a model?\n* **Answer** There are several reasons we chose to do this over a human evaluation. While we could have crowdsourced a large-scale human evaluation, that would be a one-time exercise; it does not scale in a way that allows us to constantly update the leaderboard as new APIs come online or models get updated. We work in a fast-moving field, so any such process would be out of date as soon as it was published. Secondly, we wanted a repeatable process that we can share with others so they can use it themselves as one of many LLM quality scores they use when evaluating their own models. 
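\n\nAs a sketch of that repeatability: the open HHEM-2.1-Open variant can score (source, summary) pairs locally. The snippet below follows the usage shown on the Hugging Face model card; treat the exact `predict()` interface as something to verify against the current card rather than a guarantee.\n\n```python\n# Minimal sketch: scoring source\u002Fsummary pairs with the open HHEM\n# variant. trust_remote_code and predict() follow the Hugging Face\n# model card for vectara\u002Fhallucination_evaluation_model; verify\n# against the current card before relying on this interface.\nfrom transformers import AutoModelForSequenceClassification\n\nmodel = AutoModelForSequenceClassification.from_pretrained(\n    'vectara\u002Fhallucination_evaluation_model', trust_remote_code=True\n)\npairs = [\n    ('The sky was blue all day.', 'The sky was red.'),  # toy example\n]\nscores = model.predict(pairs)  # one consistency score per pair, in [0, 1]\n```\n\n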
This would not be possible with a human annotation process, where the only things that could be shared are the process and the human labels. It's also worth pointing out that building a model for detecting hallucinations is **much easier** than building a generative model that never produces hallucinations. So long as the hallucination evaluation model is highly correlated with human raters' judgements, it can stand in as a good proxy for human judges. As we are specifically targeting summarization and not general 'closed book' question answering, the LLM we trained does not need to have memorized a large proportion of human knowledge; it just needs a solid grasp and understanding of the languages it supports (currently just English, but we plan to expand language coverage over time).\n\n* **Question** What if the LLM refuses to summarize the document or provides a one or two word answer?\n* **Answer** We explicitly filter these out. See our [blog post](https:\u002F\u002Fvectara.com\u002Fblog\u002Fcut-the-bull-detecting-hallucinations-in-large-language-models\u002F) for more information. You can see the 'Answer Rate' column on the leaderboard that indicates the percentage of documents summarized, and the 'Average Summary Length' column detailing the summary lengths, showing we didn't get very short answers for most documents.\n\n* **Question** What version of model XYZ did you use?\n* **Answer** Please see the API details section for specifics about the model versions used and how they were called, as well as the date the leaderboard was last updated. Please contact us (create an issue in the repo) if you need more clarity. \n\n* **Question** Can't a model just score a 100% by providing either no answers or very short answers?\n* **Answer** We explicitly filtered out such responses from every model, doing the final evaluation only on documents that all models provided a summary for. You can find more technical details in our [blog post](https:\u002F\u002Fvectara.com\u002Fblog\u002Fcut-the-bull-detecting-hallucinations-in-large-language-models\u002F) on the topic. See also the 'Answer Rate' and 'Average Summary Length' columns in the table above.\n\n* **Question** Wouldn't an extractive summarizer model that just copies and pastes from the original document score 100% (0 hallucination) on this task?\n* **Answer** Absolutely, as by definition such a model would have no hallucinations and provide a faithful summary. We do not claim to be evaluating summarization quality; that is a separate and **orthogonal** task, and should be evaluated independently. We are **not** evaluating the quality of the summaries, only the **factual consistency** of them, as we point out in the [blog post](https:\u002F\u002Fvectara.com\u002Fblog\u002Fcut-the-bull-detecting-hallucinations-in-large-language-models\u002F).\n\n* **Question** This seems a very hackable metric, as you could just copy the original text as the summary\n* **Answer** That's true, but we are not evaluating arbitrary models with this approach, e.g. as in a Kaggle competition. Any model that did so would perform poorly at any other task you care about. So I would consider this a quality metric that you'd run alongside whatever other evaluations you have for your model, such as summarization quality, question answering accuracy, etc. But we do not recommend using this as a standalone metric. None of the models chosen were trained on our model's output. 
That may happen in the future, but as we plan to update both the model and the source documents so that this is a living leaderboard, it will be an unlikely occurrence. That is, however, also an issue with any LLM benchmark. We should also point out this builds on a large body of work on factual consistency, in which many other academics invented and refined this protocol. See our references to the SummaC and True papers in this [blog post](https:\u002F\u002Fvectara.com\u002Fblog\u002Fcut-the-bull-detecting-hallucinations-in-large-language-models\u002F), and read more in this excellent compilation of resources: https:\u002F\u002Fgithub.com\u002FEdinburghNLP\u002Fawesome-hallucination-detection.\n\n* **Question** This does not definitively measure all the ways a model can hallucinate\n* **Answer** Agreed. We do not claim to have solved the problem of hallucination detection, and plan to expand and enhance this process further. But we do believe it is a move in the right direction, and provides a much-needed starting point that everyone can build on top of.\n\n* **Question** Some models could hallucinate only while summarizing. Couldn't you just provide a model with a list of well-known facts and check how well it can recall them?\n* **Answer** That would be a poor test in my opinion. For one thing, unless you trained the model, you don't know the data it was trained on, so you can't be sure whether the model is grounding its response in real data it has seen or whether it is guessing. Additionally, there is no clear definition of 'well known', and these types of data are typically easy for most models to accurately recall. Most hallucinations, in my admittedly subjective experience, come from the model fetching information that is very rarely known or discussed, or facts for which the model has seen conflicting information. Without knowing the source data the model was trained on, again it's impossible to validate these sorts of hallucinations, as you won't know which data fits this criterion. I also think it's unlikely the model would only hallucinate while summarizing. We are asking the model to take information and transform it in a way that is still faithful to the source. This is analogous to a lot of generative tasks aside from summarization (e.g. write an email covering these points...), and if the model deviates from the prompt then that is a failure to follow instructions, indicating the model would struggle on other instruction-following tasks as well.\n\n* **Question** This is a good start but far from definitive\n* **Answer** We agree. There's a lot more that needs to be done, and the problem is far from solved. 
But a 'good start' means that hopefully progress will start to be made in this area, and by open-sourcing the model, we hope to involve the community in taking this to the next level.\n\n## Additional Resources\n* Check out our [Open-RAG-Eval](https:\u002F\u002Fgithub.com\u002Fvectara\u002Fopen-rag-eval): an open-source RAG evaluation framework that uses HHEM but also provides metrics for retrieval, groundedness, and citations.\n* Check out the commercial version of the HHEM model (see the [API docs](https:\u002F\u002Fdocs.vectara.com\u002Fdocs\u002Frest-api\u002Fevaluate-factual-consistency)), which has better detection characteristics and supports multiple languages.\n* For more information about Vectara or if you want to schedule a demo, contact us [here](https:\u002F\u002Fwww.vectara.com\u002Fcontact-us).\n\n\u003Cimg referrerpolicy=\"no-referrer-when-downgrade\" src=\"https:\u002F\u002Fstatic.scarf.sh\u002Fa.png?x-pxid=5f53f560-5ba6-4e73-917b-c7049e9aea2c\" \u002F>\n","# 幻觉排行榜\n\n由 Vectara 的幻觉评估模型（HHEM）计算得出的公开大型语言模型排行榜。该模型用于评估大型语言模型在总结文档时引入幻觉的频率。随着我们的模型和各语言模型的不断更新，我们计划定期对本排行榜进行更新。\n\n欢迎前往 Hugging Face 查看[交互式幻觉排行榜](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fvectara\u002Fleaderboard)。\n\n如果您对本排行榜的早期版本感兴趣：\n\n1. 基于 HHEM-1.0 的第一个版本可在此处查看：[HHEM-1.0 最终版](https:\u002F\u002Fgithub.com\u002Fvectara\u002Fhallucination-leaderboard\u002Ftree\u002Fhhem-1.0-final)\n2. 基于旧数据集的最新版本可在此处查看：[HHEM-2.3 旧数据集版](https:\u002F\u002Fgithub.com\u002Fvectara\u002Fhallucination-leaderboard\u002Ftree\u002Fhhem-2.3-old-dataset)\n\n\u003Ctable style=\"border-collapse: collapse;\">\n  \u003Ctr>\n    \u003Ctd style=\"text-align: center; vertical-align: middle; border: none;\">\n      \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fvectara_hallucination-leaderboard_readme_bb7efaedaefe.png\" width=\"50\" height=\"50\">\n    \u003C\u002Ftd>\n    \u003Ctd style=\"text-align: left; vertical-align: middle; border: none;\">\n      深切缅怀 \u003Ca href=\"https:\u002F\u002Fwww.ivinsfuneralhome.com\u002Fobituaries\u002FSimon-Mark-Hughes?obId=30000023\">西蒙·马克·休斯\u003C\u002Fa>……\n    \u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n\u003C!-- LEADERBOARD_START -->\n最后更新日期：2026年3月20日\n\n![图表：各类 LLM 的幻觉率](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fvectara_hallucination-leaderboard_readme_a6baf24d4e13.png)\n\n|模型|幻觉率|事实一致性率|回答率|平均摘要长度（词数）|\n|----|----:|----:|----:|----:|\n|antgroup\u002Ffinix_s1_32b|1.8 %|98.2 %|99.5 %|172.4|\n|openai\u002Fgpt-5.4-nano-2026-03-17|3.1 %|96.9 %|100.0 %|144.4|\n|google\u002Fgemini-2.5-flash-lite|3.3 %|96.7 %|99.5 %|95.7|\n|microsoft\u002FPhi-4|3.7 %|96.3 %|80.7 %|120.9|\n|meta-llama\u002FLlama-3.3-70B-Instruct-Turbo|4.1 %|95.9 %|99.5 %|64.6|\n|snowflake\u002Fsnowflake-arctic-instruct|4.3 %|95.7 %|62.7 %|81.4|\n|google\u002Fgemma-3-12b-it|4.4 %|95.6 %|97.4 %|89.7|\n|mistralai\u002Fmistral-large-2411|4.5 %|95.5 %|99.9 %|85.0|\n|qwen\u002Fqwen3-8b|4.8 %|95.2 %|99.9 %|83.6|\n|amazon\u002Fnova-pro-v1:0|5.1 %|94.9 %|99.3 %|66.2|\n|amazon\u002Fnova-2-lite-v1:0|5.1 %|94.9 %|99.6 %|94.1|\n|mistralai\u002Fmistral-small-2501|5.1 %|94.9 %|97.9 %|98.8|\n|ibm-granite\u002Fgranite-4.0-h-small|5.2 %|94.8 %|100.0 %|107.4|\n|ai21labs\u002Fjamba-mini-2|5.3 %|94.7 %|99.6 %|109.4|\n|deepseek-ai\u002FDeepSeek-V3.2-Exp|5.3 %|94.7 %|96.6 %|64.6|\n|qwen\u002Fqwen3-14b|5.4 %|94.6 %|99.9 %|111.1|\n|amazon\u002Fnova-micro-v1:0|5.5 %|94.5 %|100.0 %|100.0|\n|deepseek-ai\u002FDeepSeek-V3.1|5.5 %|94.5 %|94.5 %|63.7|\n|openai\u002Fgpt-5.4-mini-2026-03-17|5.5 %|94.5 %|100.0 
%|54.7|\n|openai\u002Fgpt-4.1-2025-04-14|5.6 %|94.4 %|99.9 %|91.7|\n|qwen\u002Fqwen3-4b|5.7 %|94.3 %|99.9 %|104.7|\n|xai-org\u002Fgrok-3|5.8 %|94.2 %|93.0 %|95.9|\n|qwen\u002Fqwen3-32b|5.9 %|94.1 %|99.9 %|115.8|\n|amazon\u002Fnova-lite-v1:0|6.1 %|93.9 %|99.9 %|91.8|\n|deepseek-ai\u002FDeepSeek-V3|6.1 %|93.9 %|97.5 %|81.7|\n|deepseek-ai\u002FDeepSeek-V3.2|6.3 %|93.7 %|92.6 %|62.0|\n|google\u002Fgemma-3-4b-it|6.4 %|93.6 %|67.3 %|77.4|\n|CohereLabs\u002Fcommand-r-plus-08-2024|6.9 %|93.1 %|95.0 %|91.5|\n|arcee-ai\u002Ftrinity-large-preview|6.9 %|93.1 %|99.0 %|117.3|\n|openai\u002Fgpt-5.4-2026-03-05|7.0 %|93.0 %|99.9 %|81.7|\n|google\u002Fgemini-2.5-pro|7.0 %|93.0 %|99.1 %|106.4|\n|mistralai\u002Fministral-3b-2410|7.3 %|92.7 %|99.9 %|167.9|\n|google\u002Fgemma-3-27b-it|7.4 %|92.6 %|98.8 %|96.4|\n|mistralai\u002Fministral-8b-2410|7.4 %|92.6 %|99.9 %|196.0|\n|meta-llama\u002FLlama-4-Scout-17B-16E-Instruct|7.7 %|92.3 %|99.0 %|137.3|\n|google\u002Fgemini-2.5-flash|7.8 %|92.2 %|99.0 %|101.5|\n|meta-llama\u002FLlama-4-Maverick-17B-128E-Instruct-FP8|8.2 %|91.8 %|100.0 %|106.0|\n|google\u002Fgemini-3.1-flash-lite-preview|8.2 %|91.8 %|99.6 %|62.6|\n|openai\u002Fgpt-5.4-pro-2026-03-05|8.3 %|91.7 %|100.0 %|148.5|\n|openai\u002Fgpt-5.2-low-2025-12-11|8.4 %|91.6 %|100.0 %|126.5|\n|MiniMaxAI\u002Fminimax-m2p5|9.1 %|90.9 %|98.2 %|137.2|\n|CohereLabs\u002Fcommand-a-03-2025|9.3 %|90.7 %|97.6 %|101.7|\n|zai-org\u002FGLM-4.5-AIR-FP8|9.3 %|90.7 %|98.1 %|70.6|\n|qwen\u002Fqwen3-235b-a22b|9.3 %|90.7 %|94.9 %|105.6|\n|qwen\u002Fqwen3-next-80b-a3b-thinking|9.3 %|90.7 %|94.4 %|70.9|\n|zai-org\u002FGLM-4.7-flash|9.3 %|90.7 %|91.6 %|71.8|\n|CohereLabs\u002Fc4ai-aya-expanse-8b|9.5 %|90.5 %|77.5 %|88.2|\n|zai-org\u002FGLM-4.6|9.5 %|90.5 %|94.5 %|77.2|\n|nvidia\u002FNemotron-3-Nano-30B-A3B|9.6 %|90.4 %|99.6 %|104.2|\n|openai\u002Fgpt-4o-2024-08-06|9.6 %|90.4 %|93.8 %|86.6|\n|ai21labs\u002Fjamba-large-1.7-2025-07|9.7 %|90.3 %|98.9 %|124.8|\n|anthropic\u002Fclaude-haiku-4-5-20251001|9.8 %|90.2 %|99.5 %|115.1|\n|zai-org\u002Fglm-5|10.1 %|89.9 %|99.7 %|74.4|\n|anthropic\u002Fclaude-sonnet-4-20250514|10.3 %|89.7 %|98.6 %|145.8|\n|google\u002Fgemini-3.1-pro-preview|10.4 %|89.6 %|99.4 %|107.7|\n|qwen\u002Fqwen3.5-flash-2026-02-23|10.5 %|89.5 %|99.8 %|95.0|\n|qwen\u002Fqwen3.5-35b-a3b|10.5 %|89.5 %|99.8 %|94.9|\n|openai\u002Fgpt-5-nano-2025-08-07|10.5 %|89.5 %|100.0 %|105.7|\n|anthropic\u002Fclaude-sonnet-4-6|10.6 %|89.4 %|99.9 %|114.7|\n|ibm-granite\u002Fgranite-3.3-8b-instruct|10.6 %|89.4 %|100.0 %|131.4|\n|qwen\u002Fqwen3.5-plus-2026-02-15|10.7 %|89.3 %|99.8 %|92.1|\n|openai\u002Fgpt-5.2-high-2025-12-11|10.8 %|89.2 %|100.0 %|186.3|\n|anthropic\u002Fclaude-opus-4-5-20251101|10.9 %|89.1 %|98.7 %|114.5|\n|CohereLabs\u002Fc4ai-aya-expanse-32b|10.9 %|89.1 %|99.8 %|112.7|\n|openai\u002Fgpt-5.1-low-2025-11-13|10.9 %|89.1 %|100.0 %|165.5|\n|qwen\u002Fqwen3.5-122b-a10b|11.2 %|88.8 %|99.8 %|86.4|\n|deepseek-ai\u002FDeepSeek-R1|11.3 %|88.7 %|97.0 %|93.5|\n|zai-org\u002Fglm-4p7|11.7 %|88.3 %|99.8 %|70.6|\n|anthropic\u002Fclaude-opus-4-1-20250805|11.8 %|88.2 %|92.4 %|129.1|\n|MiniMaxAI\u002Fminimax-m2p1|11.8 %|88.2 %|98.5 %|106.9|\n|anthropic\u002Fclaude-opus-4-20250514|12.0 %|88.0 %|91.0 %|123.2|\n|anthropic\u002Fclaude-sonnet-4-5-20250929|12.0 %|88.0 %|95.6 %|127.8|\n|qwen\u002Fqwen3.5-27b|12.1 %|87.9 %|99.8 %|94.4|\n|openai\u002Fgpt-5.1-high-2025-11-13|12.1 %|87.9 %|100.0 %|254.4|\n|anthropic\u002Fclaude-opus-4-6|12.2 %|87.8 %|99.8 %|137.6|\n|inceptionlabs\u002Fmercury-2|12.3 %|87.7 %|100.0 
%|149.1|\n|openai\u002Fgpt-5-mini-2025-08-07|12.9 %|87.1 %|99.9 %|169.7|\n|google\u002Fgemini-3-flash-preview|13.5 %|86.5 %|99.8 %|90.2|\n|google\u002Fgemini-3-pro-preview|13.6 %|86.4 %|99.4 %|101.9|\n|moonshotai\u002FKimi-K2.5|14.2 %|85.8 %|92.2 %|112.0|\n|openai\u002Fgpt-oss-120b|14.2 %|85.8 %|99.9 %|135.2|\n|mistralai\u002Fmistral-3-large-2512|14.5 %|85.5 %|98.8 %|112.7|\n|ai21labs\u002Fjamba-mini-1.7-2025-07|14.7 %|85.3 %|99.1 %|136.4|\n|openai\u002Fgpt-5-minimal-2025-08-07|14.7 %|85.3 %|99.9 %|109.7|\n|openai\u002Fgpt-5-high-2025-08-07|15.1 %|84.9 %|99.9 %|162.7|\n|xai-org\u002Fgrok-4-1-fast-non-reasoning|17.8 %|82.2 %|98.5 %|87.5|\n|moonshotai\u002FKimi-K2-Instruct-0905|17.9 %|82.1 %|98.6 %|59.2|\n|openai\u002Fo4-mini-low-2025-04-16|18.6 %|81.4 %|98.7 %|130.9|\n|openai\u002Fo4-mini-high-2025-04-16|18.6 %|81.4 %|99.2 %|127.7|\n|xai-org\u002Fgrok-4-1-fast-reasoning|19.2 %|80.8 %|99.7 %|99.5|\n|mistralai\u002Fministral-3-14b-2512|19.4 %|80.6 %|99.6 %|135.8|\n|xai-org\u002Fgrok-4-fast-non-reasoning|19.7 %|80.3 %|99.2 %|141.9|\n|xai-org\u002Fgrok-4-fast-reasoning|20.2 %|79.8 %|99.5 %|173.9|\n|mistralai\u002Fministral-3-8b-2512|21.7 %|78.3 %|99.1 %|139.4|\n|mistralai\u002Fmistral-medium-2508|22.7 %|77.3 %|99.7 %|142.9|\n|openai\u002Fo3-pro|23.3 %|76.7 %|100.0 %|127.4|\n|microsoft\u002FPhi-4-mini-instruct|23.5 %|76.5 %|92.5 %|420.2|\n|mistralai\u002Fministral-3-3b-2512|24.2 %|75.8 %|74.3 %|119.4|\n\n\u003C!-- LEADERBOARD_END -->\n\n## 模型\n本排行榜使用 Vectara 的商用幻觉评估模型 HHEM-2.3 来计算大语言模型的排名。您可以在 [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fvectara\u002Fhallucination_evaluation_model) 和 [Kaggle](https:\u002F\u002Fwww.kaggle.com\u002Fmodels\u002Fvectara\u002Fhallucination_evaluation_model) 上找到该模型的开源版本 HHEM-2.1-Open。\n\n## 数据集\n本排行榜所使用的数据集经过精心筛选，具体如下：\n* 不对外公开，以避免任何大语言模型过拟合。\n* 包含来自新闻、科技、科学、医学、法律、体育、商业和教育等多种来源的超过 7700 篇文章。\n* 文章涵盖低复杂度和高复杂度两种类型，篇幅从最短的 50 个词到最长的 24,000 个词不等。\n\n## 前人研究\n在这一领域已有大量前人研究。以下是该领域一些顶级论文（关于摘要中的事实一致性）的链接：\n\n* [SUMMAC：重新审视基于 NLI 的模型在摘要中检测不一致性的能力](https:\u002F\u002Faclanthology.org\u002F2022.tacl-1.10.pdf)\n* [TRUE：重新评估事实一致性评估](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.04991.pdf)\n* [TrueTeacher：利用大型语言模型学习事实一致性评估](https:\u002F\u002Fbrowse.arxiv.org\u002Fpdf\u002F2305.11171v1.pdf)\n* [ALIGNSCORE：通过统一对齐函数评估事实一致性](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2305.16739.pdf)\n* [MiniCheck：高效检查大语言模型在基础文档上的事实准确性](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.10774)\n* [TOFUEVAL：评估大语言模型在主题聚焦对话摘要中的幻觉现象](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.13249)\n* [RAGTruth：用于开发可信检索增强型语言模型的幻觉语料库](https:\u002F\u002Faclanthology.org\u002F2024.acl-long.585.pdf)\n* [FaithBench：现代大语言模型摘要任务中的多样化幻觉基准测试](https:\u002F\u002Faclanthology.org\u002F2025.naacl-short.38.pdf)\n\n如需更全面的列表，请参阅此链接：https:\u002F\u002Fgithub.com\u002FEdinburghNLP\u002Fawesome-hallucination-detection。以下章节中描述的方法采用了这些论文以及其他许多文献中确立的协议。\n\n## 方法论\n有关本模型研发工作的详细说明，请参阅我们的博客文章：\n* [直面问题……检测大语言模型中的幻觉](https:\u002F\u002Fvectara.com\u002Fblog\u002Fcut-the-bull-detecting-hallucinations-in-large-language-models\u002F)。\n* [推出新一代 Vectara 幻觉排行榜](https:\u002F\u002Fvectara.com\u002Fblog\u002FTBD)\n\n为构建本排行榜，我们将数据集中的所有文档输入到每个大语言模型中，并要求它们仅根据文档中提供的事实来总结每篇文章。随后，我们计算了每个模型的整体事实一致性率（即无幻觉率）以及幻觉率（100 减去事实一致性率）。各模型拒绝回答提示的比例则记录在“回答率”列中。发送给模型的内容均不含非法或“不适合工作场所”的内容，但触发词的存在足以激活部分内容过滤器。我们在调用大语言模型时使用 **温度为 
0**，除非无法实现或不可用。\n\n我们评估的是摘要的事实一致性率，而非整体事实准确性，因为这样可以比较模型对所提供信息的响应是否一致。换句话说，生成的摘要是否与源文档“事实一致”。对于任何临时性问题，要判断是否存在幻觉是不可能的，因为我们并不清楚每个大语言模型究竟训练了哪些数据。此外，若要让模型在没有参考文本的情况下判断其回应是否包含幻觉，就相当于需要解决幻觉问题本身，而这很可能需要训练一个与被评估的大语言模型规模相当甚至更大的模型。因此，我们选择关注摘要任务中的幻觉率，以此作为衡量模型整体真实性的良好指标。\n\n另外，大语言模型越来越多地应用于 RAG（检索增强生成）和代理式工作流中，用于回答用户查询，此时模型通常被部署为搜索结果的摘要生成器。因此，本排行榜也能很好地反映这些模型在 RAG 或代理系统中使用的准确性。\n\n## 使用的提示\n> 你的任务是为给定的文章提供简洁且基于事实的摘要。\n> 规则\n> 1. 仅使用给定文章中的信息进行摘要，不得进行推断，也不得使用你自身的知识储备。\n> 2. 不得添加前言或解释，只需输出摘要。\n> 3. 摘要长度不得超过原文长度的 20%。\n> 4. 保持原文的语气。\n> 如果因内容缺失、无法读取、无关或不足而无法完成摘要，请仅回复：\n> “我无法总结本文。”\n> 以下是文章：\n> &lt;PASSAGE&gt;\n\n\n在调用 API 时，&lt;PASSAGE&gt; 变量会被替换为相应的源文档。\n\n## API 集成详情\n以下是已集成模型及其具体端点的详细概述：\n\n### Anthropic 模型\n- **Claude Sonnet 4、Claude Opus 4**：分别使用 `claude-sonnet-4-20250514` 和 `claude-opus-4-20250514` 调用模型。\n- **Claude Opus 4.1**：使用 `claude-opus-4-1-20250805` 调用模型。\n- **Claude Sonnet 4.5、Claude Haiku 4.5**：分别使用 `claude-haiku-4-5-20251001` 和 `claude-sonnet-4-5-20250929` 调用模型。\n各模型的详细信息可在其 [官网](https:\u002F\u002Fdocs.anthropic.com\u002Fclaude\u002Fdocs\u002Fmodels-overview) 查阅。\n\n### Cohere 模型\n- **Cohere Command R**：使用模型 `command-r-08-2024` 和 `\u002Fchat` 端点调用。\n- **Cohere Command R Plus**：使用模型 `command-r-plus-08-2024` 和 `\u002Fchat` 端点调用。\n- **Aya Expanse 8B、32B**：分别使用模型 `c4ai-aya-expanse-8b` 和 `c4ai-aya-expanse-32b` 访问。\n- **Cohere Command A**：使用模型 `command-a-03-2025` 和 `\u002Fchat` 端点调用。\n有关 Cohere 模型的更多信息，请参阅其 [官网](https:\u002F\u002Fdocs.cohere.com\u002Fdocs\u002Fmodels)。\n\n### DeepSeek 模型\n- **DeepSeek V3**：通过 Hugging Face 推理服务提供商访问。\n- **DeepSeek V3.1**：通过 Hugging Face 推理服务提供商访问。\n- **DeepSeek V3.2-Exp**：通过 Hugging Face 推理服务提供商访问。\n- **DeepSeek R1**：通过 Hugging Face 推理服务提供商访问。\n\n### Google 闭源模型（通过 Vertex AI）\n- **Gemini 2.5 Pro、Gemini 2.5 Flash 和 Gemini 2.5 Flash Lite**：分别使用模型 `gemini-2.5-pro`、`gemini-2.5-flash` 和 `gemini-2.5-flash-lite` 在 Vertex AI 上访问。\n\n如需深入了解各模型的版本和生命周期，尤其是 Google 提供的模型，请参阅 Vertex AI 上的 [模型版本与生命周期](https:\u002F\u002Fcloud.google.com\u002Fvertex-ai\u002Fdocs\u002Fgenerative-ai\u002Flearn\u002Fmodel-versioning) 页面。\n\n### IBM 模型\n- **Granite-3.3-Instruct 8B**：通过 Replicate API 访问模型。\n- **Granite-4.0-h-small**：通过 Replicate API 访问模型。\n\n### Llama 模型\n- **Llama 3.3 70B Instruct Turbo**：通过 Together AI 访问。\n- **Llama 4 Maverick 17B 128E Instruct FP8**：通过 Together AI 访问。\n- **Llama 4 Scout 17B 16E Instruct**：通过 Together AI 访问。\n\n\n### 微软模型\n- **Microsoft Phi-4\u002FPhi-4-Mini**：[phi-4](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Fphi-4) 和 [phi-4-mini](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FPhi-4-mini-instruct) 通过 Azure 访问。\n\n\n### Mistral AI 模型\n- **Mistral Ministral 3B**：通过 Mistral AI 的 API，使用模型 `ministral-3b-2410` 访问。\n- **Mistral Ministral 8B**：通过 Mistral AI 的 API，使用模型 `ministral-8b-2410` 访问。\n- **Mistral Large**：通过 Mistral AI 的 API，使用模型 `mistral-large-2411` 访问。\n- **Mistral Medium**：通过 Mistral AI 的 API，使用模型 `mistral-medium-2508` 访问。\n- **Mistral Small**：通过 Mistral AI 的 API，使用模型 `mistral-small-2501` 访问。\n\n### Moonshot AI 模型\n- **Kimi-K2-Instruct-0905**：通过 Moonshot AI API 访问。\n\n### OpenAI 模型\n- **GPT-4.1 2025-04-14**：通过 OpenAI API 访问。\n- **GPT-4o 2024-08-06**：通过 OpenAI API 访问。\n- **GPT-5-High 2025-08-07**：通过 OpenAI API 访问。\n- **GPT-5-Mini 2025-08-07**：通过 OpenAI API 访问。\n- **GPT-5-Minimal 2025-08-07**：通过 OpenAI API 访问。\n- **GPT-5-Nano 2025-08-07**：通过 OpenAI API 访问。\n- **GPT-OSS-120B**：通过 Together AI API 访问。\n- **o3-Pro**：通过 OpenAI API 访问。\n- **o4-Mini-High 2025-04-16**：通过 OpenAI API 访问。\n- **o4-Mini-Low 2025-04-16**：通过 OpenAI API 访问。\n\n### 通义千问模型\n- 
**Qwen3-4b、Qwen3-8b、Qwen3-14b、Qwen3-32b**：通过 DashScope API 访问。\n- **Qwen3-Next-80b-a3b-thinking**：通过 DashScope API 访问。\n\n### Snowflake 模型\n- **Snowflake-Arctic-Instruct**：该模型通过 Replicate API 访问。\n\n### xAI 模型\n- **Grok-3**：通过 xAI 的 API 访问。\n- **Grok-4-Fast-Reasoning**：通过 xAI 的 API 访问。\n- **Grok-4-Fast-Non-Reasoning**：通过 xAI 的 API 访问。\n\n### 智谱 AI 模型\n- **GLM-4.5-AIR-FP8**：通过 Together AI 访问。\n- **GLM-4.6**：通过 DeepInfra 访问。\n\n## 常见问题解答\n* **问** 为什么你们要用一个模型来评估另一个模型？\n* **答** 我们选择这种方法而非人工评估，主要有以下几个原因。虽然我们本可以通过众包方式进行大规模的人工评估，但这种评估是一次性的，无法随着新API的上线或模型的更新而持续更新排行榜。由于我们所处的领域发展迅速，任何此类流程在发布后很快就会过时。其次，我们希望有一个可重复的过程，可以与他人共享，以便他们在评估自身模型时将其作为众多LLM质量评分之一使用。而人工标注过程则无法做到这一点，因为能够共享的只有流程和人工标注结果。此外，值得注意的是，构建用于检测幻觉的模型，比构建一个完全不会产生幻觉的生成式模型要**容易得多**。只要幻觉评估模型与人类评价者的判断高度相关，它就可以很好地替代人类评价者。由于我们的目标是摘要生成任务，而非通用的“闭卷”问答任务，因此我们训练的LLM并不需要记住大量人类知识，只需对其支持的语言（目前仅限英语，但我们计划逐步扩展语言覆盖范围）有扎实的理解即可。\n\n* **问** 如果LLM拒绝总结文档，或者只给出一两个词的回答，该怎么办？\n* **答** 我们会明确地将这些情况过滤掉。更多信息请参阅我们的[博客文章](https:\u002F\u002Fvectara.com\u002Fblog\u002Fcut-the-bull-detecting-hallucinations-in-large-language-models\u002F)。您可以在排行榜中看到“回答率”列，显示已成功总结的文档比例；以及“平均摘要长度”列，详细列出摘要的长度，这表明大多数文档并没有得到非常简短的答案。\n\n* **问** 你们使用了XYZ模型的哪个版本？\n* **答** 请查看API详情部分，了解所使用的模型版本及其调用方式，以及排行榜的最后更新日期。如果您需要更多说明，请联系我们（在仓库中创建一个问题）。\n\n* **问** 模型难道不能通过不提供答案或只提供极短的答案来获得100%的分数吗？\n* **答** 我们已经明确地从所有模型的结果中过滤掉了这类响应，最终的评估仅针对所有模型都提供了摘要的文档进行。更多技术细节请参阅我们的[博客文章](https:\u002F\u002Fvectara.com\u002Fblog\u002Fcut-the-bull-detecting-hallucinations-in-large-language-models\u002F)。同时也可以参考上表中的“回答率”和“平均摘要长度”列。\n\n* **问** 那么，一个仅仅从原文复制粘贴内容的抽取式摘要模型，在这项任务中不是也能得到100分（无幻觉）吗？\n* **答** 当然可以，因为根据定义，这种模型确实不会产生幻觉，并且能够提供忠实的摘要。但我们并不声称是在评估摘要的质量，那是一个独立且**正交**的任务，应该单独进行评估。我们**并非**在评估摘要的质量，而只是评估其**事实一致性**，这一点我们在[博客文章](https:\u002F\u002Fvectara.com\u002Fblog\u002Fcut-the-bull-detecting-hallucinations-in-large-language-models\u002F)中也已明确指出。\n\n* **问** 这种指标似乎很容易被“破解”，比如直接把原文当作摘要提交不就行了吗？\n* **答** 确实如此，但我们并不是在这种方法上评估任意模型，例如像Kaggle竞赛那样。任何采用这种方式的模型，在您关心的其他任务中都会表现不佳。因此，我认为这个指标应该与其他针对您模型的评估一起使用，比如摘要质量、问答准确率等。不过，我们并不建议将其作为唯一的评估指标。所有入选的模型都没有基于我们的模型输出进行训练。未来可能会出现这种情况，但由于我们计划不断更新模型和源文档，使排行榜保持动态更新，这种情况不太可能发生。当然，这也是所有LLM基准测试面临的问题。此外，这一方法建立在大量关于事实一致性的研究基础之上，许多学者已经发明并完善了这一评估协议。请参阅我们在[博客文章](https:\u002F\u002Fvectara.com\u002Fblog\u002Fcut-the-bull-detecting-hallucinations-in-large-language-models\u002F)中引用的SummaC和True论文，以及这份优秀的资源汇总——https:\u002F\u002Fgithub.com\u002FEdinburghNLP\u002Fawesome-hallucination-detection，以获取更多信息。\n\n* **问** 这个指标并不能全面衡量模型产生幻觉的所有方式。\n* **答** 我们同意这一点。我们并不声称已经解决了幻觉检测的问题，未来还将继续扩展和改进这一方法。但我们相信，这确实是朝着正确方向迈出的一步，为所有人提供了一个亟需的起点，供大家在此基础上进一步发展。\n\n* **问** 有些模型可能只在摘要生成时才会产生幻觉。那么，你们能否直接给它一份广为人知的事实清单，然后检查它回忆这些事实的能力呢？\n* **答** 在我看来，这样的测试并不理想。首先，除非您亲自训练过该模型，否则您并不清楚它所使用的训练数据，也就无法确定它的回答是否基于真实见过的数据，还是纯粹的猜测。此外，“广为人知”的标准本身也不明确，而且这类数据通常很容易被大多数模型准确回忆出来。根据我个人的主观经验，大多数幻觉往往出现在模型提取那些极少被提及或讨论的信息，或者面对相互矛盾的事实时。如果不知道模型的训练数据来源，就无法验证这些幻觉的真实性，因为您根本无法判断哪些数据符合这一标准。另外，我也认为模型不太可能只在摘要生成时才产生幻觉。我们要求模型对信息进行加工处理，同时保持与原文的一致性。这类似于许多非摘要类的生成任务（例如，围绕这些要点写一封邮件……）。如果模型偏离了指令，那就意味着它未能遵循指示，这也表明它在其他遵循指令的任务中也会遇到困难。\n\n* **问** 这只是一个良好的开端，但远非定论。\n* **答** 我们同意这一点。还有很多工作要做，这个问题远未解决。但“良好的开端”意味着我们有望在这个领域取得进展。通过开源该模型，我们希望能够吸引社区参与进来，共同推动这一领域的进步。\n\n## 其他资源\n* 请查看我们的 [Open-RAG-Eval](https:\u002F\u002Fgithub.com\u002Fvectara\u002Fopen-rag-eval)：一个开源的 RAG 评估框架，它使用 HHEM，同时还提供检索、事实一致性和引用等方面的指标。\n* 请了解 HHEM 模型的商业版本（参见 [API 文档](https:\u002F\u002Fdocs.vectara.com\u002Fdocs\u002Frest-api\u002Fevaluate-factual-consistency)），该版本具有更好的检测性能，并支持多种语言。\n* 如需了解更多关于 Vectara 的信息或安排演示，请在此 [联系我们](https:\u002F\u002Fwww.vectara.com\u002Fcontact-us)。\n\n\u003Cimg referrerpolicy=\"no-referrer-when-downgrade\" 
src=\"https:\u002F\u002Fstatic.scarf.sh\u002Fa.png?x-pxid=5f53f560-5ba6-4e73-917b-c7049e9aea2c\" \u002F>","# Hallucination Leaderboard 快速上手指南\n\nHallucination Leaderboard 是一个基于 Vectara 的幻觉评估模型（HHEM）构建的公开榜单，用于评估大型语言模型（LLM）在总结文档时产生“幻觉”（即捏造事实）的频率。该工具帮助开发者选择在 RAG（检索增强生成）或代理系统中事实一致性更高的模型。\n\n## 环境准备\n\n本项目主要是一个评估数据集和结果展示库，核心评估依赖于 Vectara 的 HHEM 模型。若要复现评估或本地运行相关逻辑，需满足以下要求：\n\n*   **操作系统**：Linux, macOS 或 Windows (WSL 推荐)\n*   **Python 版本**：建议 Python 3.9 或更高版本\n*   **前置依赖**：\n    *   `git`：用于克隆仓库\n    *   `pip`：Python 包管理工具\n    *   (可选) GPU 环境：若需本地运行 HHEM 模型进行大规模推理，建议使用支持 CUDA 的环境以加速处理。\n\n## 安装步骤\n\n### 1. 克隆仓库\n首先从 GitHub 克隆项目代码到本地：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fvectara\u002Fhallucination-leaderboard.git\ncd hallucination-leaderboard\n```\n\n### 2. 安装 Python 依赖\n虽然仓库主要用于展示榜单数据，但若需使用其评估模型（HHEM），请安装相关依赖。由于原仓库未提供明确的 `requirements.txt`，通常需安装 HHEM 模型所需的库。你可以直接通过 Hugging Face 获取模型：\n\n```bash\npip install transformers torch accelerate sentencepiece\n```\n\n> **提示**：国内开发者若下载 Hugging Face 模型较慢，可设置镜像环境变量：\n> ```bash\n> export HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com\n> ```\n\n### 3. 获取评估模型 (可选)\n如果你希望在本地运行评估逻辑，可以从 Hugging Face 下载开源版本的 HHEM 模型（如 HHEM-2.1-Open）：\n\n```python\nfrom transformers import AutoModelForSequenceClassification, AutoTokenizer\n\nmodel_name = \"vectara\u002Fhallucination_evaluation_model\" # 或具体版本如 vectara\u002Fhhem-2.1-open\ntokenizer = AutoTokenizer.from_pretrained(model_name)\nmodel = AutoModelForSequenceClassification.from_pretrained(model_name)\n```\n\n## 基本使用\n\n该项目的核心用途是**查阅榜单**以选择低幻觉率的模型，或参考其**方法论**构建自己的评估流程。\n\n### 方式一：查看在线交互式榜单\n最快捷的方式是直接访问部署在 Hugging Face Spaces 上的交互式榜单，无需本地安装：\n\n*   **地址**: [https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fvectara\u002Fleaderboard](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fvectara\u002Fleaderboard)\n*   **功能**: 可按模型名称搜索，对比不同模型的“幻觉率 (Hallucination Rate)\"、“事实一致性率 (Factual Consistency Rate)\"及“回答率 (Answer Rate)\"。\n\n### 方式二：参考评估提示词 (Prompt) 进行本地测试\n你可以直接使用该项目定义的标准化 Prompt 来测试任意 LLM 的幻觉情况。以下是调用 LLM API 时的标准 Prompt 模板：\n\n```text\nYour task is to provide a concise and factual summary for the given passage.\nRules\n1. Summarize using only the information in the given passage. Do not infer. Do not use your internal knowledge.\n2. Do not provide a preamble or explanation, output only the summary.\n3. Summaries should never exceed 20 percent of the original text's length.\n4. Maintain the tone of the passage.\nIf you are unable to summarize the text due to missing, unreadable, irrelevant or insufficient content, respond only with:\n\"I am unable to summarize this text.\"\nHere is the passage:\n\u003CPASSAGE>\n```\n\n**使用示例 (Python伪代码):**\n\n```python\ndocument = \"这里是你的源文档内容...\" # 替换为实际文档\nprompt = f\"\"\"\nYour task is to provide a concise and factual summary for the given passage.\nRules\n1. Summarize using only the information in the given passage. Do not infer. Do not use your internal knowledge.\n2. Do not provide a preamble or explanation, output only the summary.\n3. Summaries should never exceed 20 percent of the original text's length.\n4. 
Maintain the tone of the passage.\nIf you are unable to summarize the text due to missing, unreadable, irrelevant or insufficient content, respond only with:\n\"I am unable to summarize this text.\"\nHere is the passage:\n{document}\n\"\"\"\n\n# 调用你的 LLM (温度建议设置为 0)\n# response = call_llm(prompt, temperature=0)\n```\n\n### 方式三：解读榜单指标\n在查看 `README` 中的表格或使用数据时，请关注以下核心指标：\n*   **Hallucination Rate (幻觉率)**: 越低越好。表示模型在总结中捏造事实的比例。\n*   **Factual Consistency Rate (事实一致性率)**: 越高越好。计算公式通常为 `100% - 幻觉率`。\n*   **Answer Rate (回答率)**: 模型成功生成总结而非拒绝回答的比例。过低可能意味着模型过于敏感或触发安全过滤。\n*   **Average Summary Length**: 平均总结长度（单词数），用于评估模型的简洁性。\n\n> **注意**: 该榜单数据基于温度 (Temperature) 为 0 的测试结果，旨在最大化事实准确性。在实际应用中，若需创造性任务，请适当调整温度参数，但需注意幻觉率可能会上升。","某金融合规团队正在构建自动化的财报摘要系统，需要确保大模型生成的结论严格基于原始文档，绝不能出现虚构数据或事实错误。\n\n### 没有 hallucination-leaderboard 时\n- **选型盲目依赖名气**：团队仅凭模型参数量或通用排行榜选择模型，误以为“越大越强”，结果选用了幻觉率高达 9% 的模型，导致摘要中频繁出现虚构的营收数字。\n- **缺乏量化评估标准**：内部测试全靠人工抽检，效率低下且主观性强，无法准确量化不同模型在“事实一致性”上的具体差距，难以向管理层汇报风险。\n- **试错成本高昂**：为了找到低幻觉模型，工程师不得不花费数周时间逐一部署和微调多个候选模型，严重拖慢了项目上线进度。\n- **潜在合规风险不可控**：由于无法预知模型在特定总结任务中的幻觉概率，系统上线后可能因生成误导性信息而引发监管处罚。\n\n### 使用 hallucination-leaderboard 后\n- **精准锁定低幻觉模型**：团队直接查阅榜单，迅速发现 `antgroup\u002Ffinix_s1_32b` 的幻觉率仅为 1.8%，远低于其他热门模型，立即将其定为首选基座。\n- **数据驱动决策透明**：利用榜单提供的“事实一致性比率”和“平均摘要长度”等多维数据，团队用客观指标证明了选型合理性，轻松通过内部风控审核。\n- **大幅缩短研发周期**：无需再进行大规模的盲测，直接基于榜单前几名模型进行小规模验证，将模型筛选时间从数周压缩至两天。\n- **显著降低业务风险**：在开发初期就规避了高幻觉模型，从源头上保证了财报摘要的准确性，避免了后续因数据造假带来的法律纠纷。\n\nhallucination-leaderboard 通过将抽象的“可信度”转化为可视化的量化指标，帮助企业在关键业务场景中快速甄选出最诚实可靠的 AI 模型。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fvectara_hallucination-leaderboard_a6baf24d.png","vectara","Vectara","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fvectara_7cf2f32f.png","The trusted GenAI Platform",null,"https:\u002F\u002Fvectara.com","https:\u002F\u002Fgithub.com\u002Fvectara",[81],{"name":82,"color":83,"percentage":84},"Python","#3572A5",100,3191,"2026-04-15T09:20:44","Apache-2.0",1,"","未说明",{"notes":92,"python":90,"dependencies":93},"该仓库主要是一个排行榜展示项目，其核心评估模型 HHEM-2.3 为 Vectara 的商业闭源模型，无法直接在本地运行。README 中提到了开源变体 HHEM-2.1-Open 托管在 Hugging Face 和 Kaggle 上，但未提供具体的本地部署环境需求、依赖库或安装指令。榜单数据是通过调用各 LLM 的 API（温度设置为 0）生成的，而非本地推理。",[],[14,35],[96,97,98],"generative-ai","hallucinations","llm","2026-03-27T02:49:30.150509","2026-04-16T02:08:15.302301",[],[]]