[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-bird-bench--BIRD-Interact":3,"tool-bird-bench--BIRD-Interact":65},[4,17,27,35,48,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",156033,2,"2026-04-14T23:32:00",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,3,"2026-04-06T11:19:32",[15,26,14,13],"图像",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":10,"last_commit_at":33,"category_tags":34,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 
都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":10,"last_commit_at":41,"category_tags":42,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,"2026-04-10T11:13:16",[26,43,44,45,14,46,15,13,47],"数据工具","视频","插件","其他","音频",{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":54,"last_commit_at":55,"category_tags":56,"status":16},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 
社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[15,43,46],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":54,"last_commit_at":63,"category_tags":64,"status":16},5773,"cs-video-courses","Developer-Y\u002Fcs-video-courses","cs-video-courses 是一个精心整理的计算机科学视频课程清单，旨在为自学者提供系统化的学习路径。它汇集了全球知名高校（如加州大学伯克利分校、新南威尔士大学等）的完整课程录像，涵盖从编程基础、数据结构与算法，到操作系统、分布式系统、数据库等核心领域，并深入延伸至人工智能、机器学习、量子计算及区块链等前沿方向。\n\n面对网络上零散且质量参差不齐的教学资源，cs-video-courses 解决了学习者难以找到成体系、高难度大学级别课程的痛点。该项目严格筛选内容，仅收录真正的大学层级课程，排除了碎片化的简短教程或商业广告，确保用户能接触到严谨的学术内容。\n\n这份清单特别适合希望夯实计算机基础的开发者、需要补充特定领域知识的研究人员，以及渴望像在校生一样系统学习计算机科学的自学者。其独特的技术亮点在于分类极其详尽，不仅包含传统的软件工程与网络安全，还细分了生成式 AI、大语言模型、计算生物学等新兴学科，并直接链接至官方视频播放列表，让用户能一站式获取高质量的教育资源，免费享受世界顶尖大学的课堂体验。",79792,"2026-04-08T22:03:59",[46,26,43,13],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":71,"readme_en":72,"readme_zh":73,"quickstart_zh":74,"use_case_zh":75,"hero_image_url":76,"owner_login":77,"owner_name":78,"owner_avatar_url":79,"owner_bio":80,"owner_company":80,"owner_location":80,"owner_email":80,"owner_twitter":80,"owner_website":81,"owner_url":82,"languages":83,"stars":96,"forks":97,"last_commit_at":98,"license":99,"difficulty_score":23,"env_os":100,"env_gpu":101,"env_ram":102,"env_deps":103,"category_tags":111,"github_topics":80,"view_count":10,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":112,"updated_at":113,"faqs":114,"releases":129},7653,"bird-bench\u002FBIRD-Interact","BIRD-Interact","[ICLR 2026 Oral] BIRD-INTERACT: Re-imagines Text-to-SQL evaluation via lens of dynamic interactions.","BIRD-Interact 是一个专为评估“文本转 SQL\"（Text-to-SQL）能力而设计的开源基准测试框架。它由香港大学与谷歌云联合推出，并荣获 ICLR 2026 口头报告奖。传统评估往往只关注模型能否一次性生成正确的 SQL 语句，忽略了真实场景中人类专家会通过多轮对话逐步澄清需求、修正错误的过程。BIRD-Interact 创新性地引入“动态交互”视角，将评估重点从单次输出转向多轮交互过程，从而更真实地反映模型在复杂数据查询任务中的实际表现。\n\n该工具特别适合从事自然语言处理、数据库交互或大模型应用的研究人员与开发者使用。如果你正在训练或优化一个能理解自然语言并生成数据库查询的 AI 系统，BIRD-Interact 
能提供更细腻、更具现实意义的性能反馈。其技术亮点在于构建了支持多轮追问、上下文记忆和错误修正的交互式评估流程，并配套提供了轻量级数据集（bird-interact-lite），便于快速集成与实验。目前项目已开放 leaderboard 和 HuggingFace 数据接口，支持 Python 3.10+ 环境，兼容主流大模型 API。通过这一框架，社区可以更","BIRD-Interact 是一个专为评估“文本转 SQL\"（Text-to-SQL）能力而设计的开源基准测试框架。它由香港大学与谷歌云联合推出，并荣获 ICLR 2026 口头报告奖。传统评估往往只关注模型能否一次性生成正确的 SQL 语句，忽略了真实场景中人类专家会通过多轮对话逐步澄清需求、修正错误的过程。BIRD-Interact 创新性地引入“动态交互”视角，将评估重点从单次输出转向多轮交互过程，从而更真实地反映模型在复杂数据查询任务中的实际表现。\n\n该工具特别适合从事自然语言处理、数据库交互或大模型应用的研究人员与开发者使用。如果你正在训练或优化一个能理解自然语言并生成数据库查询的 AI 系统，BIRD-Interact 能提供更细腻、更具现实意义的性能反馈。其技术亮点在于构建了支持多轮追问、上下文记忆和错误修正的交互式评估流程，并配套提供了轻量级数据集（bird-interact-lite），便于快速集成与实验。目前项目已开放 leaderboard 和 HuggingFace 数据接口，支持 Python 3.10+ 环境，兼容主流大模型 API。通过这一框架，社区可以更系统地推动 Text-to-SQL 技术向实用化、人性化方向演进。","\n\u003Cdiv align=\"right\">\n  \u003Cdetails>\n    \u003Csummary >🌐 Language\u003C\u002Fsummary>\n    \u003Cdiv>\n      \u003Cdiv align=\"right\">\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=en\">English\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=zh-CN\">简体中文\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=zh-TW\">繁體中文\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=ja\">日本語\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=ko\">한국어\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=hi\">हिन्दी\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca 
href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=th\">ไทย\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=fr\">Français\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=de\">Deutsch\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=es\">Español\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=it\">Italiano\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=ru\">Русский\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=pt\">Português\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=nl\">Nederlands\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=pl\">Polski\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=ar\">العربية\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=fa\">فارسی\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=tr\">Türkçe\u003C\u002Fa>\u003C\u002Fp>\n        
\u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=vi\">Tiếng Việt\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=id\">Bahasa Indonesia\u003C\u002Fa>\u003C\u002Fp>\n      \u003C\u002Fdiv>\n    \u003C\u002Fdiv>\n  \u003C\u002Fdetails>\n\n\u003C\u002Fdiv>\n\n# BIRD-INTERACT 1.0 \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbird-bench_BIRD-Interact_readme_98f2e5b39720.jpg\" alt=\"HKU Logo\" width=\"50\" style=\"vertical-align:middle;margin-left:10px;\"> \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbird-bench_BIRD-Interact_readme_613ff9cf74ef.png\" alt=\"Google Cloud Logo\" width=\"50\" style=\"vertical-align:middle;margin-left:10px;\">\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbird-bench_BIRD-Interact_readme_f6d7eb75d21d.png\" \n       style=\"width: 30%; min-width: 100px; display: block; margin: auto; border-radius: 15px !important;\">\n\u003C\u002Fp>\n\n\n\u003Cdiv style=\"display: flex; justify-content: center; align-items: center; gap: 10px;\">\n  \u003Ca href=\"https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-sa\u002F4.0\u002Fdeed.en\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-CC%20By%20SA%204.0-orange.svg\" alt=\"License\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fbird-interact.github.io\u002F\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLeaderboard-2025-28a745.svg\" alt=\"Leaderboard\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbirdsql\u002Fbird-interact-lite\u002Ftree\u002Fmain\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDataset-HuggingFace-FFD21E.svg\" alt=\"HuggingFace\">\n  \u003C\u002Fa>\n  \u003Ca 
href=\"https:\u002F\u002Fwww.python.org\u002Fdownloads\u002Frelease\u002Fpython-310\u002F\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10+-teal.svg\" alt=\"Python\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fopenai\u002F\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FOpenAI-1.40+-beige.svg\" alt=\"OpenAI\">\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n## ⚠️ Announcement  \nPlease note that before your evaluation process, when Docker loads the databases, errors may occasionally occur due to environment inconsistency (these will not terminate the process but will appear in the Docker logs). As a result, some databases may fail to load properly, leading to empty databases. This will cause the evaluation results to be abnormally low.  \n👉 Therefore, we strongly recommend checking the Docker logs for any errors **before running the evaluation** and verifying that all databases have been successfully loaded.\n\n👉 We have updated the **Submission Guidelines**, where the customized agent scaffolds are supported. Please feel free to take a look at our detailed submission guidelines [here](https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F1F1DSqHDBzGvXFlWU8iCl9otkqxIefgcH\u002Fedit?usp=sharing&ouid=108161566779099489782&rtpof=true&sd=true).\n\n## 📰 News\n\n- [2026-03-29] 🔥🔥🔥 **BIRD-Interact-ADK**: We release **[BIRD-Interact-ADK](.\u002FBIRD-Interact-ADK\u002F)**, a Google ADK-based implementation with modular 3-microservices (agent, user simulator, and DB Env) architecture. Easily swap in your own agent, user simulator, or DB environment. Supports parallel execution and any [LiteLlm-compatible](https:\u002F\u002Fdocs.litellm.ai\u002Fdocs\u002Fproviders) LLM provider. 
We recommend using this implementation for your research.\n\n- [2026-02-08] 🔥🔥🔥 Our **[Bird-Interact paper](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2510.05318)** has been accepted at **ICLR 2026 (Oral)**! See you in Rio 🇧🇷!  \n\n- [2025-11-06] 🐛 **Bug Fix** & 🐳 **Docker update**: We updated sqlglot to version 26.16.4 to fix a bug where the SQL parser could not correctly parse SQL for the user simulator. You can apply the fix by reinstalling it with `pip install sqlglot==26.16.4` in the `bird_interact_eval` env. The `bird_interact_eval` image is also updated, so you can instead pull it and recreate the `bird_interact_eval` container.\n\n- [2025-10-21] 🐳 **Docker update**: We added the Docker image for the Full DB Env and pushed 3 Docker images (Base\u002FFull DB Env and the evaluation environment for both `a-Interact` and `c-Interact`) to Docker Hub to simplify environment setup. There is no need to download the DB dumps and build the images manually!\n\n- [2025-10-08] 📝 Our **[Bird-Interact paper](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2510.05318)** is now publicly available!  \n  It presents the full details, methodology, and evaluation of our interactive text-to-SQL benchmark.  \n  👉 Check it out to learn more about the ideas behind [BIRD-Interact](https:\u002F\u002Fbird-interact.github.io\u002F).\n\n- [2025-08-26] 🚀 We're excited to announce the release of the **[BIRD-Interact-Full (600)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbirdsql\u002Fbird-interact-full)** set!  \nIt's a tough one: the best LLMs achieve only a **16.33%** success rate, with just **10.0%** on the `c-interact` and `a-interact` portions.  \n👉 For more details, please visit our [project website](https:\u002F\u002Fbird-interact.github.io\u002F).\n\n- [2025-08-26] 📬 We'll be sending the **Ground Truth & Test cases** to our mailing list this week.  \nIf you want early access, please send an email as instructed on the site for an **automatic download**.  
\n\n- [2025-08-26] 💾 On another note, we've also released a SQLite version of **[LiveSQLBench-Lite](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbirdsql\u002Flivesqlbench-base-lite-sqlite)** for easier local research.  \nThe full **LiveSQLBench-Base** and **-Large** versions are coming soon!\n\n- [2025-08-22] **Bug Fix**: In the Bird-Interact-Agent code, we fixed a bug where the stored phase-1 SQL could not be executed successfully when evaluating phase-2 SQL, leading to a lower Phase-2 success rate. This bug only affects tasks where the phase-1 SQL modifies the database, e.g. CREATE TABLE.\n\n## 🧸 Overview\n\nBIRD-INTERACT, an interactive text-to-SQL benchmark, **re-imagines Text-to-SQL evaluation via lens of dynamic interactions**.\nThe environment blends a hierarchical knowledge base, database documentation, and a function-driven user simulator to recreate authentic enterprise environments across full **CRUD** operations.\nIt offers two rigorous test modes: (1) passive **Conversational Interaction** and (2) active **Agentic Interaction**, spanning 600 annotated tasks including Business Intelligence (BI) queries, CRUD operations, and more, each guarded by executable test cases.\nTypical evaluations trigger 1,968-5,496 interaction turns between model and user simulator, while state-of-the-art reasoning models currently solve only **≈24%** and **≈18%** of tasks, underscoring the benchmark's challenge.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbird-bench_BIRD-Interact_readme_74bf5baa622a.png\" \n       style=\"width: 100%; min-width: 100px; display: block; margin: auto; \">\n\u003C\u002Fp>\n\n### ✅ Two Evaluation Modes\n\nBIRD-INTERACT supports two evaluation modes as mentioned above:\n\n   - **c-Interact**: Conversational Interaction, a passive mode in which the workflow is fixed. 
The code and detailed information can be found in `bird_interact_conv`.\n   - **a-Interact**: Agentic Interaction, an embodied active mode in which the workflow is dynamic and led by the model. The code and detailed information can be found in `bird_interact_agent`.\n\n\n### 🐣 Lite Version\n\nWe are releasing a lite version of BIRD-INTERACT, `bird-interact-lite-exp`, which includes 270 high-quality real-world tasks specifically for PostgreSQL. This is a good starting point for quick experimentation. \n\n### 🦜 Full Version\n\nThe full version of BIRD-INTERACT, `bird-interact-full`, is a comprehensive benchmark that includes 600 tasks for PostgreSQL. It covers a wide range of SQL operations and user queries.\n\n### Model Performance Results on BIRD-INTERACT-FULL\n\n#### 1. **c-Interact Text-to-SQL** Performance\n| Rank | Model Name         | Normalized Reward | Avg Cost (USD)\u002FTask | Level              |\n|:----:|:-------------------|:-----------------:|:-------------------:|:------------------:|\n| 1    | Gemini-2.5-Pro     | 20.92             | $0.04               | 🏆 Excellent Chat  |\n| 2    | O3-Mini            | 20.27             | $0.07               | 🏆 Excellent Chat  |\n| 3    | Claude-Sonnet-4    | 18.35             | $0.29               | 💎 Good Chat       |\n| 4    | Qwen-3-Coder-480B  | 17.75             | $0.11               | 💎 Good Chat       |\n| 5    | Deepseek-Chat-V3.1 | 15.15             | $0.12               | ✨ Standard        |\n| 6    | Claude-Sonnet-3.7  | 13.87             | $0.29               | ✨ Standard        |\n| 7    | GPT-5              | 12.58             | $0.08               | ⚪ Basic           |\n\n#### 2. 
**a-Interact Text-to-SQL** Performance\n| Rank | Model Name         | Normalized Reward | Avg Cost (USD)\u002FTask | Level                    |\n|:----:|:-------------------|:-----------------:|:-------------------:|:------------------------:|\n| 1    | GPT-5              | 25.52             | $0.24               | 🏆 Excellent Interaction |\n| 2    | Claude-Sonnet-4    | 23.28             | $0.51               | 🏆 Excellent Interaction |\n| 3    | Claude-Sonnet-3.7  | 17.45             | $0.60               | 💎 Good Interaction      |\n| 4    | Gemini-2.5-Pro     | 17.33             | $0.22               | 💎 Good Interaction      |\n| 5    | O3-Mini            | 16.43             | $0.06               | ✨ Standard              |\n| 6    | Deepseek-Chat-V3.1 | 13.47             | $0.06               | ✨ Standard              |\n| 7    | Qwen-3-Coder-480B  | 10.58             | $0.07               | ⚪ Basic                 |\n\n> \\* Budget Parameters: Starting Budget\u002FUser Patience Budget, measured by our virtual currency *bird-coin*s \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbird-bench_BIRD-Interact_readme_73dfce7a5e2a.png\" style=\"height: 1em; vertical-align: middle;\">. Refer to [bird_interact_agent\u002FREADME.md](bird_interact_agent\u002FREADME.md#task-setting) for more details.\n\n### Interaction-Time Scaling (ITS)\n\nInteraction-Time Scaling (ITS) refers to a model's ability to continuously increase its end performance through multi-turn interactions. When this interactive performance surpasses the model's idealized single-turn performance on a fully specified, unambiguous task, we say it satisfies the **ITS law**. As user patience grows and interaction turns accumulate, performance keeps improving, demonstrating that the model can sustain effective communication over extended dialogue. 
Currently, we find that only claude-3-7-sonnet satisfies the ITS law.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbird-bench_BIRD-Interact_readme_0b259cfca480.png\" \n       style=\"width: 100%; min-width: 100px; display: block; margin: auto; \">\n\u003C\u002Fp>\n\n## Environment Setup\n\n1. Run Docker containers for the bird-interact-lite database, the bird-interact-full database, and the evaluation environment:\n  \n  > If you just want to evaluate on `bird-interact-lite`, you can comment out the [`postgresql_full` service](.\u002Fenv\u002Fdocker-compose.yml#L21-L31) in `docker-compose.yml` to speed up the environment setup.\n  \n  Start the environment by running: \n   ```bash\n   cd env\n   docker compose pull \n   docker compose up -d\n   ```\n   Wait several minutes for database initialization. \n   \n  You can track the build progress with:\n  ```bash\n  docker compose logs -f --tail=100 bird_interact_postgresql_full # or bird_interact_postgresql for bird-interact-lite\n  ```\n  Once initialization finishes, the logs should show no errors and end with a line like:\n\n  ```bash\n  bird_interact_postgresql_full  | 2025-10-28 17:58:30.413 HKT [1] LOG:  database system is ready to accept connections\n  ```\n\n  If you have created the containers before and want to recreate them, run the following commands:\n  ```bash\n  docker compose down -v # this cmd removes the containers and the volumes\n  docker compose pull   # pull the latest images from Docker Hub\n  docker compose up -d --force-recreate # start the containers again. --force-recreate forces the containers to be recreated. 
\n  # Or `docker compose up -d --force-recreate bird_interact_eval` to recreate only the bird_interact_eval container (the evaluation code environment).\n  ```\n   \n   This runs 3 containers using prebuilt images from Docker Hub:\n   - `bird_interact_postgresql`: PostgreSQL database for bird-interact-lite\n   - `bird_interact_postgresql_full`: PostgreSQL database for bird-interact-full\n   - `bird_interact_eval`: Evaluation environment for both `a-Interact` and `c-Interact`.\n\n   Now you can enter the evaluation environment by executing the following command:\n   ```bash\n   docker compose exec bird_interact_eval bash\n   ```\n\n2. (Optional) Build the environment manually (if you want to build the images from scratch): \n   - Download the database dumps \n      - [bird-interact-lite](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1QIGQlRKbkqApAOrQXPqFJgUg8rQ7HRRZ\u002Fview). Unzip and rename it as `env\u002Fpostgre_table_dumps`.\n      - [bird-interact-full](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1V9SFIWebi27JtaDUAScG1xE9ELbYcWLR\u002Fview). Unzip and rename it as `env\u002Fpostgre_table_dumps_full`.\n   - Build the environment manually using `docker-compose.build.yml`.\n      ```bash\n      cd env\u002F\n      docker compose -f docker-compose.build.yml build\n      docker compose -f docker-compose.build.yml up -d\n      ```\n\n3. 
(Recommended) Check that the database containers are built and running successfully.\n\n-  Save the container build logs to files and confirm that the databases were built without errors:\n   ```bash \n   docker logs bird_interact_postgresql > build_bird_interact_postgresql.log 2>&1\n   docker logs bird_interact_postgresql_full > build_bird_interact_postgresql_full.log 2>&1\n   ```\n   If errors occurred, `\"Errors occurred during import:\"` will appear in the log files.\n\n\n-  Check that the database containers are in good shape.\n   \n   Use our provided Python script to verify database metadata:\n   ```bash\n   docker compose exec bird_interact_eval bash\n   cd \u002Fapp\u002Fenv\n   python check_db_metadata.py --host bird_interact_postgresql\n   python check_db_metadata.py --host bird_interact_postgresql_full\n   ```\n   \n   Expected results:\n   - **bird-interact-lite**: \n     - 📈 Total Databases: 18\n     - 📋 Total Tables: 175\n     - 🔢 Total Columns: 2286\n     - 📈 Avg Rows per Table: 1,038.48\n     - 💾 Total Size: 207.15 MB (approximately)\n   - **bird-interact-full**: \n     - 📈 Total Databases: 22\n     - 📋 Total Tables: 244\n     - 🔢 Total Columns: 2011\n     - 📈 Avg Rows per Table: 1,121.19\n     - 💾 Total Size: 272.00 MB (approximately)\n\n\n## 📦 Dataset Details\n\n### Dataset Description\n\n- **Database:** The complete PostgreSQL database can be downloaded from [bird-interact-lite](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1QIGQlRKbkqApAOrQXPqFJgUg8rQ7HRRZ\u002Fview) and [bird-interact-full](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1V9SFIWebi27JtaDUAScG1xE9ELbYcWLR\u002Fview).\n- **data:** Each data instance contains the following main parts:\n   - `selected_database`: The name of the database.  \n   - `query`: The unambiguous user query.  
\n   - `amb_user_query`: The user query with injected ambiguities.\n   - `user_query_ambiguity`: The ambiguities injected into the user query.\n   - `non_critical_ambiguity`: The non-critical ambiguities like order, limit, etc.\n   - `knowledge_ambiguity`: The ambiguities created by masking external knowledge. \n   - `sol_sql`: The ground truth SQL solution.  \n   - `preprocess_sql`: SQL queries to run before executing the solution or prediction.  \n   - `clean_up_sql`: SQL queries to run after the test cases to revert any changes made to the database.  \n   - `test_cases`: A set of test cases to validate the predicted SQL.\n   - `follow_up`: The labeled follow-up questions.\n   - `external_knowledge`: The external knowledge related to the specific task.\n\n- **evaluation:** The evaluation code is available in the [`.\u002Fevaluation`](.\u002Fevaluation) directory.\n- **Curated by:** BIRD Team & Google Cloud\n- **License:** [cc-by-sa-4.0](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-sa\u002F4.0\u002F)\n- **HuggingFace Dataset Card:** [bird-interact-lite](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbirdsql\u002Fbird-interact-lite)\n  and [bird-interact-full](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbirdsql\u002Fbird-interact-full) for PostgreSQL; and [mini-interact](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbirdsql\u002Fmini-interact) for SQLite.\n### Dataset Uses\n\nTo avoid data leakage through auto-crawling, we do not include the ground-truth (GT) solution SQL and test cases with the public data.\nPlease email [bird.bench25@gmail.com](mailto:bird.bench25@gmail.com) with the tag `[bird-interact-lite GT&Test Cases]` or `[bird-interact-full GT&Test Cases]` in the title to request the ground truth and test cases for the bird-interact-lite or bird-interact-full dataset; they will be sent automatically.\n\n\n### Combine the public data with the ground truth and test cases\n\nThen use the following script to combine the public data with the ground truth 
and test cases:\n\nTake the full version as an example:\n(1) Run:\n```bash\npython combine_public_with_gt.py \u002Fpath\u002Fto\u002Fbird-interact-full\u002Fbird_interact_data.jsonl \u002Fpath\u002Fto\u002Fbird_interact_full_gt_kg_testcases_08022.jsonl \u002Fpath\u002Fto\u002Fbird_interact_data.jsonl  # bird_interact_full_gt_kg_testcases_08022.jsonl is the data of ground-truth fields, which is obtained by emailing us.\n```\nThis will create a new file at `\u002Fpath\u002Fto\u002Fbird_interact_data.jsonl` with the combined data. \n\n(2) Then replace the original public data with the combined data:\n\n```bash\ncp \u002Fpath\u002Fto\u002Fbird_interact_data.jsonl \u002Fpath\u002Fto\u002Fbird-interact-full\u002Fbird_interact_data.jsonl\n```\n\nSame for the other versions: bird-interact-lite, mini version, etc. Just set correct paths for the public data and the ground truth and test cases, and then replace the public data with the combined data.\n\n\n\n\n\u003C!-- ### Use the Dataset from HuggingFace\n\nYou can download the dataset from HuggingFace using the following command:\n```bash\nfrom datasets import load_dataset\n# Load the flash version of the dataset\ndataset = load_dataset(\"birdsql\u002Fbird-interact-lite\")\nprint(dataset[\"lite\"][0])\n\n# Load the full version of the dataset (coming soon)\ndataset = load_dataset(\"birdsql\u002Fbird-interact-full\")\nprint(dataset[\"full\"][0])\n```\n\nOr you can use the provided script to download the full version of the dataset and split it into different dialects.\n```bash\ncd baseline\u002Fdata\npython pull_data.py \\\n  --schema_path path\u002Fto\u002Ffull_schema.jsonl \\\n  --input_path path\u002Fto\u002Finput.jsonl \\ # Path to the input JSONL file (may be empty if you want to download the dataset from HuggingFace)\n  --output_folder path\u002Fto\u002Foutput_dir # output folder of the split files\n``` -->\n\n## Folder Structure\n```ultree\n.\n├── LICENSE\n├── README.md\n├── BIRD-Interact-ADK\n│   ├── ...\n│   └── 
README.md\n├── bird_interact_conv\n│   ├── ...\n│   └── README.md\n├── bird_interact_agent\n│   ├── ...\n│   └── README.md\n├── evaluation\n│   ├── docker-compose.yml\n│   ├── env\n│   ├── postgre_table_dumps\n│   ├── run\n│   └── src\n├── materials\n│   ├── ...\n└── requirements.txt\n```\nThe details about running **a-interact** can be found in `.\u002Fbird_interact_agent\u002FREADME.md`; **c-interact** can be found in `.\u002Fbird_interact_conv\u002FREADME.md`; and the **ADK-based implementation** can be found in `.\u002FBIRD-Interact-ADK\u002FREADME.md`.\n\n## 📋 Todo Lists\n\n- [x] Release lite version, bird-interact-lite (270).\n- [x] Release conversational version, bird-interact-conv.\n- [x] Release agent version, bird-interact-agent.\n- [x] Release Full bird-interact-full (600).\n- [x] Release ADK-based implementation, BIRD-Interact-ADK.\n- [ ] SFT \u002F RL a User Simulator\n\n## Acknowledgement\nWe would like to express our sincere gratitude to **Irina Saparina**, **Mohammadreza Pourreza**, **Mehdi Bouzouina**, **Hailong Li**, **Jiatong Shi**, and Professor **Shinji Watanabe** for their fruitful discussions and valuable insights that helped improve this project.\n\n## Created By:\nBIRD Team & Google Cloud\n\n\n\n\n\n\n\n## Citation\n\n```bibtex\n@inproceedings{\nhuo2026birdinteract,\ntitle={{BIRD}-{INTERACT}: Re-imagining Text-to-{SQL} Evaluation via Lens of Dynamic Interactions},\nauthor={Nan Huo and Xiaohan Xu and Jinyang Li and Per Jacobsson and Shipei Lin and Bowen Qin and Binyuan Hui and Xiaolong Li and Ge Qu and Shuzheng Si and Linheng Han and Edward Alexander and Xintong Zhu and Rui Qin and Ruihan Yu and Yiyao Jin and Feige Zhou and Weihao Zhong and Yun Chen and Hongyu Liu and Chenhao Ma and Fatma Ozcan and Yannis Papakonstantinou and Reynold Cheng},\nbooktitle={The Fourteenth International Conference on Learning Representations},\nyear={2026},\nurl={https:\u002F\u002Fopenreview.net\u002Fforum?id=nHrYBGujps}\n}\n```\n\n\n## Change Log\n\n- 
[2025-11-06] 🐛 **Bug Fix** & 🐳 **Docker update**: Updated sqlglot to version 26.16.4 to fix a bug where the SQL parser could not correctly parse SQL for the user simulator. You can apply the fix by reinstalling it with `pip install sqlglot==26.16.4` in the `bird_interact_eval` env. The `bird_interact_eval` image is also updated, so you can instead pull it and recreate the `bird_interact_eval` container.\n- [2025-10-21] 🐳 **Docker update**: Added the Docker image for the Full DB Env and pushed 3 Docker images (Base\u002FFull DB Env and the evaluation environment for both `a-Interact` and `c-Interact`) to Docker Hub to simplify environment setup. There is no need to download the DB dumps and build the images manually! Please pull the latest images from Docker Hub and recreate the containers, e.g. using `docker compose down -v && docker compose pull && docker compose up -d --force-recreate`.\n- [2025-08-22]  🐛 **Bug Fix**: Fixed a bug where the stored phase-1 SQL could not be executed successfully when evaluating phase-2 SQL, leading to a lower Phase-2 success rate. This bug only affects tasks where the phase-1 SQL modifies the database, e.g. 
CREATE table, etc.\n","\u003Cdiv align=\"right\">\n  \u003Cdetails>\n    \u003Csummary >🌐 语言\u003C\u002Fsummary>\n    \u003Cdiv>\n      \u003Cdiv align=\"right\">\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=en\">英语\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=zh-CN\">简体中文\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=zh-TW\">繁體中文\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=ja\">日语\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=ko\">韩语\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=hi\">印地语\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=th\">泰语\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=fr\">法语\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=de\">德语\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=es\">西班牙语\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=it\">意大利语\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca 
href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=ru\">俄语\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=pt\">葡萄牙语\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=nl\">荷兰语\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=pl\">波兰语\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=ar\">阿拉伯语\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=fa\">波斯语\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=tr\">土耳其语\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=vi\">越南语\u003C\u002Fa>\u003C\u002Fp>\n        \u003Cp>\u003Ca href=\"https:\u002F\u002Fopenaitx.github.io\u002Fview.html?user=bird-bench&project=BIRD-Interact&lang=id\">印尼语\u003C\u002Fa>\u003C\u002Fp>\n      \u003C\u002Fdiv>\n    \u003C\u002Fdiv>\n  \u003C\u002Fdetails>\n\n\u003C\u002Fdiv>\n\n# BIRD-INTERACT 1.0 \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbird-bench_BIRD-Interact_readme_98f2e5b39720.jpg\" alt=\"HKU Logo\" width=\"50\" style=\"vertical-align:middle;margin-left:10px;\"> \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbird-bench_BIRD-Interact_readme_613ff9cf74ef.png\" alt=\"Google Cloud Logo\" width=\"50\" style=\"vertical-align:middle;margin-left:10px;\">\n\n\u003Cp 
align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbird-bench_BIRD-Interact_readme_f6d7eb75d21d.png\" \n       style=\"width: 30%; min-width: 100px; display: block; margin: auto; border-radius: 15px !important;\">\n\u003C\u002Fp>\n\n\n\u003Cdiv style=\"display: flex; justify-content: center; align-items: center; gap: 10px;\">\n  \u003Ca href=\"https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-sa\u002F4.0\u002Fdeed.en\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-CC%20By%20SA%204.0-orange.svg\" alt=\"License\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fbird-interact.github.io\u002F\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLeaderboard-2025-28a745.svg\" alt=\"Leaderboard\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbirdsql\u002Fbird-interact-lite\u002Ftree\u002Fmain\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDataset-HuggingFace-FFD21E.svg\" alt=\"HuggingFace\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fwww.python.org\u002Fdownloads\u002Frelease\u002Fpython-310\u002F\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.10+-teal.svg\" alt=\"Python\">\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fopenai\u002F\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FOpenAI-1.40+-beige.svg\" alt=\"OpenAI\">\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n## ⚠️ 公告  \n请注意，在进行评估之前，当 Docker 加载数据库时，由于环境不一致，可能会偶尔出现错误（这些错误不会终止进程，但会显示在 Docker 日志中）。因此，部分数据库可能无法正确加载，导致数据库为空。这将使评估结果异常偏低。  \n👉 因此，我们强烈建议您在运行评估之前，先检查 Docker 日志中是否存在任何错误，并确认所有数据库是否已成功加载。\n\n👉 我们已更新了**提交指南**，其中支持自定义智能体框架。请随时查看我们的详细提交指南[此处](https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F1F1DSqHDBzGvXFlWU8iCl9otkqxIefgcH\u002Fedit?usp=sharing&ouid=108161566779099489782&rtpof=true&sd=true)。\n\n## 📰 新闻\n\n- [2026-03-29] 🔥🔥🔥 
**BIRD-Interact-ADK**: 我们发布了基于Google ADK的实现——**[BIRD-Interact-ADK](.\u002FBIRD-Interact-ADK\u002F)**，采用模块化的三微服务架构（智能体、用户模拟器和数据库环境）。您可以轻松替换自己的智能体、用户模拟器或数据库环境。支持并行执行以及任何[与LiteLlm兼容](https:\u002F\u002Fdocs.litellm.ai\u002Fdocs\u002Fproviders)的LLM提供商。建议您在研究中使用此实现。\n\n- [2026-02-08] 🔥🔥🔥 我们的**[Bird-Interact论文](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2510.05318)**已被**ICLR 2026（口头报告）**接收！里约热内卢见哦 🇧🇷！\n\n- [2025-11-06] 🐛 **Bug修复** & 🐳 **Docker更新**: 将sqlglot版本更新至26.16.4，以修复用户模拟器中SQL解析器无法正确解析SQL的问题。您可以通过在`bird_interact_eval`环境中运行`pip install sqlglot==26.16.4`来解决此问题。同时，`bird_interact_eval`镜像也已更新，您可以拉取新镜像并重新创建`bird_interact_eval`容器。\n\n- [2025-10-21] 🐳 **Docker更新**: 我们新增了完整数据库环境的Docker镜像，并将3个Docker镜像（基础镜像、完整数据库环境镜像以及用于`a-Interact`和`c-Interact`的评估环境镜像）推送到Docker Hub，以简化环境搭建。无需再手动下载数据库转储文件并构建镜像！\n\n- [2025-10-08] 📝 我们的**[Bird-Interact论文](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2510.05318)**现已公开发布！  \n  论文详细介绍了我们的交互式文本到SQL基准测试的全部细节、方法论及评估结果。  \n  👉 欢迎查阅，了解更多关于[BIRD-Interact](https:\u002F\u002Fbird-interact.github.io\u002F)背后的理念。\n\n- [2025-08-26] 🚀 我们很高兴宣布推出**[BIRD-Interact-Full (600)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbirdsql\u002Fbird-interact-full)**数据集！  \n  这是一份极具挑战性的数据集——目前最佳的LLM模型仅能达到**16.33%**的成功率，其中`c-Interact`和`a-Interact`部分的成功率更是低至**10.0%**。  \n  👉 更多详情请访问我们的[项目官网](https:\u002F\u002Fbird-interact.github.io\u002F)。\n\n- [2025-08-26] 📬 本周我们将向邮件列表发送**真值数据与测试用例**。  \n  若您希望提前获取，请按照网站上的说明发送邮件，即可获得**自动下载链接**。\n\n- [2025-08-26] 💾 另外，我们还发布了SQLite版本的**[LiveSQLBench-Lite](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbirdsql\u002Flivesqlbench-base-lite-sqlite)**，方便本地研究。  \n  完整版的**LiveSQLBench-Base**和**-Large**也将很快推出！\n\n- [2025-08-22] **Bug修复**: 在Bird-Interact-Agent代码中，我们修复了一个bug：在评估第二阶段SQL时，存储的第一阶段SQL无法成功执行，从而导致第二阶段的成功率降低。该bug仅影响那些第一阶段SQL会对数据库进行操作的任务，例如CREATE TABLE等。\n\n## 🧸 
概述\n\nBIRD-INTERACT是一个交互式的文本到SQL基准测试，它**通过动态交互的视角重新定义了文本到SQL的评估方式**。\n该环境结合了分层知识库、数据库文档以及函数驱动的用户模拟器，以重现涵盖完整**CRUD**操作的真实企业级环境。\n它提供两种严格的测试模式：(1)被动的**对话式交互**和(2)主动的**代理式交互**，共包含600个标注任务，涵盖商业智能（BI）、CRUD操作等，每个任务都配有可执行的测试用例。\n典型的评估过程中，模型与用户模拟器之间会进行1,968至5,496轮交互，而当前最先进的推理模型仅能分别解决约**24%**和**18%**的任务，这充分体现了该基准测试的挑战性。\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbird-bench_BIRD-Interact_readme_74bf5baa622a.png\" \n       style=\"width: 100%; min-width: 100px; display: block; margin: auto; \">\n\u003C\u002Fp>\n\n### ✅ 两种评估模式\n\n如上所述，BIRD-INTERACT支持两种评估模式：\n\n   - **c-Interact**: 对话式交互，属于被动模式，工作流程固定。相关代码和详细信息可在`bird_interact_conv`中找到。\n   - **a-Interact**: 代理式交互，属于主动模式，工作流程由模型主导且动态变化。相关代码和详细信息可在`bird_interact_agent`中找到。\n\n\n### 🐣 精简版\n\n我们发布了BIRD-INTERACT的精简版`bird-interact-lite-exp`，其中包括270个高质量的真实世界任务，专门针对PostgreSQL。这是进行快速实验的良好起点。\n\n### 🦜 完整版\n\nBIRD-INTERACT的完整版`bird-interact-full`是一个全面的基准测试，包含600个针对PostgreSQL的任务，覆盖广泛的SQL操作和用户查询。完整版即将发布。\n\n### BIRD-INTERACT-FULL上的模型性能结果\n\n#### 1. **c-Interact文本到SQL**性能\n| 排名 | 模型名称         | 归一化奖励 | 每任务平均成本（美元） | 水平              |\n|:----:|:-------------------|:-----------------:|:-------------------:|:------------------:|\n| 1    | Gemini-2.5-Pro     | 20.92             | $0.04               | 🏆 卓越聊天        |\n| 2    | O3-Mini            | 20.27             | $0.07               | 🏆 卓越聊天        |\n| 3    | Claude-Sonnet-4    | 18.35             | $0.29               | 💎 优秀聊天        |\n| 4    | Qwen-3-Coder-480B  | 17.75             | $0.11               | 💎 优秀聊天        |\n| 5    | Deepseek-Chat-V3.1 | 15.15             | $0.12               | ✨ 标准            |\n| 6    | Claude-Sonnet-3.7  | 13.87             | $0.29               | ✨ 标准            |\n| 7    | GPT-5              | 12.58             | $0.08               | ⚪ 基础            |\n\n#### 2. 
**a-Interact文本到SQL**性能\n| 排名 | 模型名称         | 归一化奖励 | 每任务平均成本（美元） | 水平                    |\n|:----:|:-------------------|:-----------------:|:-------------------:|:------------------------:|\n| 1    | GPT-5              | 25.52             | $0.24               | 🏆 卓越互动            |\n| 2    | Claude-Sonnet-4    | 23.28             | $0.51               | 🏆 卓越互动            |\n| 3    | Claude-Sonnet-3.7  | 17.45             | $0.60               | 💎 良好互动            |\n| 4    | Gemini-2.5-Pro     | 17.33             | $0.22               | 💎 良好互动            |\n| 5    | O3-Mini            | 16.43             | $0.06               | ✨ 标准                  |\n| 6    | Deepseek-Chat-V3.1 | 13.47             | $0.06               | ✨ 标准                  |\n| 7    | Qwen-3-Coder-480B  | 10.58             | $0.07               | ⚪ 基础                 |\n\n> \\* 预算参数：初始预算\u002F用户耐心预算，以我们的虚拟货币*bird-coin*s衡量\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbird-bench_BIRD-Interact_readme_73dfce7a5e2a.png\" style=\"height: 1em; vertical-align: middle;\">。更多详情请参阅[bird_interact_agent\u002FREADME.md](bird_interact_agent\u002FREADME.md#task-setting)。\n\n### 交互时间缩放定律（ITS）\n\n交互时间缩放定律（ITS）是指模型通过多轮交互能够持续提升其最终性能的能力。当这种交互性能在完全明确、无歧义的任务上超越了模型的理想化单轮性能时，我们就说该模型满足**ITS定律**。随着用户耐心的增加和交互轮次的累积，性能会不断提升，这表明模型能够在长时间的对话中保持有效的沟通。目前，我们只发现claude-3-7-sonnet满足ITS定律。\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbird-bench_BIRD-Interact_readme_0b259cfca480.png\" \n       style=\"width: 100%; min-width: 100px; display: block; margin: auto; \">\n\u003C\u002Fp>\n\n## 环境搭建\n\n1. 
运行用于bird-interact-lite数据库、bird-interact-full数据库以及评估环境的Docker容器：\n\n  > 如果你只想在`bird-interact-lite`上进行评估，可以注释掉`docker-compose.yml`中的[`postgresql_full`服务](.\u002Fenv\u002Fdocker-compose.yml#L21-L31)，以加快环境搭建速度。\n\n  通过以下命令启动环境：\n   ```bash\n   cd env\n   docker compose pull \n   docker compose up -d\n   ```\n   等待几分钟完成数据库初始化。\n   \n  你可以通过以下命令跟踪构建进度：\n  ```bash\n  docker compose logs -f --tail=100 bird_interact_postgresql_full # 或者 bird_interact_postgresql 用于 bird-interact-lite\n  ```\n  如果完成，你应该会看到没有错误的日志，例如：\n\n  ```bash\n  bird_interact_postgresql_full  | 2025-10-28 17:58:30.413 HKT [1] LOG:  database system is ready to accept connection\n  ```\n\n  如果你之前已经创建过容器并希望重新创建，可以运行以下命令：\n  ```bash\n  docker compose down -v # 此命令会移除容器及其数据卷\n  docker compose pull   # 从Docker Hub拉取最新镜像\n  docker compose up -d --force-recreate # 重新构建并启动容器。--force-recreate表示强制重新创建容器。\n  # 或者 `docker compose up -d --force-recreate bird_interact_eval` 只重新创建用于评估代码环境的bird_interact_eval容器。\n  ```\n   \n  这将使用Docker Hub上的预构建镜像运行3个容器：\n   - `bird_interact_postgresql`: 用于bird-interact-lite的PostgreSQL数据库\n   - `bird_interact_postgresql_full`: 用于bird-interact-full的PostgreSQL数据库\n   - `bird_interact_eval`: 用于`a-Interact`和`c-Interact`评估的环境。\n\n  现在，你可以通过执行以下命令来启动评估环境：\n  ```bash\n  docker compose exec bird_interact_eval bash\n  ```\n\n2. （可选）手动搭建环境（如果你想从头开始构建镜像）：\n   - 下载数据库转储文件\n      - [bird-interact-lite](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1QIGQlRKbkqApAOrQXPqFJgUg8rQ7HRRZ\u002Fview)。解压后重命名为`env\u002Fpostgre_table_dumps`。\n      - [bird-interact-full](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1V9SFIWebi27JtaDUAScG1xE9ELbYcWLR\u002Fview)。解压后重命名为`env\u002Fpostgre_table_dumps_full`。\n   - 通过运行`docker-compose.build.yml`手动构建环境。\n      ```bash\n      cd env\u002F\n      docker compose -f docker-compose.build.yml build\n      docker compose -f docker-compose.build.yml up -d\n      ```\n\n3. 
（推荐）检查数据库容器是否已成功构建并运行。\n\n- 打印容器构建日志，以确保数据库成功构建且无错误：\n   ```bash \n   docker logs bird_interact_postgresql > build_bird_interact_postgresql.log 2>&1\n   docker logs bird_interact_postgresql_full > build_bird_interact_postgresql_full.log 2>&1\n   ```\n   如果出现错误，日志文件中会显示“导入过程中发生错误：”字样。\n\n\n- 检查数据库容器的状态是否良好。\n   \n   使用我们提供的Python脚本验证数据库元数据：\n   ```bash\n   docker compose exec bird_interact_eval bash\n   cd \u002Fapp\u002Fenv\n   python check_db_metadata.py --host bird_interact_postgresql\n   python check_db_metadata.py --host bird_interact_postgresql_full\n   ```\n   \n   预期结果：\n   - **bird-interact-lite**：\n     - 📈 数据库总数：18\n     - 📋 表格总数：175\n     - 🔢 列总数：2286\n     - 📈 每表平均行数：1,038.48\n     - 💾 总大小：207.15 MB（左右）\n   - **bird-interact-full**：\n     - 📈 数据库总数：22\n     - 📋 表格总数：244\n     - 🔢 列总数：2011\n     - 📈 每表平均行数：1,121.19\n     - 💾 总大小：272.00 MB（左右）\n\n\n## 📦 数据集详情\n\n### 数据集描述\n\n- **数据库**：完整的PostgreSQL数据库可以从[鸟互动轻量版](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1QIGQlRKbkqApAOrQXPqFJgUg8rQ7HRRZ\u002Fview)和[鸟互动完整版](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1V9SFIWebi27JtaDUAScG1xE9ELbYcWLR\u002Fview)下载。\n- **数据**：每个数据实例包含以下主要部分：\n   - `selected_database`：数据库名称。\n   - `query`：明确的用户查询。\n   - `amb_user_query`：注入歧义后的用户查询。\n   - `user_query_ambiguity`：注入到用户查询中的歧义。\n   - `non_critical_ambiguity`：非关键性歧义，如排序、限制等。\n   - `knowledge_ambiguity`：由外部知识掩盖而产生的歧义。\n   - `sol_sql`：真实答案SQL解决方案。\n   - `preprocess_sql`：在执行解决方案或预测之前需要运行的SQL查询。\n   - `clean_up_sql`：测试用例执行完毕后，用于恢复数据库状态的SQL查询。\n   - `test_cases`：一组用于验证预测修正后SQL的测试用例。\n   - `follow_up`：标注好的后续问题。\n   - `external_knowledge`：与特定任务相关的外部知识。\n\n- **评估**：评估代码位于[`.\u002Fevaluation`](.\u002Fevaluation)目录中。\n- **整理者**：BIRD团队 & Google Cloud\n- **许可证**：[cc-by-sa-4.0](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-sa\u002F4.0\u002F)\n- **HuggingFace 数据集卡片**：[bird-interact-lite](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbirdsql\u002Fbird-interact-lite)\n  和 
[bird-interact-full](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbirdsql\u002Fbird-interact-full) 对应PostgreSQL；以及 [mini-interact](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbirdsql\u002Fmini-interact) 对应SQLite。\n### 数据集用途\n\n为避免因自动爬取导致的数据泄露，我们未随数据一同提供GT解决方案SQL及测试用例。\n请发送邮件至[bird.bench25@gmail.com](mailto:bird.bench25@gmail.com)，标题注明`[bird-interact-lite GT&Test Cases]`或`[bird-interact-full GT&Test Cases]`，即可自动获取bird-interact-lite或bird-interact-full数据集的真实答案和测试用例。\n\n### 将公开数据与真实标签和测试用例结合\n\n然后使用以下脚本将公开数据与真实标签和测试用例结合：\n\n以完整版为例：\n(1) 运行：\n```bash\npython combine_public_with_gt.py \u002Fpath\u002Fto\u002Fbird-interact-full\u002Fbird_interact_data.jsonl \u002Fpath\u002Fto\u002Fbird_interact_full_gt_kg_testcases_08022.jsonl \u002Fpath\u002Fto\u002Fbird_interact_data.jsonl  # bird_interact_full_gt_kg_testcases_08022.jsonl 是真实标签字段的数据，可通过邮件向我们索取。\n```\n这将在 `\u002Fpath\u002Fto\u002Fbird_interact_data.jsonl` 创建一个包含合并后数据的新文件。\n\n(2) 然后用合并后的数据替换原始的公开数据：\n\n```bash\ncp \u002Fpath\u002Fto\u002Fbird_interact_data.jsonl \u002Fpath\u002Fto\u002Fbird-interact-full\u002Fbird_interact_data.jsonl\n```\n\n其他版本（如 bird-interact-lite、mini 版等）也相同。只需设置正确的公开数据路径以及真实标签和测试用例路径，然后用合并后的数据替换原始的公开数据即可。\n\n\n\n\n\u003C!-- ### 使用 HuggingFace 上的数据集\n\n你可以通过以下命令从 HuggingFace 下载数据集：\n```bash\nfrom datasets import load_dataset\n# 加载 flash 版本的数据集\ndataset = load_dataset(\"birdsql\u002Fbird-interact-lite\")\nprint(dataset[\"lite\"][0])\n\n# 加载完整版的数据集（即将推出）\ndataset = load_dataset(\"birdsql\u002Fbird-interact-full\")\nprint(dataset[\"full\"][0])\n```\n\n或者你也可以使用提供的脚本下载完整版数据集，并将其拆分为不同的方言。\n```bash\ncd baseline\u002Fdata\npython pull_data.py \\\n  --schema_path path\u002Fto\u002Ffull_schema.jsonl \\\n  --input_path path\u002Fto\u002Finput.jsonl \\ # 输入 JSONL 文件的路径（如果想从 HuggingFace 下载数据集，此路径可以为空）\n  --output_folder path\u002Fto\u002Foutput_dir # 拆分后文件的输出文件夹\n``` -->\n\n## 文件夹结构\n```ultree\n.\n├── LICENSE\n├── README.md\n├── BIRD-Interact-ADK\n│   ├── ...\n│   └── README.md\n├── 
bird_interact_conv\n│   ├── ...\n│   └── README.md\n├── bird_interact_agent\n│   ├── ...\n│   └── README.md\n├── evaluation\n│   ├── docker-compose.yml\n│   ├── env\n│   ├── postgre_table_dumps\n│   ├── run\n│   └── src\n├── materials\n│   ├── ...\n└── requirements.txt\n```\n关于运行 **a-interact** 的详细信息，请参阅 `.\u002Fbird_interact_agent\u002FREADME.md`；**c-interact** 的相关信息请查阅 `.\u002Fbird_interact_conv\u002FREADME.md`；而基于 ADK 的实现则可在 `.\u002FBIRD-Interact-ADK\u002FREADME.md` 中找到。\n\n## 📋 待办事项清单\n\n- [x] 发布轻量版，bird-interact-lite（270）。\n- [x] 发布对话版，bird-interact-conv。\n- [x] 发布代理版，bird-interact-agent。\n- [x] 发布完整版 bird-interact-full（600）。\n- [x] 发布基于 ADK 的实现，BIRD-Interact-ADK。\n- [ ] 对用户模拟器进行 SFT \u002F RL 训练。\n\n## 致谢\n我们衷心感谢 **Irina Saparina**、**Mohammadreza Pourreza**、**Mehdi Bouzouina**、**Hailong Li**、**Jiatong Shi** 以及 **Shinji Watanabe** 教授，感谢他们富有成效的讨论和宝贵见解，这些都极大地帮助改进了本项目。\n\n## 创作团队：\nBIRD 团队 & Google Cloud\n\n\n\n\n\n\n\n## 引用\n\n```bibtex\n@inproceedings{\nhuo2026birdinteract,\ntitle={{BIRD}-{INTERACT}: Re-imagining Text-to-{SQL} Evaluation via Lens of Dynamic Interactions},\nauthor={Nan Huo and Xiaohan Xu and Jinyang Li and Per Jacobsson and Shipei Lin and Bowen Qin and Binyuan Hui and Xiaolong Li and Ge Qu and Shuzheng Si and Linheng Han and Edward Alexander and Xintong Zhu and Rui Qin and Ruihan Yu and Yiyao Jin and Feige Zhou and Weihao Zhong and Yun Chen and Hongyu Liu and Chenhao Ma and Fatma Ozcan and Yannis Papakonstantinou and Reynold Cheng},\nbooktitle={The Fourteenth International Conference on Learning Representations},\nyear={2026},\nurl={https:\u002F\u002Fopenreview.net\u002Fforum?id=nHrYBGujps}\n}\n```\n\n\n## 变更日志\n\n- [2025-11-06] 🐛 **Bug 修复** & 🐳 **Docker 更新**：将 sqlglot 版本更新至 26.16.4，以修复用户模拟器中 SQL 解析器无法正确解析 SQL 的问题。你可以在 `bird_interact_eval` 环境中通过运行 `pip install sqlglot==26.16.4` 来解决此问题。同时，`bird_interact_eval` 镜像也已更新，因此你可以拉取最新镜像并重新创建 `bird_interact_eval` 容器。\n- [2025-10-21] 🐳 **Docker 更新**：新增完整数据库环境的 Docker 镜像。我们已将 3 个 Docker 镜像（基础镜像、完整数据库环境镜像以及用于 
`a-Interact` 和 `c-Interact` 的评估环境镜像）推送到 Docker Hub，以方便环境搭建。无需再手动下载数据库转储文件并构建镜像！请从 Docker Hub 拉取最新镜像，并重新创建容器，例如使用 `docker compose down -v && docker compose pull && docker compose up -d --force-recreate`。\n- [2025-08-22] 🐛 **Bug 修复**：修复了在评估第二阶段 SQL 时，无法成功执行第一阶段 SQL 的问题，该问题会导致第二阶段的成功率降低。此 bug 仅影响那些第一阶段 SQL 会对数据库进行操作的任务，例如 CREATE table 等。","# BIRD-Interact 快速上手指南\n\nBIRD-Interact 是一个交互式 Text-to-SQL 基准测试工具，旨在通过动态交互（对话式与代理式）重新定义 SQL 生成模型的评估方式。本指南将帮助您快速搭建环境并运行基础评估。\n\n## 1. 环境准备\n\n在开始之前，请确保您的系统满足以下要求：\n\n*   **操作系统**: Linux 或 macOS (推荐)，Windows 需使用 WSL2。\n*   **Python 版本**: 3.10 或更高版本。\n*   **Docker**: 必须安装并运行 Docker Desktop 或 Docker Engine，用于加载数据库环境。\n*   **API Key**: 准备好您所使用的 LLM 提供商（如 OpenAI, Anthropic, Google Cloud 等）的 API Key。\n\n**前置依赖检查：**\n```bash\npython --version  # 确保 >= 3.10\ndocker --version  # 确保已安装\n```\n\n## 2. 安装步骤\n\n### 2.1 克隆项目\n首先从 GitHub 克隆仓库：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fbird-bench\u002FBIRD-Interact.git\ncd BIRD-Interact\n```\n\n### 2.2 创建虚拟环境\n推荐使用 `conda` 或 `venv` 创建独立的 Python 环境：\n```bash\npython -m venv bird_interact_env\nsource bird_interact_env\u002Fbin\u002Factivate  # Windows 用户请使用: bird_interact_env\\Scripts\\activate\n```\n\n### 2.3 安装核心依赖\n安装基础依赖包。如果遇到网络问题，可临时指定国内镜像源（如清华源）：\n```bash\n# 使用默认源\npip install -r requirements.txt\n\n# 或使用国内加速源 (推荐)\npip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 2.4 关键组件修复与更新\n根据官方最新公告，需特别注意以下两点以确保评估准确性：\n\n1.  **修复 SQL 解析器 bug**：强制安装特定版本的 `sqlglot`。\n    ```bash\n    pip install sqlglot==26.16.4\n    ```\n2.  **拉取 Docker 镜像**：项目已提供预构建的 Docker 镜像（包含完整数据库环境和评估环境），无需手动下载数据库转储文件。\n    ```bash\n    # 拉取基础评估环境镜像\n    docker pull birdsql\u002Fbird_interact_eval:latest\n    \n    # 如果需要完整数据库环境 (Full DB Env)\n    docker pull birdsql\u002Fbird_interact_full_db:latest\n    ```\n\n## 3. 
基本使用\n\nBIRD-Interact 提供两种主要的评估模式：**c-Interact** (被动对话式) 和 **a-Interact** (主动代理式)。以下以最常用的 **c-Interact** 为例演示运行流程。\n\n### 3.1 配置 API Key\n在终端中导出您的 API Key（以 OpenAI 为例）：\n```bash\nexport OPENAI_API_KEY=\"your-api-key-here\"\n```\n*注：若使用其他模型提供商，请参考对应目录下的 `.env` 配置说明。*\n\n### 3.2 运行评估脚本\n进入对话式交互目录并运行评估。以下命令将启动 Docker 容器加载数据库，并使用指定模型对 Lite 版本数据集进行评估：\n\n```bash\ncd bird_interact_conv\n\n# 运行评估示例 (请替换 model_name 和 dataset_path)\npython eval.py \\\n    --model_name gpt-4o \\\n    --dataset_path ..\u002Fdata\u002Fbird-interact-lite-exp \\\n    --output_dir .\u002Fresults\n```\n\n**参数说明：**\n*   `--model_name`: 您要测试的模型名称（需符合 LiteLLM 命名规范）。\n*   `--dataset_path`: 数据集路径，可使用 HuggingFace 下载的 `bird-interact-lite-exp` (270 任务) 或 `bird-interact-full` (600 任务)。\n*   `--output_dir`: 评估结果保存目录。\n\n### 3.3 查看结果\n运行结束后，检查输出目录中的 JSON 文件或日志，获取 `Normalized Reward` (标准化奖励) 和成功率指标。\n\n> **⚠️ 重要提示**：\n> 在正式运行大规模评估前，请务必先检查 Docker 容器的日志，确认所有数据库已成功加载且无报错。若数据库加载失败（显示为空），会导致评估结果异常偏低。\n>\n> 检查命令示例：\n> ```bash\n> docker logs \u003Ccontainer_id_or_name>\n> ```\n\n---\n*更多高级用法（如自定义 Agent 架构、并行执行等）请参考项目根目录下的 `BIRD-Interact-ADK` 模块及官方详细文档。*","某金融科技公司数据团队正在评估新一代 Text-to-SQL 模型，以构建能让业务人员通过自然语言直接查询复杂交易数据库的智能助手。\n\n### 没有 BIRD-Interact 时\n- **评估结果虚高**：传统静态评测仅对比最终 SQL 语句，模型即使靠“猜”对了答案但逻辑完全错误，仍被判为合格，掩盖了真实的推理缺陷。\n- **缺乏交互反馈**：无法模拟真实用户在面对模糊意图时的追问或澄清过程，导致模型在实际对话中一旦遇到歧义就立刻“胡编乱造”。\n- **调试黑盒化**：当模型生成错误查询时，开发人员只能看到最终错误的 SQL，无法定位是哪一步理解偏差或中间推理断裂导致了失败。\n- **场景覆盖单一**：测试集多为固定问答对，难以覆盖真实业务中需要多轮交互、动态修正的复杂查询场景。\n\n### 使用 BIRD-Interact 后\n- **动态精准验真**：引入动态交互视角，强制模型在生成 SQL 前进行必要的澄清或分步确认，确保执行逻辑与用户意图严格对齐，剔除侥幸得分。\n- **还原真实对话**：支持多轮交互评测，模拟用户补充条件或纠正误解的过程，验证模型在复杂沟通链条中的鲁棒性和适应能力。\n- **过程透明可溯**：完整记录从自然语言到最终 SQL 的交互推导路径，帮助开发者快速定位是语义解析错误还是逻辑规划失误，大幅缩短调优周期。\n- **覆盖长尾场景**：基于动态交互构建的测试用例，有效覆盖了需多次澄清的模糊查询，显著提升了模型在真实生产环境中的可用性。\n\nBIRD-Interact 通过将静态的“答题考试”升级为动态的“真人面试”，彻底解决了 Text-to-SQL 
模型在真实复杂交互中“高分低能”的落地难题。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbird-bench_BIRD-Interact_f6d7eb75.png","bird-bench","bird_sql","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fbird-bench_54426b4a.jpg",null,"https:\u002F\u002Fbird-bench.github.io\u002F","https:\u002F\u002Fgithub.com\u002Fbird-bench",[84,88,92],{"name":85,"color":86,"percentage":87},"Python","#3572A5",69.2,{"name":89,"color":90,"percentage":91},"HTML","#e34c26",27.2,{"name":93,"color":94,"percentage":95},"Shell","#89e051",3.6,677,15,"2026-04-08T07:34:15","MIT","未说明 (基于 Docker 和 Python，通常支持 Linux\u002FmacOS\u002FWindows)","未说明 (主要依赖外部 LLM API，如 OpenAI、Google ADK 等，本地无大型模型训练\u002F推理需求)","未说明 (建议至少 8GB 以运行 Docker 容器和数据库环境)",{"notes":104,"python":105,"dependencies":106},"1. 核心功能依赖 Docker 运行数据库环境（提供 Base\u002FFull DB Env 镜像），使用前需安装 Docker 并拉取指定镜像。\n2. 评估前务必检查 Docker 日志，确认数据库加载成功，否则会导致评估结果异常偏低。\n3. 项目主要作为基准测试工具，通过 API 调用外部大模型（如 GPT, Claude, Gemini 等），而非本地部署模型。\n4. 推荐使用 'BIRD-Interact-ADK' 架构进行研究，支持模块化替换 Agent、用户模拟器和数据库环境。\n5. 
需注意 sqlglot 版本必须为 26.16.4 以修复 SQL 解析 bug。","3.10+",[107,108,109,110],"openai>=1.40","sqlglot==26.16.4","docker","litellm (可选，用于 BIRD-Interact-ADK)",[15,46],"2026-03-27T02:49:30.150509","2026-04-15T11:26:11.228035",[115,120,125],{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},34271,"BIRD-INTERACT mini-interact 数据集在哪里可以下载？","mini-interact 数据集已托管在 Hugging Face Hub 上，可以通过以下链接访问和加载：https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fbirdsql\u002Fmini-interact。该数据集也已链接到项目的 GitHub README 中，支持使用 `datasets.load_dataset()` 直接加载。","https:\u002F\u002Fgithub.com\u002Fbird-bench\u002FBIRD-Interact\u002Fissues\u002F5",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},34272,"数据集中缺少 `clarifications` 字段，用户模拟器报错怎么办？","这是一个已知问题，`clarifications` 字段应包含 `user_query_ambiguity` 和 `knowledge_ambiguity`。维护者已在最新提交（18ff0b0d6f5296e42e60922d84ef81569c757d95）中修复了此问题。请拉取最新代码以获取包含正确字段的数据集或生成逻辑。","https:\u002F\u002Fgithub.com\u002Fbird-bench\u002FBIRD-Interact\u002Fissues\u002F6",{"id":126,"question_zh":127,"answer_zh":128,"source_url":124},34273,"运行 BIRD-Interact Agent 时推荐使用哪种脚本？","官方推荐使用批处理运行脚本 `run_batch_experiments.sh`（命令：`bash run_batch_experiments.sh`），而不是单样本运行脚本 `run_experiment.sh`。因为后者运行速度较慢且代码可能已过时，维护者会尽力保持单样本脚本更新，但批处理脚本更稳定高效。",[]]