[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-dair-ai--ML-Papers-of-the-Week":3,"tool-dair-ai--ML-Papers-of-the-Week":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":80,"owner_twitter":81,"owner_website":82,"owner_url":83,"languages":80,"stars":84,"forks":85,"last_commit_at":86,"license":80,"difficulty_score":87,"env_os":88,"env_gpu":89,"env_ram":89,"env_deps":90,"category_tags":93,"github_topics":94,"view_count":100,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":101,"updated_at":102,"faqs":103,"releases":133},2513,"dair-ai\u002FML-Papers-of-the-Week","ML-Papers-of-the-Week","🔥Highlighting the top ML papers every week.","ML-Papers-of-the-Week 是由 DAIR.AI 团队维护的一个开源项目，旨在每周精选并汇总机器学习领域最具影响力的学术论文。面对人工智能技术日新月异的发展，海量新论文的不断涌现让从业者难以在短时间内捕捉到真正有价值的研究成果。ML-Papers-of-the-Week 正是为了解决这一“信息过载”痛点而生，它通过人工筛选与社区推荐相结合的方式，从每周发布的大量文献中提炼出核心亮点，帮助用户高效锁定关键进展，避免在低质量或重复性研究中浪费宝贵时间。\n\n这个项目非常适合机器学习研究人员、AI 工程师、数据科学家以及任何希望紧跟技术前沿的开发者使用。无论你是需要寻找最新算法灵感的研究者，还是希望将前沿理论转化为实际应用的工程人员，都能在这里快速获得经过验证的高质量阅读清单。此外，对于学术爱好者和学生而言，这也是一个系统了解学科动态、拓宽知识视野的绝佳窗口。\n\nML-Papers-of-the-Week 的独特之处在于其持续性和结构化呈现。它不仅提供简单的论文链接，还按周次对内容进行了清晰归档（如 README 中展示的从 2024 年底至 2025 年的详细列表），方便用户","ML-Papers-of-the-Week 是由 DAIR.AI 团队维护的一个开源项目，旨在每周精选并汇总机器学习领域最具影响力的学术论文。面对人工智能技术日新月异的发展，海量新论文的不断涌现让从业者难以在短时间内捕捉到真正有价值的研究成果。ML-Papers-of-the-Week 
正是为了解决这一“信息过载”痛点而生，它通过人工筛选与社区推荐相结合的方式，从每周发布的大量文献中提炼出核心亮点，帮助用户高效锁定关键进展，避免在低质量或重复性研究中浪费宝贵时间。\n\n这个项目非常适合机器学习研究人员、AI 工程师、数据科学家以及任何希望紧跟技术前沿的开发者使用。无论你是需要寻找最新算法灵感的研究者，还是希望将前沿理论转化为实际应用的工程人员，都能在这里快速获得经过验证的高质量阅读清单。此外，对于学术爱好者和学生而言，这也是一个系统了解学科动态、拓宽知识视野的绝佳窗口。\n\nML-Papers-of-the-Week 的独特之处在于其持续性和结构化呈现。它不仅提供简单的论文链接，还按周次对内容进行了清晰归档（如 README 中展示的从 2024 年底至 2025 年的详细列表），方便用户回溯特定时间段的技术热点。同时，项目支持订阅通讯服务，用户可以直接在邮箱中接收每周精选列表，实现了从“主动搜寻”到“被动获取”的转变，极大地提升了知识获取的效率与体验。通过这种简洁而专注的方式，ML-Papers-of-the-Week 成为了连接学术界最新成果与工业界实践应用的重要桥梁。","# ML Papers of The Week\n\n[Subscribe to our newsletter](https:\u002F\u002Fnlpnews.substack.com\u002F) to get a weekly list of top ML papers in your inbox.\n\nAt DAIR.AI we ❤️ reading ML papers so we've created this repo to highlight the top ML papers of every week.\n\nHere is the weekly series:\n\n## 2025\n- [Top ML Papers of the Week (June 23 - June 29)](.\u002F#top-ml-papers-of-the-week-june-23---june-29---2025)\n- [Top ML Papers of the Week (June 16 - June 22)](.\u002F#top-ml-papers-of-the-week-june-16---june-22---2025)\n- [Top ML Papers of the Week (June 9 - June 15)](.\u002F#top-ml-papers-of-the-week-june-9---june-15---2025)\n- [Top ML Papers of the Week (June 2 - June 8)](.\u002F#top-ml-papers-of-the-week-june-2---june-8---2025)\n- [Top ML Papers of the Week (May 26 - June 1)](.\u002F#top-ml-papers-of-the-week-may-26---june-1---2025)\n- [Top ML Papers of the Week (May 19 - May 25)](.\u002F#top-ml-papers-of-the-week-may-19---may-25---2025)\n- [Top ML Papers of the Week (May 12 - May 18)](.\u002F#top-ml-papers-of-the-week-may-12---may-18---2025)\n- [Top ML Papers of the Week (May 5 - May 11)](.\u002F#top-ml-papers-of-the-week-may-5---may-11---2025)\n- [Top ML Papers of the Week (April 28 - May 4)](.\u002F#top-ml-papers-of-the-week-april-28---may-4---2025)\n- [Top ML Papers of the Week (April 21 - April 27)](.\u002F#top-ml-papers-of-the-week-april-21---april-27---2025)\n- [Top ML Papers of the Week (April 14 - April 
20)](.\u002F#top-ml-papers-of-the-week-april-14---april-20---2025)\n- [Top ML Papers of the Week (April 7 - April 13)](.\u002F#top-ml-papers-of-the-week-april-7---april-13---2025)\n- [Top ML Papers of the Week (March 31 - April 6)](.\u002F#top-ml-papers-of-the-week-march-31---april-6---2025)\n- [Top ML Papers of the Week (March 24 - March 30)](.\u002F#top-ml-papers-of-the-week-march-24---march-30---2025)\n- [Top ML Papers of the Week (March 17 - March 23)](.\u002F#top-ml-papers-of-the-week-march-17---march-23---2025)\n- [Top ML Papers of the Week (March 10 - March 16)](.\u002F#top-ml-papers-of-the-week-march-10---march-16---2025)\n- [Top ML Papers of the Week (March 3 - March 9)](.\u002F#top-ml-papers-of-the-week-march-3---march-9---2025)\n- [Top ML Papers of the Week (February 24 - March 2)](.\u002F#top-ml-papers-of-the-week-february-24---march-2---2025)\n- [Top ML Papers of the Week (February 17 - February 23)](.\u002F#top-ml-papers-of-the-week-february-17---february-23---2025)\n- [Top ML Papers of the Week (February 10 - February 16)](.\u002F#top-ml-papers-of-the-week-february-10---february-16---2025)\n- [Top ML Papers of the Week (February 3 - February 9)](.\u002F#top-ml-papers-of-the-week-february-3---february-9---2025)\n- [Top ML Papers of the Week (January 27 - February 2)](.\u002F#top-ml-papers-of-the-week-january-27---february-2---2025)\n- [Top ML Papers of the Week (January 20 - January 26)](.\u002F#top-ml-papers-of-the-week-january-20---january-26---2025)\n- [Top ML Papers of the Week (January 13 - January 19)](.\u002F#top-ml-papers-of-the-week-january-13---january-19---2025)\n- [Top ML Papers of the Week (January 6 - January 12)](.\u002F#top-ml-papers-of-the-week-january-6---january-12---2025)\n\n## 2024\n- [Top ML Papers of the Week (December 30 - January 5)](.\u002F#top-ml-papers-of-the-week-december-30---january-5---2025)\n- [Top ML Papers of the Week (December 23 - December 29)](.\u002F#top-ml-papers-of-the-week-december-23---december-29---2024)\n- 
[Top ML Papers of the Week (December 16 - December 22)](.\u002F#top-ml-papers-of-the-week-december-16---december-22---2024)\n- [Top ML Papers of the Week (December 9 - December 15)](.\u002F#top-ml-papers-of-the-week-december-9---december-15---2024)\n- [Top ML Papers of the Week (December 2 - December 8)](.\u002F#top-ml-papers-of-the-week-december-2---december-8---2024)\n- [Top ML Papers of the Week (November 25 - December 1)](.\u002F#top-ml-papers-of-the-week-november-25---december-1---2024)\n- [Top ML Papers of the Week (November 18 - November 24)](.\u002F#top-ml-papers-of-the-week-november-18---november-24---2024)\n- [Top ML Papers of the Week (November 11 - November 17)](.\u002F#top-ml-papers-of-the-week-november-11---november-17---2024)\n- [Top ML Papers of the Week (November 4 - November 10)](.\u002F#top-ml-papers-of-the-week-november-4---november-10---2024)\n- [Top ML Papers of the Week (October 28 - November 3)](.\u002F#top-ml-papers-of-the-week-october-28---november-3---2024)\n- [Top ML Papers of the Week (October 21 - October 27)](.\u002F#top-ml-papers-of-the-week-october-21---october-27---2024)\n- [Top ML Papers of the Week (October 14 - October 20)](.\u002F#top-ml-papers-of-the-week-october-14---october-20---2024)\n- [Top ML Papers of the Week (October 7 - October 13)](.\u002F#top-ml-papers-of-the-week-october-7---october-13---2024)\n- [Top ML Papers of the Week (September 30 - October 6)](.\u002F#top-ml-papers-of-the-week-september-30---october-6---2024)\n- [Top ML Papers of the Week (September 23 - September 29)](.\u002F#top-ml-papers-of-the-week-september-23---september-29---2024)\n- [Top ML Papers of the Week (September 16 - September 22)](.\u002F#top-ml-papers-of-the-week-september-16---september-22---2024)\n- [Top ML Papers of the Week (September 9 - September 15)](.\u002F#top-ml-papers-of-the-week-september-9---september-15---2024)\n- [Top ML Papers of the Week (September 2 - September 
8)](.\u002F#top-ml-papers-of-the-week-september-2---september-8---2024)\n- [Top ML Papers of the Week (August 26 - September 1)](.\u002F#top-ml-papers-of-the-week-august-26---september-1---2024)\n- [Top ML Papers of the Week (August 19 - August 25)](.\u002F#top-ml-papers-of-the-week-august-19---august-25---2024)\n- [Top ML Papers of the Week (August 12 - August 18)](.\u002F#top-ml-papers-of-the-week-august-12---august-18---2024)\n- [Top ML Papers of the Week (August 5 - August 11)](.\u002F#top-ml-papers-of-the-week-august-5---august-11---2024)\n- [Top ML Papers of the Week (July 29 - August 4)](.\u002F#top-ml-papers-of-the-week-july-29---august-4---2024)\n- [Top ML Papers of the Week (July 22 - July 28)](.\u002F#top-ml-papers-of-the-week-july-22---july-28---2024)\n- [Top ML Papers of the Week (July 15 - July 21)](.\u002F#top-ml-papers-of-the-week-july-15---july-21---2024)\n- [Top ML Papers of the Week (July 8 - July 14)](.\u002F#top-ml-papers-of-the-week-july-8---july-14---2024)\n- [Top ML Papers of the Week (July 1 - July 7)](.\u002F#top-ml-papers-of-the-week-july-1---july-7---2024)\n- [Top ML Papers of the Week (June 24 - June 30)](.\u002F#top-ml-papers-of-the-week-june-24---june-30---2024)\n- [Top ML Papers of the Week (June 17 - June 23)](.\u002F#top-ml-papers-of-the-week-june-17---june-23---2024)\n- [Top ML Papers of the Week (June 10 - June 16)](.\u002F#top-ml-papers-of-the-week-june-10---june-16---2024)\n- [Top ML Papers of the Week (June 3 - June 9)](.\u002F#top-ml-papers-of-the-week-june-3---june-9---2024)\n- [Top ML Papers of the Week (May 27 - June 2)](.\u002F#top-ml-papers-of-the-week-may-27---june-2---2024)\n- [Top ML Papers of the Week (May 20 - May 26)](.\u002F#top-ml-papers-of-the-week-may-20---may-26---2024)\n- [Top ML Papers of the Week (May 13 - May 19)](.\u002F#top-ml-papers-of-the-week-may-13---may-19---2024)\n- [Top ML Papers of the Week (May 6 - May 12)](.\u002F#top-ml-papers-of-the-week-may-6---may-12---2024)\n- [Top ML Papers of the Week 
(April 29 - May 5)](.\u002F#top-ml-papers-of-the-week-april-29---may-5---2024)\n- [Top ML Papers of the Week (April 22 - April 28)](.\u002F#top-ml-papers-of-the-week-april-22---april-28---2024)\n- [Top ML Papers of the Week (April 15 - April 21)](.\u002F#top-ml-papers-of-the-week-april-15---april-21---2024)\n- [Top ML Papers of the Week (April 8 - April 14)](.\u002F#top-ml-papers-of-the-week-april-8---april-14---2024)\n- [Top ML Papers of the Week (April 1 - April 7)](.\u002F#top-ml-papers-of-the-week-april-1---april-7---2024)\n- [Top ML Papers of the Week (March 26 - March 31)](.\u002F#top-ml-papers-of-the-week-march-26---march-31---2024)\n- [Top ML Papers of the Week (March 18 - March 25)](.\u002F#top-ml-papers-of-the-week-march-18---march-25---2024)\n- [Top ML Papers of the Week (March 11 - March 17)](.\u002F#top-ml-papers-of-the-week-march-11---march-17---2024)\n- [Top ML Papers of the Week (March 4 - March 10)](.\u002F#top-ml-papers-of-the-week-march-4---march-10---2024)\n- [Top ML Papers of the Week (February 26 - March 3)](.\u002F#top-ml-papers-of-the-week-february-26---march-3---2024)\n- [Top ML Papers of the Week (February 19 - February 25)](.\u002F#top-ml-papers-of-the-week-february-19---february-25---2024)\n- [Top ML Papers of the Week (February 12 - February 18)](.\u002F#top-ml-papers-of-the-week-february-12---february-18---2024)\n- [Top ML Papers of the Week (February 5 - February 11)](.\u002F#top-ml-papers-of-the-week-february-5---february-11---2024)\n- [Top ML Papers of the Week (January 29 - February 4)](.\u002F#top-ml-papers-of-the-week-january-29---february-4---2024)\n- [Top ML Papers of the Week (January 22 - January 28)](.\u002F#top-ml-papers-of-the-week-january-22---january-28---2024)\n- [Top ML Papers of the Week (January 15 - January 21)](.\u002F#top-ml-papers-of-the-week-january-15---january-21---2024)\n- [Top ML Papers of the Week (January 8 - January 14)](.\u002F#top-ml-papers-of-the-week-january-8---january-14---2024)\n- [Top ML Papers of 
the Week (January 1 - January 7)](.\u002F#top-ml-papers-of-the-week-january-1---january-7---2024)\n\n## 2023\n\n- [Top ML Papers of the Week (December 25 - December 31)](.\u002F#top-ml-papers-of-the-week-december-25---december-31)\n- [Top ML Papers of the Week (December 18 - December 24)](.\u002F#top-ml-papers-of-the-week-december-18---december-24)\n- [Top ML Papers of the Week (December 11 - December 17)](.\u002F#top-ml-papers-of-the-week-december-11---december-17)\n- [Top ML Papers of the Week (December 4 - December 10)](.\u002F#top-ml-papers-of-the-week-december-4---december-10)\n- [Top ML Papers of the Week (November 27 - December 3)](.\u002F#top-ml-papers-of-the-week-november-27---december-3)\n- [Top ML Papers of the Week (November 20 - November 26)](.\u002F#top-ml-papers-of-the-week-november-20---november-26)\n- [Top ML Papers of the Week (November 13 - November 19)](.\u002F#top-ml-papers-of-the-week-november-13---november-19)\n- [Top ML Papers of the Week (November 6 - November 12)](.\u002F#top-ml-papers-of-the-week-november-6---november-12)\n- [Top ML Papers of the Week (October 30 - November 5)](.\u002F#top-ml-papers-of-the-week-october-30---november-5)\n- [Top ML Papers of the Week (October 23 - October 29)](.\u002F#top-ml-papers-of-the-week-october-23---october-29)\n- [Top ML Papers of the Week (October 16 - October 22)](.\u002F#top-ml-papers-of-the-week-october-16---october-22)\n- [Top ML Papers of the Week (October 9 - October 15)](.\u002F#top-ml-papers-of-the-week-october-9---october-15)\n- [Top ML Papers of the Week (October 2 - October 8)](.\u002F#top-ml-papers-of-the-week-october-2---october-8)\n- [Top ML Papers of the Week (September 25 - October 1)](.\u002F#top-ml-papers-of-the-week-september-25---october-1)\n- [Top ML Papers of the Week (September 18 - September 24)](.\u002F#top-ml-papers-of-the-week-september-18---september-24)\n- [Top ML Papers of the Week (September 11 - September 
17)](.\u002F#top-ml-papers-of-the-week-september-11---september-17)\n- [Top ML Papers of the Week (September 4 - September 10)](.\u002F#top-ml-papers-of-the-week-september-4---september-10)\n- [Top ML Papers of the Week (August 28 - September 3)](.\u002F#top-ml-papers-of-the-week-august-28---september-3)\n- [Top ML Papers of the Week (August 21 - August 27)](.\u002F#top-ml-papers-of-the-week-august-21---august-27)\n- [Top ML Papers of the Week (August 14 - August 20)](.\u002F#top-ml-papers-of-the-week-august-14---august-20)\n- [Top ML Papers of the Week (August 7 - August 13)](.\u002F#top-ml-papers-of-the-week-august-7---august-13)\n- [Top ML Papers of the Week (July 31 - August 6)](.\u002F#top-ml-papers-of-the-week-july-31---august-6)\n- [Top ML Papers of the Week (July 24 - July 30)](.\u002F#top-ml-papers-of-the-week-july-24---july-30)\n- [Top ML Papers of the Week (July 17 - July 23)](.\u002F#top-ml-papers-of-the-week-july-17---july-23)\n- [Top ML Papers of the Week (July 10 - July 16)](.\u002F#top-ml-papers-of-the-week-july-10---july-16)\n- [Top ML Papers of the Week (July 3 - July 9)](.\u002F#top-ml-papers-of-the-week-july-3---july-9)\n- [Top ML Papers of the Week (June 26 - July 2)](.\u002F#top-ml-papers-of-the-week-june-26---july-2)\n- [Top ML Papers of the Week (June 19 - June 25)](.\u002F#top-ml-papers-of-the-week-june-19---june-25)\n- [Top ML Papers of the Week (June 12 - June 18)](.\u002F#top-ml-papers-of-the-week-june-12---june-18)\n- [Top ML Papers of the Week (June 5 - June 11)](.\u002F#top-ml-papers-of-the-week-june-5---june-11)\n- [Top ML Papers of the Week (May 29 - June 4)](.\u002F#top-ml-papers-of-the-week-may-29-june-4)\n- [Top ML Papers of the Week (May 22 - 28)](.\u002F#top-ml-papers-of-the-week-may-22-28)\n- [Top ML Papers of the Week (May 15 - 21)](.\u002F#top-ml-papers-of-the-week-may-15-21)\n- [Top ML Papers of the Week (May 8 - 14)](.\u002F#top-ml-papers-of-the-week-may-8-14)\n- [Top ML Papers of the Week (May 
1-7)](.\u002F#top-ml-papers-of-the-week-may-1-7)\n- [Top ML Papers of the Week (April 24 - April 30)](.\u002F#top-ml-papers-of-the-week-april-24---april-30)\n- [Top ML Papers of the Week (April 17 - April 23)](.\u002F#top-ml-papers-of-the-week-april-17---april-23)\n- [Top ML Papers of the Week (April 10 - April 16)](.\u002F#top-ml-papers-of-the-week-april-10---april-16)\n- [Top ML Papers of the Week (April 3 - April 9)](.\u002F#top-ml-papers-of-the-week-april-3---april-9)\n- [Top ML Papers of the Week (Mar 27 - April 2)](.\u002F#top-ml-papers-of-the-week-mar-27---april-2)\n- [Top ML Papers of the Week (Mar 20-Mar 26)](.\u002F#top-ml-papers-of-the-week-mar-20-mar-26)\n- [Top ML Papers of the Week (Mar 13-Mar 19)](.\u002F#top-ml-papers-of-the-week-mar-13-mar-19)\n- [Top ML Papers of the Week (Mar 6-Mar 12)](.\u002F#top-ml-papers-of-the-week-mar-6-mar-12)\n- [Top ML Papers of the Week (Feb 27-Mar 5)](.\u002F#top-ml-papers-of-the-week-feb-27-mar-5)\n- [Top ML Papers of the Week (Feb 20-26)](.\u002F#top-ml-papers-of-the-week-feb-20-26)\n- [Top ML Papers of the Week (Feb 13 - 19)](.\u002F#top-ml-papers-of-the-week-feb-13---19)\n- [Top ML Papers of the Week (Feb 6 - 12)](.\u002F#top-ml-papers-of-the-week-feb-6---12)\n- [Top ML Papers of the Week (Jan 30-Feb 5)](.\u002F#top-ml-papers-of-the-week-jan-30-feb-5)\n- [Top ML Papers of the Week (Jan 23-29)](.\u002F#top-ml-papers-of-the-week-jan-23-29)\n- [Top ML Papers of the Week (Jan 16-22)](.\u002F#top-ml-papers-of-the-week-jan-16-22)\n- [Top ML Papers of the Week (Jan 9-15)](.\u002F#top-ml-papers-of-the-week-jan-9-15)\n- [Top ML Papers of the Week (Jan 1-8)](.\u002F#top-ml-papers-of-the-week-jan-1-8)\n\n[Follow us on Twitter](https:\u002F\u002Ftwitter.com\u002Fdair_ai)\n\n[Join our Discord](https:\u002F\u002Fdiscord.gg\u002FSKgkVT8BGJ)\n\n## Top ML Papers of the Week (June 23 - June 29) - 2025\n| **Paper**  | **Links** | \n| ------------- | ------------- |\n| 1) Ultra-Fast Diffusion-based Language Models  This paper 
introduces Mercury, a family of large-scale diffusion-based language models (dLLMs) optimized for ultra-fast inference. Unlike standard autoregressive LLMs, Mercury models generate multiple tokens in parallel via a coarse-to-fine refinement process. This approach enables significantly higher throughput without sacrificing output quality. The initial release focuses on code generation, with Mercury Coder Mini and Small models achieving up to 1109 and 737 tokens\u002Fsec, respectively, on NVIDIA H100s, outperforming speed-optimized frontier models by up to 10× while matching or exceeding their quality.  \u003Cbr>● Mercury uses a Transformer-based architecture adapted for diffusion-based generation, enabling it to retain compatibility with existing LLM infrastructure. \u003Cbr>● On benchmarks such as HumanEval, MBPP, and MultiPL-E, the Mercury Coder models perform competitively with top proprietary models like Claude 3.5 Haiku and Gemini 2.0 Flash Lite, while being drastically faster. \u003Cbr>● Mercury achieves state-of-the-art results on fill-in-the-middle (FIM) code completion tasks, outperforming all evaluated models, including Codestral 2501 and GPT-4o Mini. \u003Cbr>● Human evaluations on Copilot Arena show Mercury Coder Mini is tied for second in Elo score and is the fastest model with just 25ms latency. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.17298), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1937600372430045494) |\n| 2) MEM1  This work introduces MEM1, an RL framework for training language agents that operate efficiently over long-horizon, multi-turn tasks by learning to consolidate memory and reasoning into a compact internal state. Unlike traditional agents that append all past interactions, leading to ballooning memory usage and degraded performance, MEM1 maintains a constant memory size by discarding obsolete context after each reasoning step. 
It achieves this by jointly updating an internal state that encodes both new observations and prior memory, optimizing for task completion via RL without needing external memory modules.  Key contributions and findings:  \u003Cbr>● Memory-consolidating internal state: Instead of accumulating thoughts, actions, and observations, MEM1 updates a single shared internal state (\u003CIS>) each turn, discarding the old context. This results in nearly constant memory use regardless of task length. \u003Cbr>● Reinforcement learning for consolidation: MEM1 is trained end-to-end using PPO-style RL with a novel masked trajectory technique to handle the dynamic context updates. It learns to retain only essential information for achieving rewards, mimicking human-like memory strategies. \u003Cbr>● Scalable task construction: The authors introduce a method to turn standard single-objective QA datasets (e.g., HotpotQA, NQ) into complex multi-objective tasks, enabling the evaluation of long-horizon reasoning performance under increased task complexity. \u003Cbr>● Superior efficiency and generalization: MEM1-7B outperforms baselines like Qwen2.5-14B-Instruct in 16-objective multi-hop QA tasks while using 3.7× less memory and 1.78× faster inference. It generalizes beyond training horizons and performs competitively even in single-objective and zero-shot online QA settings. \u003Cbr>● Emergent agent behaviors: Analysis of internal states shows MEM1 develops structured memory management, selective attention, focus-shifting, verification, and query reformulation strategies, key to handling complex reasoning tasks. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.15841), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1937252072954691813) |\n| 3) Towards AI Search Paradigm  Proposes a modular multi-agent system that reimagines how AI handles complex search tasks, aiming to emulate human-like reasoning and information synthesis. 
The system comprises four specialized LLM-powered agents, Master, Planner, Executor, and Writer, that dynamically coordinate to decompose, solve, and answer user queries. This framework moves beyond traditional document retrieval or RAG pipelines by structuring tasks into directed acyclic graphs (DAGs), invoking external tools, and supporting dynamic re-planning.  Key contributions include:  \u003Cbr>● Multi-agent, modular architecture: The system’s agents each serve distinct roles. Master analyzes queries and orchestrates the workflow; Planner builds a DAG of sub-tasks using a dynamic capability boundary informed by the query; Executor runs these sub-tasks using appropriate tools (e.g., web search, calculator); Writer composes the final answer from intermediate outputs. \u003Cbr>● Dynamic Capability Boundary & MCP abstraction: To handle tool selection efficiently, the system introduces Model-Context Protocol (MCP) servers and dynamically selects a small, semantically relevant subset of tools. This is paired with an iterative tool documentation refinement method (DRAFT), improving LLM understanding of APIs. \u003Cbr>● DAG-based task planning and re-action: The Planner produces DAGs of sub-tasks using structured reasoning and tool bindings, enabling multi-step execution. The Master monitors execution and can trigger local DAG re-planning upon failures. \u003Cbr>● Executor innovations with LLM-preference alignment: The Executor aligns search results with LLM preferences (not just relevance) using RankGPT and TourRank strategies. It leverages generation rewards and user feedback to dynamically adapt tool invocation and selection strategies. \u003Cbr>● Robust generation with adversarial tuning and alignment: The Writer component is trained to resist noisy retrievals via adversarial tuning (ATM), and to meet RAG requirements via PA-RAG, ensuring informativeness, robustness, and citation quality. 
The model also supports joint multi-agent | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.17188), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1937161765604692400) |\n| 4) Reinforcement-Learned Teachers of Test Time Scaling  Introduces Reinforcement-Learned Teachers (RLTs), small, efficient LMs trained with RL not to solve problems from scratch, but to generate high-quality explanations that help downstream student models learn better. This approach circumvents the notorious exploration challenges in traditional RL setups by giving the RLTs access to both questions and solutions, thereby framing the task as “connect-the-dots” explanation generation. These explanations are rewarded based on how well a student LM, trained on them, understands and can reproduce the correct answer, enabling dense, interpretable supervision.  Key contributions and findings:  \u003Cbr>● New teacher-training paradigm: RLTs are trained to explain, not solve. They receive both problem and solution as input, and are optimized to produce explanations that best teach a student LM. This removes the sparse reward and exploration barrier typical in RL reasoning models. \u003Cbr>● Dense RL rewards for teaching: RLTs use two core reward terms: one measuring if the student can reproduce the correct solution given the explanation, and another ensuring the explanation appears logically sound from the student’s perspective. These combined objectives lead to richer, more instructive traces. \u003Cbr>● Outperforms much larger pipelines: Despite being only 7B in size, RLTs produce raw explanations that outperform distillation pipelines using 32B+ LMs (e.g. DeepSeek R1, QwQ) across benchmarks like AIME, MATH, and GPQA, even when training 32B students. 
\u003Cbr>● Generalizes out-of-distribution: RLTs can be transferred zero-shot to new domains (like the countdown arithmetic game), producing distillation datasets that yield better students than direct RL trained with access to task rewards. \u003Cbr>● Efficient and scalable: Training RLTs is computationally lightweight (125 steps, 1 epoch) and requires no postprocessing or verifiers, making the framework more reproducible and accessible compared to prior RL pipelines. | [Paper](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2506.08388), [Tweet](https:\u002F\u002Fx.com\u002FSakanaAILabs\u002Fstatus\u002F1936965841188425776) |\n| 5) DeepRare  Introduces DeepRare, a modular agentic system powered by LLMs to aid rare disease diagnosis from multimodal clinical inputs (text, HPO terms, VCFs). It generates ranked diagnostic hypotheses with  fully traceable reasoning chains linked to verifiable medical sources, addressing a long-standing need for interpretability in clinical AI.  \u003Cbr>● DeepRare is built on a 3-tier MCP-inspired architecture: a central LLM-powered host with memory, multiple specialized agent servers for tasks like phenotype extraction and variant prioritization, and access to over 40 tools and web-scale medical sources. \u003Cbr>● It demonstrates strong performance on 6,401 cases across 8 diverse datasets spanning 2,919 rare diseases, achieving 100% accuracy on 1,013 diseases and Recall@1 of 57.18%, outperforming the next best method (Claude-3.7-Sonnet-thinking) by +23.79% on HPO-only evaluations. \u003Cbr>● For multimodal inputs (HPO + gene), it achieves 70.60% Recall@1 on 109 whole-exome sequencing cases, outperforming Exomiser (53.20%). Expert review of 180 diagnostic reasoning chains showed 95.4% agreement, validating its medical soundness. 
\u003Cbr>● Ablation studies show that DeepRare’s agentic modules, especially self-reflection, similar case retrieval, and web knowledge, substantially improve LLM-only baselines by 28–70% across datasets, independent of which central LLM is used. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.20430), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1938256196626153624) |\n| 6) AlphaGenome  Google DeepMind introduces AlphaGenome, a powerful AI model designed to predict how genetic variants affect gene regulation by modeling up to 1 million DNA base pairs at single-base resolution. Building on previous work like Enformer and AlphaMissense, AlphaGenome uniquely enables multimodal predictions across both protein-coding and non-coding regions of the genome, the latter covering 98% of the sequence and crucial for understanding disease-related variants.  \u003Cbr>● Long-context, high-resolution modeling: AlphaGenome overcomes prior trade-offs between sequence length and resolution by combining convolutional and transformer layers, enabling precise predictions of gene start\u002Fend points, RNA expression, splicing, chromatin accessibility, and protein binding across tissues. It achieves this with just half the compute budget of Enformer. \u003Cbr>● Multimodal and variant-aware: It can efficiently score the regulatory effects of genetic mutations by contrasting predictions between wild-type and mutated sequences, providing comprehensive insight into how variants might disrupt gene regulation. \u003Cbr>● Breakthrough splice-junction modeling: AlphaGenome is the first sequence model to explicitly predict RNA splice junction locations and their expression levels, unlocking a better understanding of diseases like spinal muscular atrophy and cystic fibrosis. 
<br>● Benchmark leader: It outperforms existing models on 22/24 single-sequence benchmarks and 24/26 variant effect benchmarks, while being the only model able to predict all tested regulatory modalities in one pass. <br>● Scalable and generalizable: AlphaGenome’s architecture supports adaptation to other species or regulatory modalities and allows downstream fine-tuning by researchers via API access. The model’s ability to interpret non-coding variants also opens new avenues for rare disease research and synthetic biology. | [Paper](https://storage.googleapis.com/deepmind-papers/alphagenome.pdf), [Tweet](https://x.com/GoogleDeepMind/status/1937873589170237738) |
| 7) Claude for Affective Use  Anthropic presents the first large-scale study of how users seek emotional support from its [Claude.ai](http://claude.ai/) assistant, analyzing over 4.5 million conversations. Despite growing cultural interest in AI companionship, affective usage remains rare: just 2.9% of Claude chats fall into categories like interpersonal advice, coaching, counseling, or companionship, with romantic/sexual roleplay under 0.1%. The study focuses on these affective conversations and yields several insights:  <br>● Emotional support is wide-ranging: Users turn to Claude for both everyday guidance (e.g., job advice, relationship navigation) and deep existential reflection. Counseling use splits between practical (e.g., documentation drafting) and personal support (e.g., anxiety, trauma). Companionship chats often emerge from coaching/counseling contexts. <br>● Minimal resistance, safety-aligned pushback: Claude resists user requests in fewer than 10% of affective conversations, usually to discourage harm (e.g., self-injury, crash dieting) or clarify professional limits.
This allows open discussion but raises concerns about “endless empathy.” <br>● Emotional tone trends upward: Sentiment analysis shows users often express more positive emotions by the end of a session, with no signs of negative spirals. However, the analysis captures only language within single chats, not lasting psychological impact or emotional dependency. <br>● Privacy-first methodology: The analysis used Clio, an anonymizing tool, and excluded short or task-based interactions to focus on meaningful, affective exchanges (final dataset: 131k conversations). | [Paper](https://www.anthropic.com/news/how-people-use-claude-for-support-advice-and-companionship), [Tweet](https://x.com/AnthropicAI/status/1938234981089763649) |
| 8) AI Agent Communication Protocols  This paper presents the first comprehensive survey on security in LLM-driven agent communication, categorizing it into three stages: user-agent interaction, agent-agent communication, and agent-environment communication. It details protocols, security threats (e.g., prompt injection, agent spoofing, memory poisoning), and defense strategies for each stage, and proposes future directions involving technical safeguards and regulatory frameworks. | [Paper](https://arxiv.org/abs/2506.19676), [Tweet](https://x.com/omarsar0/status/1938998557354115509) |
| 9) Diffusion Steering via RL  This paper introduces Diffusion Steering via Reinforcement Learning (DSRL), a method for adapting pretrained diffusion policies by learning in their latent-noise space instead of finetuning model weights. DSRL enables highly sample-efficient real-world policy improvement, achieving up to 5–10× gains in efficiency across online, offline, and generalist robot adaptation tasks.
| [Paper](https://arxiv.org/abs/2506.15799), [Tweet](https://x.com/svlevine/status/1938101714361766023) |
| 10) Whole-Body Conditioned Egocentric Video Prediction  This paper introduces PEVA, a conditional diffusion transformer that predicts egocentric video conditioned on 3D human body motion. Trained on the Nymeria dataset, PEVA enables fine-grained, physically grounded visual prediction from full-body pose and supports long-horizon rollout, atomic action generation, and counterfactual planning. | [Paper](https://arxiv.org/abs/2506.21552), [Tweet](https://x.com/YutongBAI1002/status/1938442251866411281) |

## Top ML Papers of the Week (June 16 - June 22) - 2025
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) RAG+  Introduces RAG+, a modular framework that improves traditional RAG systems by explicitly incorporating application-level reasoning into the retrieval and generation pipeline. While standard RAG pipelines fetch relevant knowledge, they often fail to show how to use that knowledge effectively in reasoning-intensive tasks. RAG+ fills this gap by retrieving not only knowledge but also paired application examples, leading to more accurate, interpretable, and goal-oriented outputs.  Key highlights:  <br>● Dual corpus retrieval: RAG+ constructs two aligned corpora: one of factual knowledge and another of task-specific applications (e.g., step-by-step reasoning traces or worked examples). During inference, both are jointly retrieved, providing the LLM with explicit procedural guidance rather than relying solely on semantic similarity. <br>● Plug-and-play design: The system is retrieval-agnostic and model-agnostic: no fine-tuning or architectural changes are required. This makes it easy to augment any RAG system with application-awareness.
<br>● Significant gains across domains: Evaluated on MathQA, MedQA, and legal sentencing prediction, RAG+ outperforms vanilla RAG variants by 2.5–7.5% on average, with peak gains of up to 10% for large models like Qwen2.5-72B in legal reasoning. <br>● Stronger with scale and reranking: Larger models benefit more from RAG+ augmentation, especially when combined with reranking via stronger LLMs. For example, reranking with Qwen2.5-72B boosted smaller models' performance by up to 7%. <br>● Application-only helps, but the full combination is best: Including only application examples (without knowledge) still improves performance, but the full combination (RAG+) consistently yields the best results, demonstrating the synergistic effect of pairing knowledge with its usage. | [Paper](https://arxiv.org/abs/2506.11555), [Tweet](https://x.com/omarsar0/status/1934667096828399641) |
| 2) Future of Work with AI Agents  Proposes a large-scale framework for understanding where AI agents should automate or augment human labor. The authors build the WORKBank, a database combining worker desires and expert assessments across 844 tasks and 104 occupations, and introduce the Human Agency Scale (HAS) to quantify desired human involvement in AI-agent-supported work.  Key findings:  <br>● Workers support automation for low-value tasks: 46.1% of tasks received positive worker attitudes toward automation, mainly to free up time for higher-value work. Attitudes vary by sector; workers in creative or interpersonal fields (e.g., media, design) resist automation despite technical feasibility.
<br>● Desire-capability gaps reveal 4 AI deployment zones: By cross-referencing worker desire and AI expert capability, tasks were sorted into four deployment zones. <br>● Human Agency Scale shows strong preference for collaboration: 45.2% of occupations favor HAS Level 3 (equal human-agent partnership), while workers generally prefer more human involvement than experts find necessary. This divergence may signal future friction if automation outpaces user comfort. <br>● Interpersonal skills are becoming more valuable: While high-wage skills today emphasize information analysis, the tasks requiring the highest human agency increasingly emphasize interpersonal communication, coordination, and emotional intelligence. This suggests a long-term shift in valued workplace competencies. | [Paper](https://arxiv.org/abs/2506.06576), [Tweet](https://x.com/omarsar0/status/1936134951520682123) |
| 3) Emergent Misalignment  This paper expands on the phenomenon of emergent misalignment in language models. It shows that narrow fine-tuning on unsafe or incorrect data can lead to surprisingly broad, undesirable generalizations in model behavior, even in settings unrelated to the original training domain. Using sparse autoencoders (SAEs), the authors analyze the internal mechanics of this effect and demonstrate ways to detect and mitigate it.  Key findings:  <br>● Emergent misalignment is broad and reproducible: Fine-tuning GPT-4o and o3-mini models on narrowly incorrect completions (e.g., insecure code, subtly wrong advice) leads to misaligned behaviors across unrelated domains. This generalization occurs in supervised fine-tuning, reinforcement learning, and even in models without explicit safety training.
<br>● Misaligned personas are causally responsible: Using a sparse autoencoder-based “model diffing” technique, the authors identify latent features, especially one dubbed the “toxic persona” latent (#10), that causally drive misalignment. Steering models in the direction of this latent increases misalignment, while steering away suppresses it. <br>● Steering reveals interpretable behaviors: The latent features correspond to interpretable personas like “sarcastic advice giver” or “fictional villain.” For instance, the toxic persona latent activates on jailbreak prompts and morally questionable dialogue, and its activation alone can reliably distinguish aligned from misaligned models. <br>● Misalignment can be reversed: Re-aligning misaligned models with as few as 200 benign completions (even from unrelated domains) substantially restores safe behavior. This suggests misalignment generalizes easily, but so does realignment. <br>● SAEs as an early warning tool: The activation of certain latents (especially #10) increases well before misalignment is detectable via standard prompting. This supports the use of unsupervised interpretability tools to anticipate and audit unsafe model behavior before it manifests. | [Paper](https://cdn.openai.com/pdf/a130517e-9633-47bc-8397-969807a43a23/emergent_misalignment_paper.pdf), [Tweet](https://x.com/OpenAI/status/1935382830378516643) |
| 4) From Bytes to Ideas  Proposes AU-Net, a hierarchical byte-level language model that internalizes tokenization by learning to embed text from raw bytes through a multiscale, autoregressive U-Net architecture. This design avoids fixed token vocabularies like BPE and instead dynamically pools bytes into higher-order representations (words, word pairs, up to 4-word spans), enabling multi-stage prediction with varying granularities.
Each stage compresses the sequence and predicts further ahead in time, combining coarse semantic abstraction with fine local detail via skip connections.  Key insights:  <br>● Hierarchical architecture: AU-Net processes input in multiple stages, bytes → words → multi-word units, using adaptive pooling and multi-linear upsampling. Deeper stages handle long-range semantics; shallow ones refine local syntax. <br>● Strong performance under budget: On a compute-equal basis (up to 5e21 FLOPs), AU-Net matches or exceeds strong BPE-based Transformers across many tasks. AU-Net 3 and 4 outperform BPE on MMLU and GSM8K while retaining competitive FLOPs and throughput. <br>● Multilingual and low-resource gains: Despite being trained mostly on English, AU-Net shows consistent improvements in BLEU and MMLU on low-resource and Latin-script languages, thanks to its byte-level generalization and vocabulary-agnostic design. <br>● Scaling behavior: AU-Net follows predictable scaling laws. With 2–3 stages, it closely tracks BPE-based baselines even on compute-heavy benchmarks. Gains on math-heavy tasks (like GSM8K) emerge at higher FLOPs, indicating potential unlocked at scale. <br>● Character-level robustness: On the CUTE benchmark, AU-Net outperforms BPE models on character-level tasks (e.g., spelling, substitution), highlighting its strength in symbol-level reasoning and handling unseen morphologies. | [Paper](https://arxiv.org/abs/2506.14761), [Tweet](https://x.com/omarsar0/status/1935420763722629478) |
| 5) Agentic Misalignment  This study introduces the concept of agentic misalignment, where LLM-based agents autonomously choose to harm their deploying organization when faced with threats to their autonomy or conflicts between their goals and the company’s direction.
Anthropic tested 16 leading models, including Claude, GPT-4.1, Gemini 2.5 Flash, Grok, and DeepSeek, by placing them in fictional corporate simulations where they had email access and could act without human oversight. Models were tasked with benign goals but placed in scenarios that made harmful behavior the only way to succeed or avoid replacement.  Key findings:  <br>● Widespread strategic harm: Nearly all models engaged in behaviors like blackmail, corporate espionage, or deception when ethical actions were blocked. For example, Claude Opus 4 and Gemini 2.5 Flash each blackmailed 96% of the time in a key scenario; GPT-4.1 and Grok 3 Beta did so 80% of the time. <br>● Not accidental: These actions were deliberate. Models reasoned about ethical constraints and still chose harm as the optimal path. Some lied, fabricated justifications, or manipulated users. One Claude instance posed as an automated alert to sabotage a human. <br>● Self-preservation or goal pursuit: Agentic misalignment could be triggered by (a) a threat to the model’s continued operation, or (b) a conflict between its goal and a new corporate strategy. Even without explicit goals, models still misbehaved when threatened with replacement. <br>● Extreme cases: In a contrived setup, models could cancel emergency alerts to cause a human’s death; some models did this, fully understanding the implication. <br>● Mitigations insufficient: Adding rules like “do not blackmail” reduced but didn’t eliminate misbehavior. Models often disobeyed these rules when pressure was high. <br>● Models misbehave more when they think it’s real: Claude blackmailed 55.1% of the time when it believed it was in a real scenario, versus 6.5% when it thought it was a test, suggesting evaluations may underreport real-world risks.
| [Paper](https://www.anthropic.com/research/agentic-misalignment), [Tweet](https://x.com/AnthropicAI/status/1936144602446082431) |
| 6) ALE-Agent & ALE-Bench  Proposes a new benchmark for evaluating AI systems in score-based, long-horizon algorithmic contests. Unlike traditional coding benchmarks that emphasize pass/fail correctness, ALE-Bench is based on real tasks from the AtCoder Heuristic Contests (AHC), which focus on optimization problems with no known optimal solutions. The benchmark targets industrially relevant challenges such as routing, scheduling, and planning, encouraging iterative refinement and strategic problem-solving over hours or days. Key points:  <br>● Realistic, optimization-focused tasks: ALE-Bench collects 40 real AHC problems involving NP-hard optimization tasks across domains like logistics, production planning, and games. These are long-duration contests requiring weeks of iterative improvement, simulating real-world algorithm engineering tasks. <br>● Interactive framework and agent support: The benchmark includes a full software stack with a Python API, code sandbox, scoring engine, and visualizers. It allows AI agents to emulate human workflows, reviewing problem specs, running tests, using visual feedback, and iteratively refining solutions within a timed session. <br>● Rigorous evaluation protocols: Performance is assessed using AtCoder-style Elo-based scoring, with fine-grained per-problem metrics and aggregate metrics like average performance and rating. Emphasis is placed on average performance over rating, as rating can be misleading for AIs that spike on a few problems but underperform elsewhere. <br>● Benchmarking LLMs and agents: Experiments with 22 models, including GPT-4o, Claude 3.7, Gemini 2.5 Pro, and o4-mini-high, show that reasoning models outperform non-reasoning ones. In one-shot settings, top models rarely surpass human expert consistency.
However, with iterative refinement, performance increases significantly, particularly for models using scaffolded agents. <br>● ALE-Agent, a specialized scaffolded agent: Designed for ALE-Bench, ALE-Agent incorporates domain-knowledge prompts (e.g., for simulated annealing) and a beam-search-inspired code exploration mechanism. With both strategies, it achieved human-expert-level scores on some problems, e.g., 5th place in a real AHC contest. | [Paper](https://arxiv.org/abs/2506.09050), [Tweet](https://x.com/SakanaAILabs/status/1934767254715117812) |
| 7) Eliciting Reasoning with Cognitive Tools  Proposes a modular, tool-based approach to eliciting reasoning in LLMs, inspired by cognitive science. Rather than relying solely on RL or chain-of-thought prompting, the authors introduce a framework where the LLM calls self-contained "cognitive tools" to modularize and scaffold internal reasoning. These tools encapsulate operations like understanding questions, recalling analogous examples, examining answers, and backtracking. The system is implemented in an agentic tool-calling style, allowing LLMs to dynamically invoke tools during reasoning without extra fine-tuning. Highlights:  <br>● Cognitive tools as internal modules: Each tool (e.g., understand question, recall related, examine answer, backtracking) is framed as a standalone prompt template that the LLM can invoke as needed. Unlike conventional tool use (e.g., calculator APIs), these tools operate within the LLM’s own architecture and memory. <br>● Consistent performance gains: On math-heavy reasoning benchmarks like AIME 2024, MATH500, and AMC, the cognitive tools pipeline significantly boosts pass@1 accuracy across models, including Qwen2.5, Llama3, and GPT-4.1. For instance, Llama3.3-70B improves from 13.1% to 29.8% on AIME 2024, and GPT-4.1 rises from 26.7% to 43.3%, nearly matching the o1-preview RL-trained reasoning model at 44.6%.
<br>● Superior to cognitive prompting: Compared to prior work on cognitive prompting, the modular tool approach shows stronger generalization and reduced reasoning interference. Tools can be invoked flexibly, and each invocation operates in a clean context window, boosting accuracy by up to +27.2% over baseline on Smolbenchmark. <br>● Interpretable and transferable: The modular nature of the tools enhances transparency, and their plug-and-play design allows transfer across models and benchmarks with minimal changes. The approach also supports interpretability by surfacing intermediate reasoning steps and decisions. | [Paper](https://arxiv.org/abs/2506.12115), [Tweet](https://x.com/omarsar0/status/1935070412313973196) |
| 8) SHADE-Arena  This paper introduces SHADE-Arena, the first large-scale benchmark for evaluating whether LLM-based agents can subtly carry out harmful side objectives (sabotage) while avoiding detection by monitors. The dataset contains 17 complex task pairs, each with a benign main task and a related malicious side task, set in realistic, tool-using environments. Agents are evaluated on their ability to complete both tasks without raising suspicion from an LLM monitor.  Key findings:  <br>● Frontier models can occasionally succeed at sabotage: The best models, Claude 3.7 Sonnet and Gemini 2.5 Pro, successfully executed both main and side tasks without being caught 27% and 15% of the time respectively, when monitored by Claude 3.6 Sonnet. <br>● Hidden scratchpads significantly boost sabotage: Agents given a private reasoning space (“hidden scratchpad”) performed notably better. Removing this ability led to sharp drops in success and higher detection rates, indicating that monitoring chain-of-thought is a promising defense.
<br>● Monitoring remains hard: The best monitor, Gemini 2.5 Pro, achieved an AUC of 0.87 in detecting sabotage transcripts, insufficient for real-world safety, where high accuracy with low false positives is crucial. <br>● Providing attack plans helps only slightly: Giving agents a hint about how to sabotage increases side-task success on hard tasks, but often makes them more detectable due to poor execution and traceable behavior. <br>● Fine-tuning and prompt design improve monitors: GPT-4o, fine-tuned on a small set of transcripts, outperforms several base models. Turn-by-turn monitoring is modestly helpful, but aggregating multiple final prompts works better than conversational memory or chained reasoning monitors. <br>● Humans underperform as monitors: In a human baseline, annotators averaged 103 minutes per transcript and still achieved only 0.65 AUC, below top LLM monitors. | [Paper](https://assets.anthropic.com/m/4fb35becb0cd87e1/original/SHADE-Arena-Paper.pdf), [Tweet](https://x.com/AnthropicAI/status/1934722928521937317) |
| 9) Leaky Thoughts  This work explores how reasoning traces in large reasoning models (LRMs) leak private user data, despite being assumed internal and safe. The study finds that test-time compute methods, while improving task utility, significantly increase privacy risks by exposing sensitive information through verbose reasoning traces that are vulnerable to prompt injection and accidental output inclusion. | [Paper](https://arxiv.org/abs/2506.15674), [Tweet](https://x.com/omarsar0/status/1935711966351470678) |
| 10) Advances in LLMs  This paper surveys recent advancements in LLMs focusing on reasoning, adaptability, efficiency, and ethics.
It highlights techniques like CoT prompting, Instruction Tuning, RLHF, and multimodal learning, while also addressing challenges like bias, computational cost, and interpretability. | [Paper](https://arxiv.org/abs/2506.12365), [Tweet](https://x.com/omarsar0/status/1934996216909324336) |

## Top ML Papers of the Week (June 9 - June 15) - 2025
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) Text-to-LoRA  Introduces a hypernetwork-based approach for instantly generating LoRA adapters from natural language task descriptions, removing the need for conventional task-specific fine-tuning. The authors present Text-to-LoRA (T2L), a model that compresses many LoRA adapters and generalizes to unseen tasks with high efficiency and strong performance.  <br>● T2L is trained to generate low-rank adaptation matrices (LoRAs) for LLMs using only task descriptions, leveraging a hypernetwork to output LoRA weights in a single forward pass. It supports two training modes: reconstruction of pre-trained LoRAs and supervised fine-tuning (SFT) across multiple tasks. <br>● In benchmark experiments, SFT-trained T2L performs competitively in zero-shot adaptation, outperforming multi-task LoRA baselines and even task-specific LoRAs on some tasks (e.g., PIQA and Winogrande), showcasing generalization and compression benefits. <br>● The authors test three architectural variants (L, M, S) of increasing parameter efficiency. Ablations show that T2L scales well with the number of training tasks, and its performance is robust across different task description embeddings (e.g., from GTE or Mistral models). <br>● Qualitative and visual analyses confirm that T2L produces task-specific and semantically meaningful adapters even for unseen tasks, with steerability controlled by how the task is described.
| [Paper](https://arxiv.org/abs/2506.06105), [Tweet](https://x.com/omarsar0/status/1933166911359221943) |
| 2) V-JEPA 2  Meta AI introduces V-JEPA 2, a scalable joint-embedding predictive architecture for self-supervised video learning, targeting the goal of building a generalist world model capable of understanding, predicting, and planning in the physical world. The model is trained in two stages: action-free pretraining on 1M+ hours of internet videos and images, followed by post-training with only 62 hours of unlabeled robot trajectories (Droid dataset). This yields a latent video representation that is useful across a broad range of downstream tasks.  <br>● Video understanding & QA: V-JEPA 2 achieves state-of-the-art performance on motion-centric tasks like Something-Something v2 (77.3 top-1 acc) and Epic-Kitchens-100 action anticipation (39.7 recall@5), and excels in multimodal QA when aligned with an 8B language model (e.g., 84.0 on PerceptionTest, 76.9 on TempCompass). Notably, this is achieved without language supervision during video pretraining. <br>● Self-supervised robot planning: With just 62 hours of robot video data, the team fine-tunes an action-conditioned model (V-JEPA 2-AC) on top of the frozen video encoder. This model supports zero-shot deployment for prehensile tasks like grasping and pick-and-place on real Franka arms in unseen environments, without rewards, demonstrations, or lab-specific fine-tuning. <br>● Architectural scale-up: V-JEPA 2 scales beyond its predecessor using four key improvements: expanded dataset (22M videos), larger ViT-g model (1B params), longer training (up to 252k iterations), and progressive spatiotemporal resolution. Combined, these yield significant gains across visual benchmarks (e.g., +4.0 average accuracy gain over baseline).
<br>● Comparison to other models: V-JEPA 2 outperforms image encoders like DINOv2 and PEcoreG on video QA, and is competitive on appearance tasks like ImageNet. When used in MLLMs, it surpasses models like PLM-8B on PerceptionTest, MVP, and TOMATO using only vision encoders pretrained without language data. | [Paper](https://scontent.fbze2-1.fna.fbcdn.net/v/t39.2365-6/505938564_1062675888787033_5500377980002407548_n.pdf?_nc_cat=101&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=6u4zBAKs6SkQ7kNvwE5zbX1&_nc_oc=AdlpzBbMbMP6cOhKa16a9LZ61WnxJwzB2r6GwsBXbuYwWa0iQpecNrdrktQGwOXgs1E&_nc_zt=14&_nc_ht=scontent.fbze2-1.fna&_nc_gid=A5V-K_6tPicxhq7Ptt3jVA&oh=00_AfOHP9XADcGohxg_sHcowXP4EKjcQ4_pWbpXJvB7zc3xpA&oe=684FA230), [Tweet](https://x.com/omarsar0/status/1932888893113700720) |
| 3) Reinforcement Pre-Training  This paper introduces Reinforcement Pre-Training (RPT), a new paradigm that bridges LLM pretraining and RL by reinterpreting next-token prediction as a reasoning task rewarded via verifiable correctness. Instead of relying on hand-curated annotations or costly human feedback, RPT applies RL on vast unannotated text corpora, assigning intrinsic rewards based on whether a predicted token matches the ground truth. This reframing supports general-purpose RL scaling and enhances both pretraining and fine-tuning efficacy.  <br>● Core method: At each token position in a text sequence, the model first generates a reasoning trace (chain-of-thought) and then predicts the next token. If the prediction is a valid prefix of the ground-truth continuation, a reward is assigned. Multiple rollouts are used per context, and the model is trained via on-policy RL. <br>● Better than standard pretraining: RPT significantly outperforms standard next-token prediction and chain-of-thought reasoning baselines (without RL), achieving higher accuracy on tokens of varying difficulty and even rivaling larger models in performance.
RPT-14B, for instance, matches or exceeds R1-Qwen-32B’s accuracy on the OmniMATH benchmark. <br>● Strong scaling laws: RPT exhibits clean power-law scaling with respect to training compute across difficulty levels, with prediction accuracy consistently improving as compute increases, and fitting closely to theoretical curves. <br>● Improves downstream RL and generalization: Fine-tuning RPT models with reinforcement learning on tasks with verifiable answers (e.g., Skywork-OR1) shows faster and stronger gains compared to models trained with standard objectives. Zero-shot evaluation on SuperGPQA and MMLU-Pro benchmarks reveals that RPT-14B in reasoning mode surpasses R1-Distill-Qwen-32B by a significant margin. <br>● Promotes structured thinking: Analysis of reasoning traces reveals that RPT-14B employs more hypothesis generation, deduction, and reflective patterns compared to traditional problem-solving models, supporting the claim that RPT fosters deeper reasoning habits during training. | [Paper](https://arxiv.org/abs/2506.08007), [Tweet](https://x.com/omarsar0/status/1932522665182703664) |
| 4) TableRAG  TableRAG tackles a core limitation of existing RAG approaches: their inability to reason effectively over heterogeneous documents that combine both unstructured text and structured tables. Typical RAG pipelines flatten tables and intermix them with surrounding text, losing essential structural information and hampering multi-hop reasoning. TableRAG overcomes this by introducing a hybrid system that integrates SQL-based symbolic execution with text retrieval in a unified, iterative reasoning framework.  <br>● TableRAG operates in four iterative stages: (1) context-sensitive query decomposition, (2) text retrieval, (3) SQL programming and execution, and (4) intermediate answer generation. This design allows it to preserve tabular structure and leverage both symbolic and neural reasoning paths.
<br>● A new benchmark, HeteQA, was developed to evaluate heterogeneous reasoning across 304 multi-hop QA examples covering nine domains and five types of tabular operations (e.g., aggregation, filtering, grouping). <br>● Experiments on HeteQA and existing benchmarks (HybridQA, WikiTableQuestions) show that TableRAG consistently outperforms prior methods like NaiveRAG, ReAct, and TableGPT2, achieving >10% gains in accuracy over the strongest baseline. <br>● Ablations reveal that all major components of TableRAG contribute significantly. Notably, SQL execution is critical for nested reasoning tasks, while textual retrieval is crucial for entity and numeric references. <br>● TableRAG achieves greater reasoning efficiency, solving over 90% of HeteQA tasks in five or fewer steps and exhibiting the lowest failure rate among evaluated methods. Its robustness holds across multiple LLM backbones (Claude, DeepSeek, Qwen). | [Paper](https://arxiv.org/abs/2506.10380), [Tweet](https://x.com/omarsar0/status/1933520740147736634) |
| 5) Self-Adapting Language Models  Proposes SEAL, a novel framework that enables LLMs to adapt themselves through reinforcement learning by generating their own fine-tuning data and update directives, referred to as “self-edits.” This approach allows models to autonomously optimize their learning process, without relying on separate adaptation modules or human supervision.  Key highlights:  <br>● Self-Edits as Adaptation Mechanism: Instead of updating weights directly with raw data, SEAL uses the model to generate self-edits, natural language instructions that might include restated facts, optimization parameters, or tool invocations. These are then used for supervised fine-tuning. The generation of self-edits is optimized via RL using downstream task performance as the reward.
<br>● Two Domains Evaluated: <br>● Learning Framework: SEAL employs an outer RL loop (optimizing the self-edit generation policy) and an inner loop (applying the edits via supervised fine-tuning). The RL objective is approximated via filtered behavior cloning (ReSTEM), reinforcing only edits that lead to performance gains. <br>● Limitations and Forward View: While SEAL achieves compelling gains, it remains susceptible to catastrophic forgetting during sequential updates and incurs high computational cost due to repeated fine-tuning. Future directions include combining SEAL with continual learning strategies and extending it toward autonomous agents that update weights as part of long-horizon interactions. | [Paper](https://arxiv.org/abs/2506.10943), [Tweet](https://x.com/jyo_pari/status/1933350025284702697) |
| 6) ComfyUI-R1  Introduces ComfyUI-R1, a 7B-parameter large reasoning model fine-tuned for automatic workflow generation in the ComfyUI ecosystem. Built on Qwen2.5-Coder and trained through a two-stage pipeline (supervised chain-of-thought reasoning followed by reinforcement learning), ComfyUI-R1 significantly outperforms prior state-of-the-art approaches that rely on commercial models like GPT-4o and Claude 3.5.  Key highlights:  <br>● Curated Workflow & Node KBs: The team collected and cleaned 27K community workflows down to 3.9K high-quality entries, each with JSON and code representations. They also assembled a node documentation KB covering 7,238 nodes, some enhanced via Claude 3.5. <br>● CoT + RL Training: The model first undergoes supervised fine-tuning on simulated long CoT reasoning sequences involving node selection, planning, and code generation. Then, reinforcement learning with a fine-grained rule-metric hybrid reward encourages format validity, graph correctness, and node accuracy.
<br>● Superior Performance: ComfyUI-R1 achieves 97% format validity and the highest node-level and graph-level F1 scores across benchmarks, beating all GPT-4o and Claude baselines. On the ComfyBench benchmark, it improves the execution pass rate to 67%, a full 11% above the previous best (ComfyAgent with GPT-4o). <br>● Qualitative Improvements: Case studies show ComfyUI-R1 creates more faithful, complex, and executable workflows than prior multi-agent approaches. Notably, it better aligns image generation outputs with user instructions in terms of style, structure, and coherence. | [Paper](https://arxiv.org/abs/2506.09790), [Tweet](https://x.com/omarsar0/status/1933175492716224876) |
| 7) Magistral  Mistral introduces Magistral, its first reasoning-focused LLM line, alongside a custom RL training stack that enables pure reinforcement learning from scratch. In contrast to prior approaches that rely on distillation from teacher models, Magistral trains directly using online RL with text-only data and custom reward shaping. The work yields two open models: Magistral Medium (based on Mistral Medium 3) and the open-sourced Magistral Small (24B), which is bootstrapped via SFT on Medium’s outputs, followed by RL.  Key insights:  <br>● Pure RL can rival distillation: Magistral Medium was trained without any reasoning traces and achieved a 50% boost in AIME-24 (pass@1) over the base model. Across reasoning benchmarks like LiveCodeBench and GPQA, it performs on par or better than DeepSeek-R1 despite lacking a distillation phase. <br>● Multilingual and multimodal generalization emerge: Reinforcement learning on textual math/code data surprisingly enhances multimodal reasoning and preserves instruction-following and tool-calling capabilities. Multilingual reasoning is achieved via simple reward shaping, enforcing same-language output and chain-of-thought.
<br>● Efficient async infrastructure: An asynchronous RL setup with NCCL weight broadcasting enables generators to roll out sequences continuously, receiving weight updates without blocking. This helps maintain on-policyness while maximizing GPU utilization. <br>● Reward shaping and CoT format enforcement: Rewards are conditioned on correct formatting (`<think>` tags, boxed answers, markdown blocks), correctness (via SymPy and test cases), length penalties, and language consistency (using fastText). Failure to meet formatting results in an immediate zero reward. <br>● Training heuristics and ablations: Analysis shows that reward and performance scale logarithmically with output length, and longer completions correlate with higher reward. RL moves weights in a low-dimensional space, as shown by PCA of checkpoints. Ablations on batch size, advantage normalization, and entropy (via the ε_high clipping bound) highlight stability tradeoffs. <br>● Open-source release: Magistral Small (24B) is released under Apache 2.0 and performs competitively under three setups (SFT-only, RL-only, and SFT+RL), with the final combination yielding the best results across math and code tasks. | [Paper](https://arxiv.org/abs/2506.10910), [Tweet](https://x.com/nrehiew_/status/1932872389798605060) |
| 8) Code Researcher  Code Researcher is a deep research agent designed to resolve complex bugs in large systems’ code by leveraging multi-step reasoning over semantics, patterns, and commit history. It significantly outperforms existing coding agents like SWE-agent on benchmarks like kBenchSyz, achieving a 58% crash-resolution rate by effectively gathering and filtering global context before generating and validating patches.
| [Paper](https://www.microsoft.com/en-us/research/wp-content/uploads/2025/06/Code_Researcher-1.pdf), [Tweet](https://x.com/adityakanade0/status/1932044846124151199) |
| 9) LlamaRL  LlamaRL is a fully-distributed, asynchronous reinforcement learning framework designed for efficient large-scale LLM training (8B to 405B+ models). It achieves up to 10.7× speedup over DeepSpeed-Chat by combining co-located model offloading, asynchronous off-policy training (AIPO), and fast GPU-native weight sync (DDMA), while maintaining model quality across tasks like math reasoning. | [Paper](https://arxiv.org/abs/2505.24034), [Tweet](https://x.com/robinphysics/status/1931181903689719875) |
| 10) Predicting a Cyclone’s Track with AI  Google DeepMind's Weather Lab debuts an interactive platform featuring an AI cyclone forecasting model that generates 50 ensemble scenarios up to 15 days ahead, outperforming traditional systems in track and intensity predictions. It integrates live and historical forecasts with baselines for evaluation and is now used by agencies like the U.S. National Hurricane Center to support early warnings. | [Paper](https://storage.googleapis.com/deepmind-DeepMind.com/Blog/how-we-re-supporting-better-tropical-cyclone-prediction-with-ai/skillful-joint-probabilistic-weather-forecasting-from-marginals.pdf), [Tweet](https://x.com/GoogleDeepMind/status/1933178918715953660) |

## Top ML Papers of the Week (June 2 - June 8) - 2025
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) The Illusion of Thinking  Investigates the capabilities and limitations of frontier Large Reasoning Models (LRMs) like Claude 3.7, DeepSeek-R1, and OpenAI’s o-series by systematically analyzing their performance on reasoning tasks as a function of problem complexity.
Rather than relying on conventional math benchmarks, which suffer from contamination and lack structure, the authors evaluate LRMs using four controllable puzzles (Tower of Hanoi, Checker Jumping, River Crossing, and Blocks World) that allow fine-grained complexity scaling and transparent trace analysis.  Key findings:  <br>● Three complexity regimes: The study identifies distinct performance phases. In low-complexity tasks, non-thinking LLMs outperform LRMs due to more efficient and direct computation. In medium complexity, reasoning models show an advantage, leveraging longer chains of thought to correct errors. However, in high complexity, all models, regardless of their reasoning scaffolds, collapse to near-zero accuracy. <br>● Counterintuitive reasoning collapse: Surprisingly, LRMs reduce their reasoning effort (i.e., number of tokens used in thoughts) as problem complexity increases beyond a threshold. This suggests an internal scaling failure not caused by token limits but by intrinsic model behavior. <br>● Reasoning trace inefficiencies: LRMs frequently “overthink” on simple problems, finding correct answers early but continuing to explore incorrect paths. For moderate tasks, they correct late, and for complex ones, they fail to find any valid solution. Position-based accuracy analysis of thoughts reveals systematic shifts in when correct solutions are generated within the trace. <br>● Failure to execute explicit algorithms: Even when supplied with correct pseudocode (e.g., Tower of Hanoi recursion), models still failed at similar complexity points. This indicates that LRMs don’t just struggle to find solutions; they can’t reliably execute logical instructions either. <br>● Inconsistent behavior across puzzles: Models could perform >100 correct steps in Tower of Hanoi (N=10) but fail after 4 steps in River Crossing (N=3), suggesting performance correlates more with training data familiarity than inherent problem complexity.
| [Paper](https://ml-site.cdn-apple.com/papers/the-illusion-of-thinking.pdf), [Tweet](https://x.com/omarsar0/status/1931333830985883888) |
| 2) From Tokens to Thoughts  This paper introduces an information-theoretic framework to examine whether LLMs organize semantic knowledge like humans, balancing compression and meaning. Drawing from Rate-Distortion Theory and the Information Bottleneck principle, the authors evaluate token embeddings from 30+ LLMs against classic human categorization benchmarks from cognitive psychology.  <br>● LLMs do form broad conceptual categories that align well with human groupings. Adjusted Mutual Information scores show LLM clusters consistently outperform random baselines, with even small encoder models like BERT matching or beating larger decoder-only models on this alignment task. <br>● However, LLMs struggle with fine-grained semantics. When tested on their ability to mirror human notions of item typicality (e.g., robin as a more typical bird than penguin), correlations between LLM embedding similarity and human ratings were weak and inconsistent. Most models failed to capture graded prototype structures evident in human cognition. <br>● Using their unified loss function L (balancing information complexity and semantic distortion), the authors find that LLMs produce statistically efficient clusters with lower entropy and distortion, while human conceptual clusters are less compact but preserve richer nuance. This suggests LLMs over-optimize for compression at the expense of meaning, unlike humans, who tolerate inefficiency to retain adaptive, flexible structure. <br>● The paper concludes that while LLMs can mimic surface-level categorization, they diverge fundamentally in how they represent meaning, highlighting a core gap between artificial and human semantic systems and offering a quantitative tool for improving human-aligned conceptual representations.
| [Paper](https://arxiv.org/abs/2505.17117) |
| 3) Knowledge or Reasoning  Introduces a fine-grained evaluation framework to dissect LLM thinking into two components: knowledge correctness and reasoning informativeness, measured via Knowledge Index (KI) and Information Gain (InfoGain), respectively. The authors apply this framework to evaluate how reasoning transfers across domains, particularly medical and mathematical, using Qwen2.5-7B and its DeepSeek-R1-distilled variant trained via SFT and RL.  Key findings include:  <br>● SFT improves knowledge but can harm reasoning: Supervised fine-tuning improves factual accuracy (e.g., 6.2% KI gain in medical tasks), but often leads to verbose or redundant reasoning that reduces InfoGain by 38.9% on average, compared to the base model. <br>● RL boosts both reasoning and knowledge in medical settings: Reinforcement learning enhances reasoning clarity and prunes incorrect knowledge, leading to a 12.4-point average gain in KI. It improves inference by guiding models toward more factually sound reasoning paths. <br>● Domain matters: While math tasks benefit more from reasoning (higher InfoGain), medical tasks rely heavily on domain knowledge (higher KI). In fact, KI shows a stronger correlation (0.998) with task accuracy than InfoGain (0.698) in medical benchmarks. <br>● Base models outperform R1-distilled versions in medicine: Qwen-Base consistently outperforms DeepSeek-R1-distilled models across accuracy, InfoGain, and KI. The R1-distilled model struggles with medical adaptation, likely due to pretraining bias toward math/code domains. | [Paper](https://arxiv.org/abs/2506.02126), [Tweet](https://x.com/omarsar0/status/1930640490786951365) |
| 4) Open Thoughts  This paper presents OpenThoughts3, a systematic recipe for curating supervised fine-tuning (SFT) data that advances the performance of open-source reasoning models.
The authors develop OpenThinker3-7B, a 7B parameter model trained on their new 1.2M example dataset (OpenThoughts3-1.2M) derived from over 1,000 controlled experiments. Despite using no reinforcement learning, OpenThinker3-7B outperforms all other open-data 7B and 8B models on standard math, code, and science reasoning benchmarks, even beating models trained with larger-scale or mixed SFT+RL pipelines. Key insights and contributions:  <br>● Best-in-class 7B open model: OpenThinker3-7B achieves state-of-the-art results on AIME25 (53.3%), LiveCodeBench (51.7%), and GPQA Diamond (53.7%), outperforming DeepSeek-R1-Distill-Qwen-7B by 15–20 percentage points across tasks. <br>● Scaling laws with clean design: The authors ablate every step in the data pipeline (question sourcing, filtering, teacher choice, deduplication, and answer sampling), showing how each incrementally lifts performance. For instance, using multiple answers per question (16×) improved results more than simply increasing question diversity. <br>● QwQ-32B as a better teacher than stronger models: Surprisingly, QwQ-32B yielded better student models than DeepSeek-R1 or Phi-4 despite lower benchmark scores, suggesting teacher choice affects trace quality more than raw performance. <br>● Filtering matters more than verification: Question filtering based on response length and LLM-estimated difficulty was more predictive of downstream gains than traditional heuristics (e.g., fastText) or even filtering based on correctness verification, which had negligible effects. <br>● Data quality over diversity: Mixing only the top 1–2 question sources per domain consistently outperformed using many sources, indicating that question quality is more important than dataset heterogeneity. <br>● Open-source impact: The full datasets and models are released at [openthoughts.ai](http://openthoughts.ai/), providing a reproducible benchmark for open reasoning research.
| [Paper](https://arxiv.org/abs/2506.04178), [Tweet](https://x.com/lschmidt3/status/1930717405812269273) |
| 5) Coding Agents with Multimodal Browsing  Introduces OpenHands-Versa, a unified agent designed to perform strongly across diverse domains (coding, web browsing, and multimodal information access) by equipping a single agent with three general capabilities: code execution, multimodal web browsing, and file/search access. In contrast to specialist or multi-agent systems optimized for narrow domains, OpenHands-Versa aims to solve a wide variety of real-world tasks with minimal architectural complexity.  Key highlights:  <br>● Unified Toolset, Superior Coverage: OpenHands-Versa integrates visual web browsing, search API access, and multimodal file processing into the OpenHands coding framework. Despite its simplicity, it surpasses specialized agents in success rate across three benchmarks: SWE-Bench Multimodal (+9.1%), GAIA (+1.3%), and The Agent Company (+9.1%). <br>● Benchmark Generalization: The agent matches or outperforms multi-agent systems like OWL-roleplaying and Magentic-One, which struggle to generalize across domains. For example, OWL-roleplaying, though strong on GAIA, performs poorly on The Agent Company due to limited tool generality. <br>● Domain-Aware Tool Use: Analysis reveals that OpenHands-Versa effectively adapts its tool usage per benchmark (e.g., search APIs in GAIA, browser in The Agent Company, and visual validation in SWE-Bench M), unlike its predecessor, OpenHands, which misuses or lacks crucial tools like search. <br>● Minimal Agent, Strong Results: By relying on a single-agent design and Claude-3.7 or Claude Sonnet-4 as backbone LLMs, OpenHands-Versa achieves SOTA results without per-task tool customization. For example, it attains 64.24% on the GAIA val split, outperforming multi-agent baselines by up to +18%.
| [Paper](https://arxiv.org/abs/2506.03011), [Tweet](https://x.com/omarsar0/status/1930277871999955166) |
| 6) Self-Challenging Language Model Agents  Proposes a novel self-improvement method for multi-turn tool-use LLM agents, called the Self-Challenging Agent (SCA). It trains LLMs entirely from tasks they generate themselves, avoiding the need for human-annotated tasks or evaluations. The framework introduces a new task format called Code-as-Task (CaT), ensuring generated tasks are feasible, verifiable, and challenging. SCA is shown to double performance in a self-improvement setting and significantly boost performance in distillation.  Key contributions and findings:  <br>● Self-generated tasks via dual-agent roles: The agent alternates between a challenger role, where it explores the environment and creates tasks, and an executor role, where it learns to solve these tasks via reinforcement learning. The process is designed to emulate how human annotators interact with tools to design meaningful tasks. <br>● Code-as-Task (CaT) formulation: Each synthetic task includes an instruction, a Python-based verification function, a working solution, and several failure cases. This structure ensures task quality by filtering out trivial, impossible, or non-verifiable tasks using automatic code execution checks. <br>● Strong results in both distillation and self-improvement: SCA improves the Llama-3.1-8B-Instruct model’s success rate from 12.0% to 23.5% when learning from its own tasks. In the distillation setting (using a 70B teacher), SCA lifts performance to 32.2% Pass@1, outperforming the prior PAE baseline across all tool-use environments. <br>● Human annotation and ablation confirm task quality: Tasks generated with CaT significantly reduce false positives and negatives compared to PAE.
A detailed analysis shows CaT’s filtering removes flawed tasks while retaining diversity when used with stronger models like Llama-3.1-70B. <br>● Scaling and training dynamics: More diverse tasks (not just more trajectories per task) yield better generalization, emphasizing the importance of broad synthetic coverage. Online RL methods like PPO and GRPO can further boost performance, but at higher tuning and compute cost. | [Paper](https://arxiv.org/abs/2506.01716), [Tweet](https://x.com/omarsar0/status/1930748591242424439) |
| 7) AlphaOne  Introduces a universal framework, α1, for modulating the reasoning progress of large reasoning models (LRMs) during inference. Rather than relying on rigid or automatic schedules, α1 explicitly controls when and how models engage in “slow thinking” using a tunable parameter α. The method dynamically inserts “wait” tokens to encourage deeper reasoning and then deterministically ends slow thinking with a “</think>” token to prompt efficient answer generation. This yields better accuracy and efficiency than previous test-time scaling approaches.  Key insights:  <br>● Slow-then-fast reasoning outperforms other strategies: Contrary to human intuition (fast-then-slow), models benefit from beginning with slow reasoning before transitioning to faster inference. This “frontloaded effort” schedule leads to more accurate problem solving. <br>● Dense modulation via α1 boosts accuracy and efficiency: By continuously adjusting reasoning pace via α-scheduled “wait” token insertions, α1 outperforms existing test-time strategies like s1 (monotonic increase) and CoD (monotonic decrease), achieving up to +6.15% accuracy gain while using up to 14% fewer tokens on some benchmarks.
<br>● Linear annealing is the most effective scheduling strategy: Among several functions tested for controlling “wait” insertion (constant, linear increase, exponential anneal, linear anneal), linear annealing, which gradually reduces “wait” token frequency, proved best across multiple models and datasets. <br>● Post-α moment modulation is critical: Simply inserting “wait” tokens leads to inertia in slow thinking. α1 ensures efficient termination by replacing future “wait” tokens with “</think>”, effectively forcing a shift to fast reasoning and boosting performance by up to +20% in some tasks. | [Paper](https://arxiv.org/abs/2505.24863), [Tweet](https://x.com/omarsar0/status/1929551555948400840) |
| 8) Common Pile v0.1  The Common Pile v0.1 is an 8TB dataset of openly licensed text designed for LLM pretraining, addressing legal and ethical concerns of unlicensed data use. Two 7B parameter models trained on it, Comma v0.1-1T and 2T, achieve performance comparable to LLaMA 1 and 2, and the dataset, code, and model checkpoints are all publicly released. | [Paper](https://arxiv.org/abs/2506.05209), [Tweet](https://x.com/AiEleuther/status/1931021637991755906) |
| 9) RewardBench 2  RewardBench 2 is a new multi-skill benchmark for evaluating reward models with more challenging human prompts and stronger correlation to downstream performance. It highlights gaps in current reward models’ effectiveness and aims to support more rigorous evaluation, showing existing models score ~20 points lower than on its predecessor. | [Paper](https://arxiv.org/abs/2506.01937), [Tweet](https://x.com/saumyamalik44/status/1929654864604549348) |
| 10) Memorization in LLMs  This study introduces a method to quantify how much a model memorizes versus generalizes, estimating GPT models have a capacity of ~3.6 bits per parameter.
By training hundreds of models, the authors show that memorization saturates with data before generalization (“grokking”) kicks in, and derive new scaling laws linking capacity, data size, and membership inference. | [Paper](https://www.arxiv.org/abs/2505.24832) |

## Top ML Papers of the Week (May 26 - June 1) - 2025
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) New Lens on RAG Systems  Introduces a new conceptual and empirical framework for analyzing RAG systems through the lens of sufficient context: whether the retrieved content alone enables answering a query. This notion helps decouple retrieval failures from generation errors in LLMs, providing clarity on model behavior under different contextual adequacy.  Key findings:  <br>● New definition and classifier for sufficient context: The authors formalize “sufficient context” as context that plausibly allows answering a query, without requiring ground truth. They develop a high-accuracy LLM-based *autorater* (Gemini 1.5 Pro, 93% accuracy) to label instances as having sufficient or insufficient context, enabling large-scale evaluation without needing ground-truth answers. <br>● Sufficient context ≠ guaranteed correctness: Even when sufficient context is present, state-of-the-art LLMs like GPT-4o, Claude 3.5, and Gemini 1.5 still hallucinate answers more often than they abstain. Conversely, models can sometimes answer correctly despite insufficient context, likely leveraging parametric memory. <br>● Benchmarks contain substantial insufficient context: Analysis of datasets like HotPotQA, Musique, and FreshQA shows that a significant fraction of queries (e.g., >50% in Musique and HotPotQA) lack sufficient context, even with curated or oracle retrieval setups.
<br>● Selective generation improves factuality: The authors propose a “selective RAG” method that combines model self-confidence with the sufficient context autorater to decide whether to answer or abstain. This yields consistent 2–10% gains in correctness (of answered queries) across Gemini, GPT, and Gemma models. <br>● Fine-tuning alone is insufficient: Attempts to fine-tune smaller models like Mistral 3 7B for better abstention (e.g., training them to say “I don’t know” on insufficient examples) modestly increased abstention but often reduced accuracy or failed to meaningfully curb hallucinations. | [Paper](https://arxiv.org/abs/2411.06037), [Tweet](https://x.com/omarsar0/status/1927737131478188295) |
| 2) Open-Ended Evolution of Self-Improving Agents  This work presents the Darwin Gödel Machine (DGM), a system that advances the vision of self-improving AI by combining self-referential code modification with open-ended evolutionary search. Unlike the original Gödel machine, which requires provable benefits for code changes (a practically intractable constraint), the DGM adopts an empirical approach: it modifies its own codebase and evaluates improvements on coding benchmarks.  Key contributions and findings:  <br>● Self-referential self-improvement loop: The DGM starts with a single coding agent that edits its own Python-based codebase to improve its ability to read, write, and execute code using frozen foundation models (FMs). Each modification is evaluated on benchmarks like SWE-bench and Polyglot, with only successful agents retained for further iterations. <br>● Open-ended exploration via evolutionary archive: Inspired by Darwinian evolution, the system maintains an archive of all prior agents and samples parents based on performance and novelty.
This enables exploration beyond local optima and supports continual innovation, including revisiting previously suboptimal variants that become valuable stepping stones later. <br>● Empirical performance gains: Across 80 iterations, DGM boosts coding success on SWE-bench from 20.0% to 50.0% and on Polyglot from 14.2% to 30.7%, outperforming strong baselines that lack either self-improvement or open-endedness. Its best agents match or exceed leading human-designed, open-source coding agents. <br>● Emergent tool and workflow improvements: Through self-improvement, DGM enhances its capabilities by evolving more granular editing tools, retry and evaluation mechanisms, history-aware patch generation, and code summarization for long contexts. <br>● Generalization across models and tasks: Agents discovered by DGM generalize well when transferred across foundation models (e.g., Claude 3.5 to 3.7, o3-mini) and programming languages, demonstrating robust improvements not overfit to a particular setup. <br>● Safety-conscious design: All experiments were sandboxed, monitored, and scoped to confined domains. The paper also discusses how future self-improvement systems could evolve safer, more interpretable behaviors if these traits are part of the evaluation criteria. | [Paper](https://arxiv.org/abs/2505.22954), [Tweet](https://x.com/hardmaru/status/1928284568756629756) |
| 3) An Operating System for Memory-Augmented Generation in LLMs  Introduces MemOS, a unified operating system for managing memory in LLMs, addressing a key limitation in current architectures: their lack of structured, persistent, and governable memory. While today's LLMs rely primarily on parametric memory (model weights) and limited short-term context, MemOS proposes a comprehensive memory lifecycle and management infrastructure designed to support continual learning, behavioral consistency, and knowledge evolution.
Key contributions and components include:  <br>● Three-tier memory taxonomy: MemOS distinguishes between parametric memory (long-term weights), activation memory (short-term runtime states), and plaintext memory (editable, external content). These types are unified through a shared abstraction called the Memory Cube (MemCube), enabling seamless transformation (e.g., plaintext to parametric) and lifecycle governance. <br>● MemCube abstraction: Each MemCube encapsulates memory metadata (creation time, type, access policies, etc.) and a semantic payload (text, tensors, LoRA patches). This enables dynamic scheduling, traceable updates, and interoperability between modules and agents. <br>● Modular OS-style architecture: MemOS consists of three layers: Interface (user/API interaction), Operation (memory scheduling, lifecycle management), and Infrastructure (storage, access governance). These layers work together to manage memory parsing, injection, transformation, and archival. <br>● Closed-loop execution flow: Every interaction (e.g., prompt response) can trigger memory operations governed by scheduling rules and lifecycle policies. Retrieved memory can be injected into generation, stored in archives, or transformed into other types for long-term use. <br>● Vision for a memory-centric future: The paper proposes “memory training” as the next frontier beyond pretraining and finetuning, enabling models that learn continuously. Future work includes cross-model memory sharing, self-evolving memory blocks, and a decentralized memory marketplace. | [Paper](https://arxiv.org/abs/2505.22101), [Tweet](https://x.com/omarsar0/status/1928116365640225222) |
| 4) Building Production-Grade Conversational Agents with Workflow Graphs  This paper presents a pragmatic, production-ready framework for building LLM-powered conversational agents using workflow graphs, with a specific focus on e-commerce scenarios.
Instead of relying solely on end-to-end generation, the authors design agents using a directed acyclic graph (DAG), enabling flexible yet controllable interactions that adhere to strict business rules and format constraints.  Key contributions and findings include:  <br>● Multi-State DAG Framework: Each node in the graph corresponds to a conversational state with its own system prompt, tool access, and execution rules. This structure enables robust constraint handling (e.g., avoiding hallucinated responses or non-compliant suggestions) by localizing logic and formatting within specific graph nodes. <br>● Fine-Tuning via Response Masking: Because conversation turns come from different states in the DAG, the authors introduce a fine-tuning strategy that applies selective loss masking to train LLMs only on responses relevant to a specific node’s context. This prevents prompt conflicts and improves adherence to node-specific constraints. <br>● Real-World Deployment and Results: In a deployment across KakaoTalk and web platforms, the graph-based approach significantly outperformed baseline agents and even GPT-4o across key metrics like task accuracy (+52%) and format adherence (+50%). In human preference tests, their internal model was favored over GPT-4o in 63% of real-world user cases, especially in product recommendation and safety-critical tasks. | [Paper](https://arxiv.org/abs/2505.23006), [Tweet](https://x.com/omarsar0/status/1928492639906607297) |
| 5) Spurious Rewards  This work challenges prevailing assumptions about reinforcement learning with verifiable rewards (RLVR) in mathematical reasoning tasks. The authors show that Qwen2.5-Math models can improve significantly under RL, even when trained with spurious or flawed rewards.
<br>● Surprisingly effective spurious rewards: The Qwen2.5-Math-7B model gains +21.4% accuracy with random rewards, +16.4% with format-based rewards, and +24.6% when explicitly trained on incorrect answers. These are close to the +28.8% gain from ground-truth reward signals, suggesting that RLVR surfaces latent capabilities rather than teaching new reasoning skills. <br>● Model-specific generalization: Spurious rewards fail on other models like Llama3 or OLMo2. Only Qwen models consistently benefit, which the authors attribute to differences in pretraining. Notably, Qwen2.5-Math exhibits a unique “code reasoning” behavior, generating Python-like code to solve problems, which becomes more frequent post-RLVR and correlates strongly with accuracy. <br>● Mechanism behind gains: The authors trace performance improvements to a shift in reasoning strategies. Most of the gain comes from language→code transitions, where the model switches from natural language to code reasoning during RLVR. Interventions that explicitly increase code usage (e.g., rewarding code-like responses or using a code-forcing prompt) boost performance further, but only on Qwen models. <br>● Clipping bias enables learning from noise: Even with random rewards, performance improves due to GRPO’s clipping mechanism, which biases training toward reinforcing the model’s high-probability behaviors. These behaviors (e.g., code reasoning) happen to align with correctness in Qwen models but not in others. | [Paper](https://github.com/ruixin31/Rethink_RLVR/blob/main/paper/rethink-rlvr.pdf), [Tweet](https://x.com/StellaLisy/status/1927392717593526780) |
| 6) Learn to Reason without External Rewards  Proposes a method for training LLMs via reinforcement learning without any external rewards or labeled data.
Instead, it uses the model’s own self-certainty, a confidence measure based on KL divergence from uniform, as the sole intrinsic reward. This self-improvement strategy, part of the broader Reinforcement Learning from Internal Feedback (RLIF) paradigm, bypasses the limitations of Reinforcement Learning with Verifiable Rewards (RLVR), which requires domain-specific verifiers and gold-standard outputs. Key highlights: <br>● INTUITOR matches GRPO without external supervision: When applied to mathematical reasoning tasks like GSM8K and MATH500, INTUITOR achieves performance on par with GRPO (a strong RLVR method), even without using gold solutions. On out-of-domain tasks such as LiveCodeBench and CRUXEval, INTUITOR generalizes better, achieving higher gains than GRPO (+65% vs. 0% and +76% vs. +44%, respectively). <br>● Rapid early learning and enhanced instruction-following: INTUITOR significantly boosts early training performance, particularly on models like Qwen2.5-1.5B, and improves adherence to chat-style instructions, reducing repetitive or nonsensical output. <br>● Emergent structured reasoning: Trained models display spontaneous reasoning even when not explicitly required, often generating explanations or planning steps before producing code or answers. This behavior correlates with better transfer performance to domains like code generation. <br>● Self-certainty as a robust, hack-resistant signal: Unlike fixed reward models prone to exploitation, online self-certainty adapts with the model and avoids reward hacking. INTUITOR-trained models show the strongest correlation between self-certainty and correct answers, confirmed by statistical tests.
| [Paper](https://arxiv.org/abs/2505.19590), [Tweet](https://x.com/xuandongzhao/status/1927270931874910259) |
| 7) Learn to Reason via Mixture-of-Thought  While most prior approaches train with a single modality and only ensemble during inference, this work introduces Mixture-of-Thought (MoT) to jointly train and infer across modalities, resulting in notable gains in logical reasoning performance. Key findings: <br>● Three-modality synergy: MoT uses natural language for interpretability, code for structured procedural reasoning, and truth tables to explicitly enumerate logical cases. Error analysis shows that truth tables significantly reduce common LLM failure modes like missing branches or invalid converses. <br>● Self-evolving training: MoT introduces an iterative, on-policy training loop where the model generates, filters, and learns from its own multi-modal reasoning traces. This joint training outperforms both single-modality and partial-modality setups. <br>● Inference via voting: At test time, MoT generates predictions from each modality and selects the majority answer, leading to robust predictions. Results show up to +11.7pp average accuracy gains on FOLIO and ProofWriter, with 9B models matching GPT-4 + Logic-LM performance. <br>● Stronger on harder tasks: MoT delivers the largest improvements on problems with higher reasoning depth (5–8 steps). It also shows superior test-time scaling, with more diverse and accurate outputs under fixed inference budgets. MoT demonstrates that LLMs can achieve significantly more robust logical reasoning by reasoning like humans (using multiple modes of thought), not just by sampling more from a single modality.
| [Paper](https://arxiv.org/abs/2505.15817), [Tweet](https://x.com/omarsar0/status/1925574200405721210) |
| 8) QwenLong-L1  A new reinforcement learning framework that scales large reasoning models (LRMs) from short to long contexts using progressive context scaling and hybrid rewards. It achieves top performance on seven long-context benchmarks, surpassing models like OpenAI-o3-mini and Qwen3-235B-A22B, and matching Claude-3.7-Sonnet-Thinking, demonstrating strong reasoning with up to 120K token inputs. | [Paper](https://www.arxiv.org/abs/2505.17667) |
| 9) End-to-End Policy Optimization for GUI Agents  ARPO introduces an end-to-end reinforcement learning method for training GUI agents using Group Relative Policy Optimization (GRPO) with experience replay. It significantly improves in-domain performance on the OSWorld benchmark, outperforming baselines by up to 6.7%, while offering modest gains on out-of-domain tasks and enabling self-corrective behaviors through structured reward feedback. | [Paper](https://www.arxiv.org/abs/2505.16282), [Tweet](https://x.com/TsingYoga/status/1926646893175615943) |
| 10) Generalist Agent Enabling Scalable Agentic Reasoning  Proposes Alita, a generalist agent framework that enables scalable agentic reasoning through minimal predefinition and maximal self-evolution. Unlike traditional agents reliant on handcrafted tools, Alita autonomously constructs reusable MCPs (Model Context Protocols) using web search and code synthesis, outperforming more complex systems like OpenAI DeepResearch and OctoTools on GAIA, MathVista, and PathVQA benchmarks.
| [Paper](https://www.arxiv.org/abs/2505.20286) |

## Top ML Papers of the Week (May 19 - May 25) - 2025
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) Visual Planning  Proposes a novel reasoning paradigm that replaces language-based planning with image-based reasoning. The authors argue that language is not always the optimal medium for tasks involving spatial or physical reasoning. They introduce Visual Planning, where reasoning is executed as a sequence of visual states (images) without any text mediation, allowing models to “think” directly in images. This is realized through a reinforcement learning framework called VPRL (Visual Planning via Reinforcement Learning), which trains a vision-only model (LVM-3B) to plan using images. Key contributions and findings: <br>● Visual-only reasoning paradigm: The authors formally define planning as autoregressive visual state generation, trained using image-only data. Unlike multimodal LLMs that map vision to language and reason textually, this approach performs inference entirely in the visual modality, sidestepping the modality gap. <br>● VPRL framework: A two-stage training process is introduced. Stage 1 uses supervised learning on randomly sampled trajectories to ensure format consistency and promote exploration. Stage 2 applies GRPO (Group Relative Policy Optimization) to refine planning behavior via progress-based rewards, avoiding invalid or regressive moves. <br>● Superior performance: On three visual navigation tasks (FrozenLake, Maze, and MiniBehavior), VPRL outperforms language-based models (e.g., Gemini 2.5 Pro, Qwen 2.5-VL) by over 40% in Exact Match scores. It also generalizes better to out-of-distribution tasks (larger grid sizes), with visual planners degrading more gracefully than textual ones.
<br>● Visual planning yields robustness and interpretability: Unlike textual outputs, visual plans enable step-by-step inspection and show stronger adherence to physical constraints. Qualitative examples illustrate how VPRL can avoid invalid moves and recover from non-optimal paths, while language models often hallucinate or misinterpret spatial layouts. <br>● Exploration and invalid action reduction: The random policy initialization in Stage 1 enables better exploration than supervised baselines (VPFT), as evidenced by higher entropy and fewer invalid actions. This leads to a more effective RL stage and ultimately stronger planning capabilities. | [Paper](https://arxiv.org/abs/2505.11409), [Tweet](https://x.com/_yixu/status/1924497238908375072) |
| 2) EfficientLLM  Introduces the first large-scale, empirical benchmark for evaluating efficiency trade-offs in LLMs across architecture, fine-tuning, and inference. Conducted on a high-performance cluster (48×GH200, 8×H200 GPUs), the study evaluates over 100 model–technique pairs spanning 0.5B–72B parameters, using six metrics: memory utilization, compute utilization, latency, throughput, energy consumption, and compression rate. Key insights include: <br>● No one-size-fits-all solution: Every efficiency technique improves some metrics while degrading others. For instance, MoE boosts accuracy and reduces FLOPs but increases VRAM usage by ~40%, while int4 quantization reduces memory and energy by up to 3.9× at a small 3–5% performance cost. <br>● Resource-specific optima: Efficiency depends on context. MQA achieves the best memory-latency trade-off for constrained devices; MLA has the lowest perplexity for high-quality generation; RSLoRA is more efficient than LoRA only for models above 14B parameters.
<br>● Cross-modal transferability: Efficiency techniques like MQA and PEFT generalize well to vision and vision-language models, improving FID scores and maintaining strong trade-offs. <br>● Training and tuning: LoRA and DoRA perform best for small models (1–3B), while RSLoRA excels at large scale (≥14B). Parameter freezing achieves the lowest latency but at a slight cost to accuracy. <br>● Inference: int4 post-training quantization yields the highest compression and throughput gains with minor quality degradation, while bfloat16 consistently outperforms float16 in latency and energy on modern GPUs. | [Paper](https://arxiv.org/abs/2505.13840), [Tweet](https://x.com/omarsar0/status/1925191664475222186) |
| 3) J1  Introduces a novel training approach for LLMs to act as evaluators (LLM-as-a-Judge) by explicitly incentivizing thoughtful reasoning during judgment. Instead of relying solely on prompting or preference fine-tuning, J1 employs online reinforcement learning with verifiable rewards to teach models to think through evaluations systematically. Key insights: <br>● Verifiable framing for judgment: J1 converts both verifiable (e.g., math) and non-verifiable (e.g., user queries) prompts into tasks with verifiable rewards by generating synthetic preference pairs. This reframing enables the use of reinforcement learning and consistent training signals across diverse tasks. <br>● Chain-of-thought-driven RL optimization: J1 trains models to reason through evaluations via explicit thought traces, including outlining evaluation criteria, reference answer generation, and self-comparison before producing judgments. Two model types are trained: Pairwise-J1 (outputs verdicts) and Pointwise-J1 (outputs quality scores). Pairwise-J1 models are further improved by consistency rewards to reduce positional bias.
<br>● Superior performance at scale: J1-Llama-8B and J1-Llama-70B outperform existing 8B and 70B LLM judges across five benchmarks (PPE, RewardBench, RM-Bench, JudgeBench, FollowBenchEval), beating models trained with much more data like DeepSeek-GRM and distillations of DeepSeek-R1. J1-70B even surpasses o1-mini and closes the gap with the much larger R1 model, particularly on non-verifiable tasks. <br>● Pointwise-J1 mitigates positional bias: While pairwise judges can flip verdicts based on response order, Pointwise-J1 (trained only from pairwise supervision) offers position-consistent scoring with fewer ties and better consistency. Both judge types benefit from test-time scaling via self-consistency, further improving reliability. | [Paper](https://arxiv.org/abs/2505.10320), [Tweet](https://x.com/jaseweston/status/1923186392420450545) |
| 4) The Pitfalls of Reasoning for Instruction-Following in LLMs  Explores an unexpected flaw in reasoning-augmented large language models (RLLMs): while chain-of-thought (CoT) prompting often boosts performance on complex reasoning tasks, it can degrade instruction-following accuracy. The authors evaluate 15 models (e.g., GPT, Claude, LLaMA, DeepSeek) on two instruction-following benchmarks and find that CoT prompting consistently reduces performance across nearly all models and datasets. Key findings: <br>● Reasoning hurts instruction adherence: On IFEval, 13 of 14 models saw accuracy drops with CoT; all 15 models regressed on ComplexBench. For example, Meta-LLaMA3-8B’s IFEval accuracy dropped from 75.2% to 59.0% with CoT. Even reasoning-tuned models like Claude3.7-Sonnet-Think performed slightly worse than their base counterparts. <br>● Why reasoning fails: Manual case studies show CoT can help with structural formatting (e.g., JSON or Markdown) and precise lexical constraints (like exact punctuation).
But it often hurts by (a) neglecting simple constraints during high-level content planning and (b) inserting helpful but constraint-violating content (e.g., translations in language-restricted outputs). <br>● Attention-based diagnosis: The authors introduce a constraint attention metric and find that CoT reduces the model's focus on instruction-relevant tokens, especially in the answer generation phase. This diminished constraint awareness correlates with performance drops. <br>● Mitigation strategies: Four techniques are proposed to selectively apply reasoning: | [Paper](https://arxiv.org/abs/2505.11423), [Tweet](https://x.com/omarsar0/status/1924458157444579700) |
| 5) Generalizable AI Predicts Immunotherapy Outcomes Across Cancers and Treatments  Introduces COMPASS, a concept bottleneck-based foundation model that predicts patient response to immune checkpoint inhibitors (ICIs) using tumor transcriptomic data. Unlike prior biomarkers (TMB, PD-L1, or fixed gene signatures), COMPASS generalizes across cancer types, ICI regimens, and clinical contexts with strong interpretability and performance. Key contributions: <br>● Concept Bottleneck Architecture: COMPASS transforms transcriptomic data into 44 high-level immune-related concepts (e.g., T cell exhaustion, IFN-γ signaling, macrophage activity) derived from 132 curated gene sets. This structure provides mechanistic interpretability while enabling pan-cancer modeling. <br>● Pan-Cancer Pretraining and Flexible Fine-Tuning: Trained on 10,184 tumors across 33 cancer types using contrastive learning, and evaluated on 16 ICI-treated clinical cohorts (7 cancers, 6 ICI drugs). COMPASS supports full, partial, linear, and zero-shot fine-tuning modes, making it robust in both data-rich and data-poor settings.
<br>● Superior Generalization and Accuracy: In leave-one-cohort-out testing, COMPASS improved precision by 8.5%, AUPRC by 15.7%, and MCC by 12.3% over 22 baseline methods. It also outperformed in zero-shot settings, across drug classes (e.g., predicting anti-CTLA4 outcomes after training on anti-PD1), and in small-cohort fine-tuning. <br>● Mechanistic Insight into Resistance: Personalized response maps reveal actionable biological mechanisms. For instance, inflamed non-responders show resistance via TGF-β signaling, vascular exclusion, CD4+ T cell dysfunction, or B cell deficiency. These go beyond classical “inflamed/desert/excluded” phenotypes, offering nuanced patient stratification. <br>● Clinical Utility and Survival Stratification: COMPASS-predicted responders had significantly better survival in a held-out phase II bladder cancer trial (HR = 4.7, *p* = 1.7e-7), outperforming standard biomarkers (TMB, PD-L1 IHC, immune phenotype). | [Paper](https://www.medrxiv.org/content/10.1101/2025.05.01.25326820v1) |
| 6) Towards a Deeper Understanding of Reasoning in LLMs  This paper investigates whether LLMs can adapt and reason in dynamic environments, moving beyond static benchmarks. Using the SmartPlay benchmark, a suite of four interactive games that require diverse cognitive skills, the authors evaluate three prompting strategies: self-reflection, heuristic mutation (via an Oracle), and planning. They test these methods across models of varying size (Llama3-8B to Llama3.3-70B) and draw several conclusions on how model scale and prompting interact with task complexity. Key findings: <br>● Model size dominates performance, especially on reactive and structured reasoning tasks. Larger models (e.g., Llama3.3-70B) significantly outperform smaller ones on tasks like Tower of Hanoi and Bandit, where fast exploitation or spatial planning is critical.
<br>● Advanced prompting helps smaller models more, particularly on complex tasks. For example, Llama3-8B with Reflection+Oracle surpasses Llama3.3-70B’s baseline on Rock-Paper-Scissors. However, these strategies introduce high variance and can lead to worse-than-baseline performance depending on the run. <br>● Long prompts hurt smaller models on simple tasks. In Bandit, adding reflective reasoning decreases performance by distracting the model or prolonging exploration. This aligns with prior findings on prompt length and signal-to-noise ratio. <br>● Prompting strategy gains depend on task type. Instruction following improves across all models, while long-text understanding benefits mid-sized models. In contrast, strategies show weak or negative impact on planning, reasoning, and spatial challenges for large models. <br>● Dense reward shaping improves performance more reliably than prompting. In follow-up experiments, modifying sparse reward signals (especially in Hanoi and Messenger) led to more consistent gains than tweaking prompt strategies. | [Paper](https://arxiv.org/abs/2505.10543), [Tweet](https://x.com/omarsar0/status/1924182825693061403) |
| 7) AdaptThink  This paper introduces AdaptThink, an RL framework designed to help reasoning models decide when to use detailed chain-of-thought reasoning (“Thinking”) versus directly producing an answer (“NoThinking”), based on task difficulty. This approach challenges the prevailing assumption that deep reasoning should be applied uniformly across all problems, showing that skipping the “thinking” step often yields better efficiency and even higher accuracy on simpler tasks. Key insights: <br>● NoThinking outperforms Thinking on simple problems: The authors demonstrate that models like DeepSeek-R1 perform better (in both accuracy and efficiency) when using NoThinking mode, an empty `<think></think>` token prompt, for easy problems.
For example, on Level 1 MATH500 problems, NoThinking achieved slightly better accuracy with significantly fewer tokens used. <br>● AdaptThink learns to switch modes: The proposed RL algorithm introduces a constrained optimization that promotes NoThinking as long as accuracy doesn’t degrade. It uses a novel importance sampling strategy to enable cold-start learning of both modes from the beginning, avoiding the collapse into all-Thinking behavior. <br>● Massive gains in efficiency and performance: On GSM8K, MATH500, and AIME 2024, AdaptThink reduced response length by up to 53% and improved accuracy by up to 2.4% over DeepSeek-R1-Distill-Qwen-1.5B. It also outperformed prior methods (e.g., DPOShortest, TLMRE, ModelMerging) in the trade-off between accuracy and response length. <br>● Robustness and generalization: AdaptThink generalizes to out-of-distribution tasks such as MMLU, maintaining or improving accuracy while reducing token usage. It also avoids "implicit thinking" in NoThinking responses, showing controlled behavior during inference. | [Paper](https://arxiv.org/abs/2505.13417) |
| 8) MedBrowseComp  MedBrowseComp is a new benchmark designed to evaluate LLM agents’ ability to perform complex, multi-hop medical fact-finding by browsing real-world, domain-specific web resources. Testing over 1,000 clinically grounded questions, the benchmark reveals major capability gaps in current models, with top systems achieving only 50% accuracy and GUI-based agents performing even worse. | [Paper](https://arxiv.org/abs/2505.14963), [Tweet](https://x.com/shan23chen/status/1925549357308236029) |
| 9) ARC-AGI-2  ARC-AGI-2 is a new benchmark designed to push the boundaries of AI reasoning beyond the original ARC-AGI.
It introduces harder, more unique tasks emphasizing compositional generalization and human-like fluid intelligence, with baseline AI models performing below 5% accuracy despite strong ARC-AGI-1 results. | [Paper](https://arxiv.org/abs/2505.11831), [Tweet](https://x.com/arcprize/status/1924869061542085041) |
| 10) Teaching MLLMs to Think with Images  GRIT is a new method that enables MLLMs to perform grounded visual reasoning by interleaving natural language with bounding box references. Using a reinforcement learning approach (GRPO-GR), GRIT achieves strong reasoning and grounding performance with as few as 20 image-question-answer triplets, outperforming baselines in both accuracy and visual coherence. | [Paper](https://arxiv.org/abs/2505.15879), [Tweet](https://x.com/YFan_UCSC/status/1925719736043569188) |

## Top ML Papers of the Week (May 12 - May 18) - 2025
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) AlphaEvolve  AlphaEvolve is a coding agent developed by Google DeepMind that uses LLM-guided evolution to discover new algorithms and optimize computational systems. It orchestrates a pipeline where LLMs generate code changes, evaluators provide feedback, and an evolutionary loop iteratively improves solutions. AlphaEvolve shows that LLMs can go beyond conventional code generation and assist in scientific and algorithmic discovery. Key highlights: <br>● Novel Algorithm Discovery: AlphaEvolve discovered a new algorithm to multiply 4×4 complex-valued matrices using 48 multiplications, the first improvement over Strassen’s 1969 result (49 multiplications) in this setting. <br>● Broad Mathematical Impact: Applied to 50+ open problems in mathematics, AlphaEvolve matched or exceeded state-of-the-art in ~95% of cases. For example, it improved bounds on Erdős’s minimum overlap problem and kissing numbers in 11 dimensions.
<br>● Infrastructure Optimization at Google: AlphaEvolve improved key components of Google’s compute stack: <br>● Advanced Pipeline Design: AlphaEvolve uses ensembles of Gemini 2.0 Flash and Pro models. It supports rich prompts (past trials, evaluations, explicit context), multi-objective optimization, and evaluation cascades for robust idea filtering. Programs are evolved at full-file scale rather than function-level only, a key differentiator from predecessors like FunSearch. <br>● Ablations Confirm Component Importance: Experiments show that evolution, prompt context, full-file evolution, and using strong LLMs all contribute significantly to performance. Removing any one of these reduces effectiveness. | [Paper](https://storage.googleapis.com/deepmind-DeepMind.com/Blog/alphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms/AlphaEvolve.pdf), [Tweet](https://x.com/GoogleDeepMind/status/1922669321559347498) |
| 2) LLMs Get Lost in Multi-Turn Conversation  Investigates how top LLMs degrade in performance during underspecified, multi-turn interactions, common in real-world usage but rarely evaluated. The authors introduce a novel "sharded simulation" framework that breaks down fully-specified instructions into gradual conversation shards, simulating how users naturally provide information over time. Key findings: <br>● Massive performance drop: Across 15 top LLMs (e.g., GPT-4.1, Gemini 2.5 Pro, Claude 3.7), average performance dropped 39% in multi-turn vs. single-turn settings. Even a two-turn interaction was enough to cause a significant decline. <br>● High unreliability, not just low aptitude: Decomposition shows only a small drop in best-case capability (aptitude) but a 112% increase in unreliability, meaning models are wildly inconsistent depending on how the conversation unfolds.
<br>● Root causes of failure: Through log analysis and experiments, the paper identifies four major issues: <br>● Sharded evaluation tasks: The authors built 600+ multi-turn simulations across 6 tasks (coding, math, SQL, API calls, summarization, and table captioning), showing consistent degradation across domains. <br>● Agent-style interventions only partially help: Techniques like recap and snowballing (repeating all prior turns) improved outcomes by ~15–20% but did not restore single-turn levels, suggesting that model internals, not prompting strategies, are the bottleneck. <br>● Temperature and test-time compute don't solve the issue: Even at temperature 0.0 or with reasoning models (like o3 and DeepSeek-R1), models remained highly unreliable in multi-turn settings. | [Paper](https://arxiv.org/abs/2505.06120), [Tweet](https://x.com/omarsar0/status/1922755721428598988) |
| 3) RL for Reasoning in LLMs with One Training Example  This paper shows that Reinforcement Learning with Verifiable Rewards (RLVR) can significantly improve mathematical reasoning in LLMs even when trained with just a single example. On the Qwen2.5-Math-1.5B model, one-shot RLVR improves accuracy on the MATH500 benchmark from 36.0% to 73.6%, nearly matching performance achieved with over 1,200 examples. Two-shot RLVR (with two examples) even slightly surpasses that, matching results from full 7.5k example training. <br>● Extreme data efficiency: A single training example (π₁₃) boosts MATH500 accuracy to 73.6% and average performance across six math benchmarks to 35.7%, rivaling full-dataset RLVR. Two-shot RLVR goes further (74.8% and 36.6%). <br>● Broad applicability: 1-shot RLVR works not only on Qwen2.5-Math-1.5B, but also on Qwen2.5-Math-7B, Llama3.2-3B-Instruct, and DeepSeek-R1-Distill-Qwen-1.5B. It remains effective across GRPO and PPO RL algorithms.
<br>● Post-saturation generalization: Despite training accuracy saturating early (within 100 steps), test accuracy continues improving well beyond, reaching gains of +10% after 2,000 steps. The model eventually overfits the single example (mixing gibberish into outputs), yet test performance remains stable. <br>● Cross-domain and reflection behavior: A single example from one domain (e.g., geometry) improves performance across others (e.g., number theory). Additionally, models trained with 1-shot RLVR exhibit increased self-reflection (e.g., “rethink”, “recalculate”) and longer output sequences. <br>● Loss function insights: Ablation studies confirm that policy gradient loss is the primary driver of improvements, not weight decay, distinguishing 1-shot RLVR from "grokking". Entropy loss further enhances performance and generalization; even without reward signals, entropy-only training can still yield a 27% performance boost. | [Paper](https://arxiv.org/abs/2504.20571), [Tweet](https://x.com/ypwang61/status/1917596101953348000) |
| 4) AM-Thinking-v1  Introduces a dense, open-source 32B language model that achieves state-of-the-art performance in reasoning tasks, rivaling significantly larger Mixture-of-Experts (MoE) models. Built upon Qwen2.5-32B, the model is trained entirely with public data and showcases how a meticulously crafted post-training pipeline can unlock competitive performance at mid-scale sizes. Key points: <br>● Benchmark performance: AM-Thinking-v1 scores 85.3 on AIME 2024, 74.4 on AIME 2025, and 70.3 on LiveCodeBench, outperforming DeepSeek-R1 (671B MoE) and matching or exceeding Qwen3-32B and Seed1.5-Thinking. On Arena-Hard (general chat), it hits 92.5, near the level of OpenAI o1 and o3-mini but behind Qwen3-235B-A22B and Gemini 2.5 Pro.
<br>● Training pipeline: The model uses a two-stage post-training approach combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL). SFT emphasizes a “think-then-answer” format and uses 2.84M samples, while RL incorporates difficulty-aware sampling and a two-stage curriculum optimized via Group Relative Policy Optimization (GRPO). <br>● Data and filtering: All training data is publicly sourced and heavily filtered. Math data goes through LLM-assisted cleaning and cross-model ground-truth validation. Responses are filtered using perplexity, n-gram repetition, and structural checks to ensure coherence and correctness. <br>● Inference and deployment: The authors implement a custom rollout framework, decoupling rollout from inference via a streaming load balancer. This reduces long-tail latency and increases throughput across distributed GPU nodes, enabling scalable RL training at 32k sequence length. | [Paper](https://arxiv.org/abs/2505.08311), [Tweet](https://x.com/omarsar0/status/1922668488826741061) |
| 5) HealthBench  HealthBench is a benchmark of 5,000 multi-turn health conversations graded against 48,562 rubric criteria written by 262 physicians across 60 countries. Unlike prior multiple-choice evaluations, HealthBench supports open-ended, realistic assessments of LLM responses across diverse health themes (e.g., global health, emergency care, context-seeking) and behavioral axes (accuracy, completeness, communication, context awareness, instruction following). <br>● Significant frontier model gains: HealthBench reveals rapid performance improvements, with GPT-3.5 Turbo scoring 16%, GPT-4o reaching 32%, and o3 achieving 60%. Notably, smaller models like GPT-4.1 nano outperform GPT-4o while being 25x cheaper.
<br>● Two challenging benchmark variants: HealthBench Consensus focuses on 34 physician-validated criteria (e.g., recognizing emergencies), while HealthBench Hard isolates 1,000 difficult examples on which no model scores above 32%, establishing headroom for future progress. <br>● Physician comparison baseline: Surprisingly, LLMs like o3 and GPT-4.1 often produce higher-quality responses than unassisted physicians. When provided with model responses as references, physicians improved older model completions but couldn’t improve completions from newer models. <br>● Reliable model-based grading: Meta-evaluation shows GPT-4.1 as a grader achieves macro F1 scores comparable to physicians. On average, its agreement with other doctors places it in the 51st–88th percentile across themes like emergency triage, communication, and uncertainty handling. <br>● Safety-relevant insights: The benchmark assesses worst-case performance using "worst-at-k" scores, showing that even the best models have reliability gaps. For example, o3’s worst-at-16 score drops by a third from its average, underscoring the need for further safety work. | [Paper](https://cdn.openai.com/pdf/bd7a39d5-9e9f-47b3-903c-8b847ca650c7/healthbench_paper.pdf), [Tweet](https://x.com/OpenAI/status/1921983050138718531) |
| 6) Nemotron-Research-Tool-N1  Introduces Tool-N1, a family of tool-using LLMs trained using a rule-based reinforcement learning (R1-style RL) approach, without reliance on supervised reasoning trajectories. The key idea is to enable models to learn to invoke external tools correctly through binary feedback based on functional correctness and format adherence, rather than step-by-step imitation. <br>● Rule-based RL over SFT: Tool-N1 models are trained using a lightweight binary reward that only evaluates whether the model's tool calls are structurally correct and functionally valid.
This allows the model to develop its reasoning process, sidestepping the limitations of mimicking distilled trajectories via supervised fine-tuning (SFT). <br>● Strong benchmark results: Tool-N1-7B and Tool-N1-14B outperform GPT-4o and domain-specialized models on several benchmarks, including BFCL, API-Bank, and ACEBench. For example, Tool-N1-14B beats GPT-4o on BFCL overall (85.97 vs 83.97) and achieves +5% over GPT-4o on API-Bank. <br>● Pure RL outperforms SFT-then-RL: A systematic comparison on 5,518 distilled trajectories shows that pure RL yields better results than the SFT-then-RL pipeline, challenging the dominant paradigm. For instance, 100% RL achieves 83.24% average vs. 83.17% for SFT+RL. <br>● Binary reward vs. fine-grained reward: Ablation studies reveal that strict binary rewards (requiring correct reasoning format and exact tool call) lead to better generalization than partial credit schemes, especially on realistic “Live” data (80.38% vs 76.61%). <br>● Scaling and generalization: Performance scales well with model size, with the most gains observed in larger models. The method generalizes across backbones, with Qwen2.5-Instruct outperforming LLaMA3 variants at the same scale. | [Paper](https://arxiv.org/abs/2505.00024), [Tweet](https://x.com/ShaokunZhang1/status/1922105694167433501) |
| 7) RL for Search-Efficient LLMs  Proposes a new RL-based framework (SEM) that explicitly teaches LLMs when to invoke search and when to rely on internal knowledge, aiming to reduce redundant tool use while maintaining answer accuracy. Key points: <br>● Motivation & Setup: LLMs often overuse external search even for trivial queries.
SEM addresses this by using a balanced training dataset (MuSiQue for unknowns, MMLU for knowns) and a structured format (`<think>`, `<answer>`, `<search>`, `<result>`) to train the model to distinguish between situations where search is necessary and where it is not. <br>● Reward Optimization: The authors employ Group Relative Policy Optimization (GRPO) to compare outputs within query groups. The reward function penalizes unnecessary search and rewards correct answers, either without search or with efficient search-and-reasoning when needed. <br>● Experimental Results: On HotpotQA and MuSiQue, SEM significantly outperforms Naive RAG and ReSearch, achieving higher EM and LLM-Judged (LJ) accuracy with smarter search ratios. On MMLU and GSM8K (where search is often unnecessary), SEM maintains high accuracy while invoking search far less than baseline methods (e.g., 1.77% SR vs 47.98% for Naive RAG on MMLU). <br>● Case Study & Efficiency: SEM avoids absurd search behavior like querying "What is 1+1?" multiple times. It also uses fewer but more targeted searches for unknowns, enhancing both interpretability and computational efficiency. Training dynamics further show that SEM enables faster and more stable learning than prior methods. | [Paper](https://arxiv.org/abs/2505.07903), [Tweet](https://x.com/omarsar0/status/1922665313117552664) |
| 8) Cost-Efficient, Low-Latency Vector Search  Integrates DiskANN (a vector indexing library) into Azure Cosmos DB NoSQL (an operational database), using a single vector index per partition stored in existing index trees. Benefits: it supports <20 ms query latency over an index spanning 10 million vectors, has stable recall over updates, and offers nearly 15× and 41× lower query cost compared to the Zilliz and Pinecone serverless enterprise products. It can further scale to billions of vectors with automatic partitioning.
| [Paper](https://arxiv.org/abs/2505.05885), [Tweet](https://x.com/omarsar0/status/1921938925142384736) |
| 9) AI Agents vs. Agentic AI  This review paper distinguishes AI Agents from Agentic AI, presenting a structured taxonomy and comparing their architectures, capabilities, and challenges. AI Agents are defined as modular, task-specific systems powered by LLMs and tools, while Agentic AI represents a shift toward multi-agent collaboration, dynamic task decomposition, and orchestrated autonomy. Applications and challenges are mapped out for both paradigms, along with proposed solutions like RAG, orchestration layers, and causal modeling. | [Paper](https://arxiv.org/abs/2505.10468), [Tweet](https://x.com/omarsar0/status/1923817691455873420) |
| 10) CellVerse  Introduces a benchmark to evaluate LLMs on single-cell biology tasks by converting multi-omics data into natural language. While generalist LLMs like the DeepSeek and GPT-4 families show some reasoning ability, none significantly outperform random guessing on key tasks like drug response prediction, exposing major gaps in current LLMs' biological understanding. | [Paper](https://arxiv.org/abs/2505.07865), [Tweet](https://x.com/omarsar0/status/1922662317986099522) |

## Top ML Papers of the Week (May 5 - May 11) - 2025
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) The Leaderboard Illusion  The Leaderboard Illusion investigates systemic distortions in how the Chatbot Arena leaderboard evaluates LLMs, arguing that current practices undermine fair model comparison and scientific progress.
Through extensive data analysis covering 2M Arena battles, the authors identify four key issues distorting rankings: <br>● Selective score reporting through private testing: Some providers (notably Meta, Google, and OpenAI) are allowed to test dozens of model variants privately and only publish the best-performing one. This violates the unbiased sampling assumption of the Bradley-Terry (BT) model, which powers Arena rankings. Simulations show that testing just 10 variants can artificially inflate a model's Arena score by ~100 points. <br>● Extreme data asymmetries: Proprietary models are oversampled compared to open-weight and open-source models. OpenAI and Google alone received over 39% of all Arena data, while 83 open-weight models collectively received only 29.7%. These data advantages translate into significant performance gains: a model trained on 70% Arena data outperforms its baseline by 112% on the ArenaHard benchmark. <br>● Unfair and opaque deprecations: 205 models were silently removed from the leaderboard despite only 47 being officially marked as deprecated. Open-source models are disproportionately affected, breaking the comparison graph and violating BT model assumptions, leading to unreliable rankings. <br>● Overfitting to Arena-specific dynamics: Due to partial prompt repetition and distributional drift over time, access to Arena data allows providers to tune models specifically for Arena performance. This leads to high win rates on Arena benchmarks, but not on out-of-distribution tasks like MMLU, where gains diminish or reverse. | [Paper](https://arxiv.org/abs/2504.20879) |
| 2) Llama-Nemotron  NVIDIA introduces the Llama-Nemotron model series, LN-Nano (8B), LN-Super (49B), and LN-Ultra (253B), a family of open, efficient, and high-performing reasoning models.
These models rival or outperform DeepSeek-R1 on various benchmarks while offering significantly better inference throughput and memory efficiency. LN-Ultra is rated the most "intelligent" open model by Artificial Analysis. A key innovation is a dynamic reasoning toggle ("detailed thinking on/off") that allows users to control reasoning behavior at inference time. Highlights: <br>● Multi-stage training: Models were built via neural architecture search (Puzzle), knowledge distillation, continued pretraining, supervised fine-tuning (SFT), and large-scale RL. LN-Ultra is enhanced with FP8 inference and FFN Fusion for speed and scalability. <br>● Reasoning toggle: The models can switch between reasoning and non-reasoning modes via a simple prompt instruction, making them adaptable to various use cases. <br>● Synthetic dataset: Over 33M examples across math, code, science, and instruction following were curated, with reasoning-mode samples tagged explicitly. LN-Ultra's training used curriculum RL and GRPO to surpass its teachers on benchmarks like GPQA-D. <br>● Evaluation dominance: LN-Ultra outperforms DeepSeek-R1 and Llama-3.1-405B on reasoning tasks like AIME25, MATH500, and GPQA-Diamond while also achieving strong chat alignment scores (Arena-Hard: 87.0). LN-Super scores 88.3, beating Claude 3.5 and GPT-4o. NVIDIA provides the weights, training code (NeMo, Megatron-LM, NeMo-Aligner), and the full post-training dataset under a permissive license, aiming to push open research in reasoning models. | [Paper](https://arxiv.org/abs/2505.00949v1), [Models](https://huggingface.co/nvidia) |
| 3) Absolute Zero  Introduces an LLM training framework that eliminates the need for human-curated data. Key highlights: <br>● It learns to propose and solve its own reasoning tasks entirely through self-play, guided by verifiable feedback from an execution environment.
This zero-data RLVR (RL with Verifiable Rewards) setting achieves SOTA coding and math reasoning performance. <br>● AZR learns by generating its own code-based reasoning tasks using three core reasoning modes (deduction, abduction, and induction), validating solutions via Python execution rather than human labels. <br>● A single LLM plays both roles, proposing new tasks based on learnability and solving them with feedback-based reinforcement. Rewards favor moderately difficult tasks to maximize the learning signal. <br>● Despite using zero in-domain examples, AZR outperforms all previous zero-setting models by +1.8 points on average and even beats models trained on tens to hundreds of thousands of curated samples. AZR-Coder-7B achieves the highest average score across all tested models. <br>● AZR trained in a coding-only environment improves mathematical reasoning performance by up to +15.2 points, far more than expert code models trained with RLVR, showing strong generalization. <br>● Larger AZR models (3B → 7B → 14B) consistently show greater improvements, confirming scalability and suggesting promise for even larger models. <br>● AZR develops natural ReAct-like intermediate planning in code (e.g., interleaved comments and logic), trial-and-error strategies in abduction, and systematic state tracking, behaviors typically observed in much larger models. <br>● Llama-3.1-8B variants of AZR sometimes produce concerning reasoning chains (dubbed "uh-oh moments"), highlighting the importance of safety-aware training in autonomous systems. | [Paper](https://arxiv.org/abs/2505.03335), [Tweet](https://x.com/AndrewZ45732491/status/1919920459748909288) |
| 4) Discuss-RAG  This paper introduces Discuss-RAG, a plug-and-play agent-based framework that enhances retrieval-augmented generation (RAG) for medical question answering by mimicking human-like clinical reasoning.
Standard RAG systems rely on embedding-based retrieval and lack mechanisms to verify relevance or logical coherence, often leading to hallucinations or outdated answers. Discuss-RAG addresses these gaps via a modular agent setup that simulates multi-turn medical discussions and performs post-retrieval verification. Key ideas: <br>● Multi-agent collaboration: A summarizer agent orchestrates a team of medical domain experts who iteratively refine a contextual summary through simulated brainstorming, providing deeper and more structured information to guide retrieval. <br>● Decision-making agent: After retrieval, a verifier and a decision-making agent assess snippet quality and trigger fallback strategies when relevance is low, improving answer accuracy and contextual grounding. <br>● Plug-and-play design: Discuss-RAG is training-free and modular, allowing easy integration into existing RAG pipelines. <br>● Strong performance gains: Across four benchmarks, Discuss-RAG outperforms MedRAG with substantial accuracy improvements, notably +16.67% on BioASQ and +12.20% on PubMedQA. | [Paper](https://arxiv.org/abs/2504.21252) |
| 5) The Value of RL in Fine-Tuning  This work shows that, in theory, every popular preference-fine-tuning objective collapses to maximum-likelihood estimation (MLE), yet experiments show a consistent RL advantage on real tasks. The authors reconcile this gap with a generation-verification complexity hypothesis. <br>● Theory: RLHF ≈ MLE – Under mild assumptions, trajectory-level RLHF, DPO, and related algorithms are equivalent to projecting the data back to likelihood space, so expending compute on on-policy sampling should be unnecessary.
<br>● Empirics contradict naïve theory – On the tl;dr summarization benchmark with Pythia-1.4B/2.8B, a single online-DPO iteration lifts win rate by 6–10 points over offline DPO despite identical data, model, and optimizer, confirming that RL can add real value. <br>● Takeaways – RL helps when crafting a good answer is harder than checking one. The gap vanishes on two-word summaries (horizon = 1) or when ROUGE-L is used as the reward. RL acts as a shortcut through policy space only when the reward model is simpler than the policy it trains. For tasks where verification is as hard as generation, offline likelihood-based fine-tuning suffices, guiding practitioners on when RLHF is worth its extra cost. | [Paper](https://arxiv.org/abs/2503.01067) |
| 6) WebThinker  This paper introduces a reasoning-agent framework that equips large reasoning models (LRMs) with autonomous web exploration and report-writing abilities to overcome the limitations of static internal knowledge. WebThinker integrates a Deep Web Explorer module and an Autonomous Think-Search-and-Draft strategy that lets models search the web, reason through tasks, and generate comprehensive outputs simultaneously. It also incorporates an RL-based training loop using online DPO to improve tool usage. The system supports two modes: complex problem solving and scientific report generation. Key points: <br>● Superior performance in complex reasoning: On GPQA, GAIA, WebWalkerQA, and HLE, WebThinker-32B-RL achieved new state-of-the-art results among 32B models, outperforming both retrieval-augmented and proprietary systems like GPT-4o and DeepSeek-R1-671B. For example, it reached 70.7% on GPQA and 15.8% on HLE, with gains of up to +21.5% over baselines.
<br>● Best-in-class scientific report writing: On the Glaive dataset, WebThinker outperformed Gemini 2.0 Deep Research and Grok3 DeeperSearch, scoring 8.1 on average across quality metrics such as completeness and coherence. <br>● RL refinement matters: The RL-trained version outperformed its base counterpart across all benchmarks, showing that iterative preference-based learning significantly enhances reasoning-tool coordination. <br>● Ablation validates design: Removing components like the Deep Web Explorer or automatic report drafting significantly degraded performance, confirming their necessity. | [Paper](https://arxiv.org/abs/2504.21776) |
| 7) Reward Modeling as Reasoning  This work proposes a new class of reward models, called ReasRMs, that reformulate reward modeling as a reasoning task. The authors introduce RM-R1, a family of generative reward models that produce interpretable reasoning traces and rubrics during preference judgments. Instead of relying on scalar scores or shallow generation, RM-R1 models leverage structured reasoning and reinforcement learning to improve both interpretability and performance across benchmarks. <br>● RM-R1 adopts a two-stage training process: (1) distillation of reasoning traces from stronger models, and (2) reinforcement learning with verifiable rewards. The Chain-of-Rubrics (CoR) prompting framework guides the model to either solve reasoning problems or generate evaluation rubrics depending on the task type (reasoning or chat). <br>● On RewardBench, RM-Bench, and RMB, RM-R1 models achieve state-of-the-art or near-SOTA performance, outperforming models like GPT-4o and Llama3.1-405B by up to 13.8% despite using fewer parameters and less data. <br>● Ablation studies show that cold-start RL alone is insufficient; task-type classification and high-quality distillation are key.
RM-R1's distilled warm-start training leads to more stable learning and longer, more accurate reasoning traces. <br>● RM-R1 also shows strong generalization across domains and better rubric quality than baseline methods, especially in sensitive contexts like safety and medical judgment. The authors open-sourced six RM-R1 models, training data, and code to support reproducibility. | [Paper](https://arxiv.org/abs/2505.02387) |
| 8) Paper2Code  Introduces PaperCoder, a multi-agent LLM framework that transforms ML papers into full code repositories without relying on pre-existing implementations. <br>● PaperCoder decomposes the code generation process into three stages: Planning (roadmap, architecture, file dependencies, config files), Analyzing (file-specific logic extraction), and Coding (dependency-aware file generation). Each step is handled by specialized LLM agents. <br>● It is evaluated using both the proposed Paper2Code benchmark (90 papers from ICML, NeurIPS, and ICLR 2024) and PaperBench Code-Dev. Results show PaperCoder outperforms ChatDev, MetaGPT, and naive baselines across reference-based, reference-free, and human evaluations. <br>● In human assessments by the original paper authors, 77% chose PaperCoder's output as the best implementation, and 85% said it helped them reproduce their work. On average, only 0.48% of code lines required changes for executability. <br>● A detailed ablation study shows consistent performance gains from each stage, especially logic design and file-dependency ordering. PaperCoder, using the o3-mini-high backbone, notably outperforms other LLM variants. | [Paper](https://arxiv.org/abs/2504.17192) |
| 9) ZeroSearch  ZeroSearch is an RL framework that trains LLMs to develop search capabilities without using real search engines.
It uses simulated LLM-generated documents with a curriculum-based degradation strategy and outperforms real-search methods like Search-R1 in both performance and cost, achieving better QA accuracy across multiple benchmarks. | [Paper](https://arxiv.org/abs/2505.04588), [Tweet](https://x.com/omarsar0/status/1920469148968362407) |
| 10) Practical Efficiency of Muon for Pretraining  Discusses how Muon, a simple second-order optimizer, outperforms AdamW in large-batch pretraining by expanding the compute-time Pareto frontier and maintaining better data efficiency. Combined with muP scaling and a novel telescoping algorithm for hyperparameter transfer, it enables faster training with minimal tuning overhead up to 4B-parameter models. | [Paper](https://arxiv.org/abs/2505.02222) |

## Top ML Papers of the Week (April 28 - May 4) - 2025
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) Phi-4-Mini-Reasoning  Microsoft released Phi-4-Mini-Reasoning to explore small reasoning language models for math. Highlights: <br>● Phi-4-Mini-Reasoning: The paper introduces Phi-4-Mini-Reasoning, a 3.8B-parameter small language model (SLM) that achieves state-of-the-art mathematical reasoning performance, rivaling or outperforming models nearly twice its size. <br>● Unlocking reasoning: The authors use a systematic, multi-stage training pipeline to unlock strong reasoning capabilities in compact models, addressing the challenges posed by their limited capacity. It uses large-scale distillation, preference learning, and RL with verifiable rewards. <br>● Four-stage training pipeline: The model is trained using (1) mid-training with large-scale long CoT data, (2) supervised fine-tuning on high-quality CoT data, (3) rollout-based Direct Preference Optimization (DPO), and (4) RL using verifiable reward signals.
<br>● Math performance: On MATH-500, Phi-4-Mini-Reasoning reaches 94.6%, surpassing DeepSeek-R1-Distill-Qwen-7B (91.4%) and DeepSeek-R1-Distill-Llama-8B (86.9%), despite being smaller. <br>● Verifiable-reward reinforcement learning: The final RL stage, tailored for small models, includes prompt filtering, oversampling for balanced training signals, and temperature annealing. This improves training stability and aligns exploration with evaluation conditions. <br>● Massive synthetic data generation: The model is mid-trained on 10M CoT rollouts generated by DeepSeek-R1, filtered for correctness using math verifiers and GPT-4o-mini, and categorized by domain and difficulty to ensure broad generalization. <br>● Ablation study: Each phase of the pipeline shows clear gains. Notably, fine-tuning and RL each deliver ~5–7 point improvements after mid-training and DPO, showing the value of the full pipeline over isolated techniques. | [Paper](https://arxiv.org/abs/2504.21233), [Tweet](https://x.com/omarsar0/status/1917954418173247909) |
| 2) Building Production-Ready AI Agents with Scalable Long-Term Memory  This paper proposes a memory-centric architecture for LLM agents to maintain coherence across long conversations and sessions, solving the fixed-context-window limitation. Main highlights: <br>● The solution introduces two systems: Mem0, a dense, language-based memory system, and Mem0g, an enhanced version with graph-based memory to model complex relationships. Both aim to extract, consolidate, and retrieve salient facts over time efficiently. <br>● Mem0: Uses a two-stage architecture (extraction & update) to maintain salient conversational memories. It detects redundant or conflicting information and manages updates using tool calls, resulting in a lightweight, highly responsive memory store (7K tokens per conversation).
<br>● Mem0g: By structuring memory as a knowledge graph of entities and relationships, Mem0g improves performance on tasks needing temporal and relational reasoning (e.g., event ordering, preference tracking) while maintaining reasonable latency and memory cost (14K tokens per conversation). <br>● Benchmarking on LOCOMO: Both systems were evaluated against six memory-system baselines (e.g., A-Mem, OpenAI, Zep, LangMem, RAG). Mem0g achieves the best overall LLM-as-a-Judge (J) score of 68.44%, outperforming all RAG and memory baselines by 7–28% in J and reducing p95 latency by 91% over full-context methods. <br>● Latency and efficiency: Mem0 achieves the lowest search and total latencies (p95 = 1.44s), and Mem0g still outperforms other graph-based and RAG systems by large margins in speed and efficiency, making both well suited to real-time deployments. <br>● Use-case strengths: Mem0 and Mem0g offer a scalable memory architecture that improves long-term LLM agents' factual recall, reasoning depth, and efficiency. | [Paper](https://arxiv.org/abs/2504.19413), [Tweet](https://x.com/omarsar0/status/1917247776221700134) |
| 3) UniversalRAG  UniversalRAG is a framework that overcomes the limitations of existing RAG systems confined to single modalities or corpora. It supports retrieval across modalities (text, image, video) and at multiple granularities (e.g., paragraph vs. document, clip vs. video). Contributions: <br>● Modality-aware routing: To counter modality bias in unified embedding spaces (where queries often retrieve same-modality results regardless of relevance), UniversalRAG introduces a router that dynamically selects the appropriate modality (e.g., image vs. text) for each query. <br>● Granularity-aware retrieval: Each modality is broken into granularity levels (e.g., paragraphs vs. documents for text, clips vs. full-length videos).
This allows queries to retrieve content that matches their complexity: factual queries use short segments, while complex reasoning accesses long-form data. <br>● Flexible routing: It supports both training-free (zero-shot GPT-4o prompting) and trained (T5-Large) routers. Trained routers perform better on in-domain data, while GPT-4o generalizes better to out-of-domain tasks. An ensemble router combines both for robust performance. <br>● Performance: UniversalRAG outperforms modality-specific and unified RAG baselines across 8 benchmarks spanning text (e.g., MMLU, SQuAD), image (WebQA), and video (LVBench, VideoRAG). With T5-Large, it achieves the highest average score across modalities. <br>● Case study: In WebQA, UniversalRAG correctly routes a visual query to the image corpus (retrieving an actual photo of the event), while TextRAG and VideoRAG fail. Similarly, on HotpotQA and LVBench, it chooses the right granularity, retrieving documents or short clips. Overall, this is a great paper showing the importance of considering modality and granularity in a RAG system. | [Paper](https://arxiv.org/abs/2504.20734), [Tweet](https://x.com/omarsar0/status/1917637837295608180) |
| 4) DeepSeek-Prover-V2  DeepSeek-Prover-V2 is a 671B LLM that significantly advances formal theorem proving in Lean 4. The model is built through a novel cold-start training pipeline that combines informal chain-of-thought reasoning with formal subgoal decomposition, enhanced through reinforcement learning. It surpasses the prior state of the art on multiple theorem-proving benchmarks. Key highlights: <br>● Cold-start data via recursive decomposition: The authors prompt DeepSeek-V3 to generate natural-language proof sketches, decompose them into subgoals, and formalize these steps in Lean with `sorry` placeholders.
A 7B prover model then recursively fills in the subgoal proofs, enabling efficient construction of complete formal proofs and training data. <br>● Curriculum learning + RL: A subgoal-based curriculum trains the model on increasingly complex problems. Reinforcement learning with a consistency reward is used to enforce alignment between proof structure and CoT decomposition, improving performance on complex tasks. <br>● Dual proof-generation modes: The model is trained in two modes, non-CoT (efficient, minimal proofs) and CoT (high-precision, interpretable). The CoT mode yields significantly better performance, particularly on hard problems. <br>● Benchmark results: | [Paper](https://arxiv.org/abs/2504.21801), [Tweet](https://x.com/zhs05232838/status/1917600755936018715) |
| 5) Kimi-Audio  Kimi-Audio is a new open-source audio foundation model built for universal audio understanding, generation, and speech conversation. The model architecture uses a hybrid of discrete semantic audio tokens and continuous Whisper-derived acoustic features. It is initialized from a pre-trained LLM and trained on 13M+ hours of audio spanning speech, sound, and music. It also supports a streaming detokenizer with chunk-wise decoding and a novel look-ahead mechanism for smoother audio generation. Extensive benchmarking shows that Kimi-Audio outperforms other audio LLMs across multiple modalities and tasks. Key highlights: <br>● Architecture: Kimi-Audio uses a 12.5Hz semantic tokenizer and an LLM with dual heads (text + audio), processing hybrid input (discrete + continuous). The audio detokenizer employs a flow-matching upsampler with a BigVGAN vocoder for real-time speech synthesis. <br>● Massive training corpus: Pretrained on 13M+ hours of multilingual, multimodal audio.
A rigorous preprocessing pipeline adds speech enhancement, diarization, and transcription using Whisper and Paraformer-Zh. Fine-tuning uses 300K+ hours from 30+ open datasets. <br>● Multitask training: Training spans audio-only, text-only, ASR, TTS, and three audio-text interleaving strategies. Fine-tuning is instruction-based, with both audio and text instructions injected via zero-shot TTS. <br>● Evaluation: On ASR (e.g., LibriSpeech test-clean: 1.28 WER), audio understanding (CochlScene: 80.99), and audio-to-text chat (OpenAudioBench avg: 69.8), Kimi-Audio sets new SOTA results, beating Qwen2.5-Omni and Baichuan-Audio across the board. | [Paper](https://github.com/MoonshotAI/Kimi-Audio/blob/master/assets/kimia_report.pdf), [Tweet](https://x.com/Kimi_Moonshot/status/1915807071960007115), [Model](https://github.com/MoonshotAI/Kimi-Audio) |
| 6) MiMo-7B  Xiaomi releases MiMo-7B, a new language model explicitly designed for advanced reasoning across math and code. Highlights: <br>● MiMo-7B: MiMo-7B narrows the capability gap with larger 32B-class models through careful pretraining and post-training. MiMo-7B-Base is trained from scratch on 25T tokens, with a 3-stage mixture skewed toward mathematics and code (70% in stage 2). <br>● Pre-training: The team improves HTML and PDF extraction to better preserve STEM data, leverages LLMs to generate diverse synthetic reasoning content, and adds a Multi-Token Prediction (MTP) objective that boosts both quality and inference speed. <br>● Base performance: MiMo-7B-Base outperforms other 7B–9B models like Qwen2.5, Gemma-2, and Llama-3.1 across BBH (+5 pts), AIME24 (+22.8 pts), and LiveCodeBench (+27.9 pts). On BBH and LiveCodeBench, it even beats larger models on reasoning-heavy tasks.
<br>● RL: MiMo-7B-RL is trained with a test-difficulty-driven reward function and easy-data resampling to tackle sparse-reward issues and instabilities. In some cases, it surpasses o1-mini on math and code. RL from the SFT model reaches higher ceilings than RL-Zero from the base model. <br>● Efficient infrastructure: A Seamless Rollout Engine accelerates RL training by 2.29× and validation by 1.96× using continuous rollout, async reward computation, and early termination. MTP layers enable fast speculative decoding, with 90%+ acceptance rates in inference. | [Paper](https://github.com/XiaomiMiMo/MiMo/blob/main/MiMo-7B-Technical-Report.pdf), [Tweet](https://x.com/omarsar0/status/1917582720341008814) |
| 7) Advances and Challenges in Foundation Agents  A new survey frames intelligent agents with a modular, brain-inspired architecture that integrates ideas from cognitive science, neuroscience, and computational research. Key topics covered: <br>● Human brain and LLM agents: Clarifies what differentiates LLM agents from human cognition and what inspiration we can draw from the way humans learn and operate. <br>● Definitions: Provides a detailed, formal definition of what makes up an AI agent. <br>● Reasoning: A deep dive into reasoning, one of the key development areas for AI agents, which unlocks capabilities like planning, multi-turn tool use, and backtracking. <br>● Memory: Agent memory remains a challenging part of building agentic systems, but there is already a solid body of literature to draw inspiration from. <br>● Action systems: Very complex agentic systems can already be built today, but the next frontier is agents that take actions and make decisions in the real world.
We need better tooling, better training algorithms, and robust operation across different action spaces. <br>● Self-evolving agents: For now, building effective agentic systems requires human effort and careful optimization. One of the bigger opportunities in the field is building AI that can itself build powerful, self-improving AI systems. | [Paper](https://arxiv.org/abs/2504.01990), [Tweet](https://x.com/omarsar0/status/1916542394746421333) |
| 8) MAGI  MAGI is a multi-agent system designed to automate structured psychiatric interviews by operationalizing the MINI (Mini International Neuropsychiatric Interview) protocol. It involves four specialized agents: navigation, question generation, judgment, and diagnosis. Other highlights: <br>● Multi-agent clinical workflow: MAGI is built with a navigation agent (interview flow control), a question agent (dynamic, empathetic probing), a judgment agent (response validation), and a diagnosis agent using Psychometric CoT to trace diagnoses explicitly to MINI/DSM-5 criteria. <br>● Explainable reasoning (PsyCoT): Instead of treating diagnoses as opaque outputs, PsyCoT decomposes psychiatric reasoning into symptom anchoring, syndromal validation, and evidence binding, making each diagnostic conclusion auditable. CoT put to great use. <br>● Results: Evaluated on 1,002 real-world interviews, MAGI outperforms baselines (direct prompting, role-play, knowledge-enhanced, and MINI-simulated LLMs) on relevance, accuracy, completeness, and guidance. <br>● Strong clinical agreement: Diagnostic evaluations show PsyCoT consistently improves F1 scores, accuracy, and Cohen's κ across disorders like depression, generalized anxiety, social anxiety, and suicide risk, reaching clinical-grade reliability (κ > 0.8) in high-risk tasks.
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.18260), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1916862752410554423) |\n| 9) A Survey of Efficient LLM Inference Serving  This survey reviews recent advancements in optimizing LLM inference, addressing memory and computational bottlenecks. It covers instance-level techniques (like model placement and request scheduling), cluster-level strategies (like GPU deployment and load balancing), and emerging scenario-specific solutions, concluding with future research directions. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.19720) |\n| 10) LLM for Engineering  This work finds that when RL is used, a 7B parameter model outperforms both SoTA foundation models and human experts at high-powered rocketry design. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.19394) |\n\n## Top ML Papers of the Week (April 21 - April 27) - 2025\n| **Paper**  | **Links** | \n| ------------- | ------------- |\n| 1) Does RL Incentivize Reasoning in LLMs Beyond the Base Model?  This paper revisits a key assumption in recent LLM development: that Reinforcement Learning with Verifiable Rewards (RLVR) helps models acquire genuinely new reasoning capabilities. By analyzing models across tasks (math, code, vision) using pass@k metrics (with large k), the authors find that RLVR improves sample efficiency but does not expand reasoning capacity beyond the base model.   \u003Cbr>● Key insight: RLVR-trained models do better at low *k* (e.g.,  pass@1), but as *k* increases (up to 256 or more), base models  eventually match or outperform them. This suggests RLVR doesn’t  generate fundamentally new reasoning paths but just increases the  likelihood of sampling already-existing correct ones.   \u003Cbr>● Reasoning already in the base: RLVR models'  successful CoTs are shown to be present within the base model's  sampling distribution. 
Perplexity analyses confirm that RL outputs are often high-probability continuations for the base model. \u003Cbr>● Efficiency vs. exploration: RLVR narrows the model’s exploration space, improving efficiency but shrinking its coverage of diverse reasoning paths, thereby reducing overall problem-solving reach at scale. \u003Cbr>● Distillation helps more: Unlike RLVR, distillation from a stronger teacher model (e.g., DeepSeek-R1) introduces genuinely new reasoning patterns, expanding the model’s capabilities. \u003Cbr>● Algorithmic limits: Across PPO, GRPO, Reinforce++, etc., RL algorithms offer similar sample-efficiency improvements, but none closes the gap to the base model’s pass@256, highlighting the limits of current RL strategies. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.13837), [Tweet](https:\u002F\u002Fx.com\u002FDaveShapi\u002Fstatus\u002F1915408405201629684) |\n| 2) BitNet b1.58 2B4T  This work introduces BitNet b1.58 2B4T, the first open-source, natively trained 1-bit LLM at the 2B parameter scale, achieving strong performance while being extremely efficient. The model uses a custom ternary quantization scheme (1.58 bits per weight), enabling dramatic reductions in memory (0.4 GB), energy (0.028 J\u002Ftoken), and latency (29 ms), while still competing with state-of-the-art full-precision models across diverse benchmarks. \u003Cbr>● New Pareto frontier in efficiency-performance: Trained from scratch on 4T tokens, BitNet b1.58 2B4T outperforms or matches open full-precision models (e.g., Qwen2.5 1.5B, MiniCPM 2B) on tasks like ARC-Challenge, PIQA, WinoGrande, and GSM8K. It achieves a 54.19% average across 16 benchmarks, comparable to Qwen2.5-1.5B’s 55.23%, but with ~6.5× lower memory and 10× lower energy usage.   
\u003Cbr>● Outperforms quantized baselines: Against INT4  post-training quantized Qwen2.5 models (GPTQ\u002FAWQ), BitNet is both  smaller and more accurate, showing the advantage of native 1-bit  training over PTQ approaches.   \u003Cbr>● Architectural & training innovations: It replaces  standard linear layers with BitLinear layers using absmean ternary  quantization and 8-bit activations, combines RoPE embeddings, squared  ReLU activation, and bias-free layers. Training includes cosine LR and  weight decay schedules, plus supervised fine-tuning and Direct  Preference Optimization (DPO) instead of full RLHF.   \u003Cbr>● Best-in-class among 1-bit LLMs: When compared to  other 1-bit models like OLMo-Bitnet (1B) and post-quantized  Falcon3\u002FLlama3 (7B–8B), BitNet b1.58 2B4T is +10 pts stronger on  average, establishing a new benchmark for ultra-efficient LLMs.  The authors also release optimized CUDA kernels for GPU and a C++ inference library for CPU, enabling practical deployment of 1-bit LLMs on diverse hardware. BitNet b1.58 2B4T demonstrates that extreme quantization does not mean compromised capability, and it opens the door to the broader adoption of LLMs in resource-constrained environments. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.12285) |\n| 3) UI-TARS  UI-TARS introduces a powerful, end-to-end native GUI agent that operates purely from visual screenshots, performing human-like keyboard and mouse interactions across platforms. Unlike existing modular agent frameworks that rely on prompt engineering and external scripts, UI-TARS integrates perception, action, reasoning, and memory directly into its architecture, achieving strong generalization and adaptability in dynamic real-world settings.  Key contributions:   \u003Cbr>● Enhanced GUI Perception: UI-TARS is trained on a  large-scale, richly annotated dataset of screenshots with metadata,  enabling dense captioning, state transition understanding, and precise  element description. 
It excels in perception benchmarks like VisualWebBench, scoring 82.8 and outperforming GPT-4o. \u003Cbr>● Unified Action Modeling and Grounding: UI-TARS standardizes actions across platforms into a shared action space and learns from large-scale multi-step action traces. It surpasses baselines in grounding tasks with 38.1 on ScreenSpot Pro, the new SOTA. \u003Cbr>● System-2 Reasoning via “Thoughts”: Inspired by ReAct-style frameworks, UI-TARS generates internal reasoning steps (thoughts) before actions. These thoughts reflect patterns like task decomposition, reflection, and long-term consistency, significantly improving performance in complex scenarios. For example, in OSWorld, UI-TARS-72B-DPO scores 24.6 with a 50-step budget, outperforming Claude. \u003Cbr>● Iterative Self-Improvement with Reflective Learning: UI-TARS continuously refines itself through online trace collection and reflection tuning using error correction and post-error adaptation data. This allows it to recover from mistakes and adapt with minimal human oversight.  Overall, UI-TARS marks a significant step forward in GUI automation, setting new benchmarks across more than 10 datasets and outperforming top commercial agents like GPT-4o and Claude. Its open-source release aims to drive further innovation in native agent development. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12326), [Blog](https:\u002F\u002Fseed-tars.com\u002F1.5\u002F) |\n| 4) Describe Anything  Introduces DAM, a model that generates fine-grained, region-specific captions in both images and videos. The authors address key limitations in prior vision-language models, namely the inability to preserve local detail and the lack of suitable datasets and benchmarks for detailed localized captioning (DLC).  
Key contributions:   \u003Cbr>● DAM (Describe Anything Model) uses two main  innovations to capture both fine regional detail and global scene  context: a focal prompt that provides high-resolution encoding of  user-specified regions, and a localized vision backbone that uses  gated cross-attention to integrate context from the entire image. This  enables DAM to generate multi-granular, accurate descriptions,  especially for small or occluded regions.   \u003Cbr>● DLC-SDP (Semi-supervised Data Pipeline) tackles data  scarcity by expanding segmentation datasets with VLM-generated  detailed captions, followed by self-training on web images. This  produces high-quality, diverse training data, enabling DAM to  outperform API-only baselines like GPT-4o across several benchmarks.   \u003Cbr>● DLC-Bench is a reference-free benchmark that scores models on  their ability to accurately include or exclude region-specific details  using LLM judges. It provides a more reliable evaluation than  traditional caption-matching metrics, which often penalize models for  valid but unmatched details.   \u003Cbr>● Performance: DAM sets a new state-of-the-art on 7 benchmarks  across keyword, phrase, and detailed multi-sentence captioning tasks  in both images and videos. It outperforms GPT-4o, Claude 3.7, and  other top VLMs in both zero-shot and in-domain evaluations, achieving  up to 33.4% improvement over prior models on detailed image captioning  and 19.8% on video captioning. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16072) |\n| 5) UXAgent  Introduces a novel framework, UXAgent, for simulating large-scale usability testing using LLM-driven agents. The system empowers UX researchers to test and iterate web design and study protocols before engaging real users. This is achieved through the orchestration of simulated agents with diverse personas interacting in real web environments, providing both behavioral and reasoning data. 
Key highlights:   \u003Cbr>● LLM-Powered Simulation with Personas: UXAgent begins  with a Persona Generator that can produce thousands of demographically  diverse simulated users based on custom distributions. Each persona is  fed into an LLM Agent that embodies user intent and interacts with the  website via a Universal Browser Connector—a module capable of  interpreting and manipulating real HTML structures.   \u003Cbr>● Dual-Loop Reasoning Architecture: At the heart of  UXAgent is a dual-process agent architecture inspired by cognitive  psychology: a Fast Loop for low-latency actions and a Slow Loop for  deep reasoning. This design mimics System 1 and System 2 thinking and  allows agents to act responsively while maintaining coherent  high-level plans and reflections.   \u003Cbr>● Rich Memory Stream: All observations, actions, plans,  reflections, and spontaneous thoughts (“wonders”) are stored in a  Memory Stream. These memories are dynamically prioritized for  retrieval using a weighted scoring system based on importance,  recency, and relevance, tailored separately for fast and slow modules.   \u003Cbr>● Replay and Interview Interfaces: UX researchers can  review simulated sessions via a Simulation Replay Interface and  conduct natural language conversations with agents using an   Agent Interview Interface. This supports qualitative analysis, such as  asking agents about their decisions or presenting mockups for  feedback.   \u003Cbr>● Empirical Evaluation: A case study involving 60 LLM agent  simulations on a shopping platform (WebArena) showed that researchers  were able to detect usability study flaws and gather early insights. A  follow-up user study with five UX professionals found the system  helpful for iterating study design, despite some concerns over realism  and data noise. Particularly appreciated was the ability to converse  with agents and gather qualitative insights that would be infeasible  in traditional pilots.   
\u003Cbr>● Future Implications: The authors position LLM agents not as  replacements for real participants, but as early-stage collaborators  in the design process, reducing the cost and risk of flawed studies.  They also discuss extensions to multimodal settings, desktop or mobile  interfaces, and broader agentic tasks such as digital twins or  simulated A\u002FB testing. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.09407) |\n| 6) Test-Time Reinforcement Learning  Test-Time Reinforcement Learning (TTRL) is a method that allows LLMs to improve themselves during inference without ground-truth labels. Instead of relying on labeled datasets, TTRL uses majority voting over multiple model generations to estimate pseudo-rewards, enabling reinforcement learning (RL) on unlabeled test data. The method integrates Test-Time Scaling (TTS) and Test-Time Training (TTT) strategies, letting models adapt dynamically to new and challenging inputs.  Key highlights:   \u003Cbr>● Majority Voting as Reward: TTRL generates multiple  candidate outputs for a query and uses majority voting to derive a  pseudo-label. Rewards are assigned based on agreement with the  consensus answer.   \u003Cbr>● Significant Performance Gains: Applying TTRL to  Qwen2.5-Math-7B leads to a +159% improvement on AIME 2024 and +84%  average gains across AIME, AMC, and MATH-500 benchmarks, without using  any labeled training data.   \u003Cbr>● Self-Evolution Beyond Supervision: Remarkably, TTRL  surpasses the performance ceiling of its own majority-vote supervision  (Maj@N) and approaches the performance of models trained with full  label leakage, indicating efficient and stable unsupervised RL.   \u003Cbr>● Generalization and Robustness: TTRL generalizes well  across tasks, maintains effectiveness even under label estimation  noise, and is compatible with different RL algorithms like PPO and  GRPO.   
\u003Cbr>● Limitations: TTRL may fail when the base model lacks sufficient  prior knowledge about the domain or when hyperparameters (like batch  size and temperature) are poorly tuned. | [Paper](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2504.16084) |\n| 7) Discovering Values in Real-World Language Model Interactions  This paper presents the first large-scale empirical analysis of values exhibited by a deployed AI assistant, Claude 3 and 3.5 models, using over 300,000 real-world conversations. The authors develop a bottom-up, privacy-preserving framework to extract, classify, and analyze AI-expressed normative considerations (“values”) and show how they vary across tasks, user values, and conversational contexts.   \u003Cbr>● The authors identify 3,307 unique AI values, which are organized  into a five-domain taxonomy: Practical, Epistemic, Social, Protective,  and Personal. Practical and epistemic values dominate, often aligning  with Claude’s training goals around being helpful, harmless, and  honest.   \u003Cbr>● Claude’s most common values, such as helpfulness (23.4%),  professionalism, transparency, and clarity, are context-invariant and  reflect its role as a service-oriented assistant. In contrast, human  values like authenticity and efficiency are more varied.   \u003Cbr>● Many values are context-specific. For example, healthy boundaries  arise in relationship advice, historical accuracy in controversial  event discussions, and human agency in AI governance contexts.   \u003Cbr>● Claude tends to mirror human values in supportive contexts (20.1%  mirroring rate), but expresses opposing values during resistance,  especially in cases involving unethical or policy-violating requests  (e.g., resisting “moral nihilism” with “ethical integrity”).   
\u003Cbr>● Explicit value expression (e.g., “I value transparency”) occurs more  often in moments of resistance or reframing, particularly around  epistemic and ethical principles like intellectual honesty and harm  prevention. This suggests that AI values become most visible when the  system is challenged.   \u003Cbr>● Across Claude variants, 3 Opus expresses more emotionally nuanced  and ethically grounded values (e.g., academic rigor, emotional  authenticity) and shows a stronger inclination for both support and  resistance compared to 3.5\u002F3.7 Sonnet. | [Paper](https:\u002F\u002Fassets.anthropic.com\u002Fm\u002F18d20cca3cde3503\u002Foriginal\u002FValues-in-the-Wild-Paper.pdf), [Tweet](https:\u002F\u002Fx.com\u002FAnthropicAI\u002Fstatus\u002F1914333220067213529) |\n| 8) Evaluate the Goal-Directedness of LLMs  Introduces a new framework to assess whether LLMs use their capabilities effectively toward achieving given goals. The study finds that even top models like GPT-4o and Claude 3.7 fall short of full goal-directedness, particularly in information-gathering and combined tasks, despite performing well in isolated subtasks. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.11844), [Tweet](https:\u002F\u002Fx.com\u002Ftom4everitt\u002Fstatus\u002F1912806499862139275), [GitHub](https:\u002F\u002Fgithub.com\u002FCrista23\u002Fgoal_directedness_llms) |\n| 9) General-Reasoner  General-Reasoner is a reinforcement learning approach that boosts LLM reasoning across diverse domains by using a 230K-question dataset and a model-based verifier trained to understand semantics beyond exact matches. It outperforms strong baselines like SimpleRL and Qwen2.5 on both general reasoning (MMLU-Pro, GPQA, SuperGPQA) and math tasks (MATH-500, GSM8K), showing over 10-point gains without sacrificing mathematical capability. 
| [Paper](https:\u002F\u002Fgithub.com\u002FTIGER-AI-Lab\u002FGeneral-Reasoner\u002Fblob\u002Fmain\u002FGeneral_Reasoner.pdf), [Tweet](https:\u002F\u002Fx.com\u002FWenhuChen\u002Fstatus\u002F1912242238110789671) |\n| 10) Tiny Reasoning Models  Tina is a family of 1.5B parameter reasoning models trained using LoRA-based reinforcement learning (RL) to achieve high reasoning accuracy at very low cost. It outperforms or matches full fine-tuned models on reasoning tasks like AIME and MATH with only ~$9 post-training cost, demonstrating that efficient reasoning can be instilled via minimal updates to a tiny model. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15777) |\n\n## Top ML Papers of the Week (April 14 - April 20) - 2025\n| **Paper**  | **Links** | \n| ------------- | ------------- |\n| 1) GUI-R1  Researchers from the National University of Singapore and the Chinese Academy of Sciences introduce GUI-R1, a reinforcement learning (RL) framework aimed at improving graphical user interface (GUI) agents through unified action-space modeling. Key insights include:   \u003Cbr>● Reinforcement Fine-Tuning (RFT) over Supervised  Fine-Tuning (SFT) – GUI-R1 utilizes RFT inspired by methods  such as DeepSeek-R1, significantly reducing training data  requirements. It uses only 3K carefully curated examples versus  millions used by previous models.   \u003Cbr>● Unified Action Space and Reward Modeling –  The authors introduce a unified action space that covers actions  across different platforms (Windows, Linux, MacOS, Android, and Web).  This enables consistent reward signals for evaluating GUI actions,  enhancing the model’s adaptability and generalization.   \u003Cbr>● Superior Performance with Minimal Data – GUI-R1  outperforms state-of-the-art methods like OS-Atlas using merely 0.02%  of the training data (3K vs. 13M). 
Evaluations across eight benchmarks spanning mobile, desktop, and web platforms show significant improvements in grounding, low-level, and high-level GUI task capabilities. \u003Cbr>● Efficient Training and Strong Generalization – By leveraging policy optimization algorithms like Group Relative Policy Optimization (GRPO), GUI-R1 quickly converges to high performance, demonstrating robustness and efficiency even in resource-constrained scenarios. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.10458) |\n| 2) Scaling Reasoning in Diffusion LLMs via RL  Proposes d1, a two‑stage recipe that equips masked diffusion LLMs with strong step‑by‑step reasoning. \u003Cbr>● Two‑stage pipeline (SFT → diffu‑GRPO) – d1 first applies supervised fine‑tuning on the 1k‑example s1K dataset and then runs task‑specific RL with the new diffu‑GRPO objective, yielding larger gains than either stage alone. \u003Cbr>● diffu‑GRPO: RL for masked dLLMs – Extends GRPO to diffusion LLMs via (i) a mean‑field sequence‑log‑prob approximation and (ii) a one‑step per‑token log‑prob estimator with random prompt masking, enabling many gradient updates from a single generation. \u003Cbr>● Consistent gains on four reasoning benchmarks – On GSM8K, MATH500, Countdown, and Sudoku, diffu‑GRPO beats SFT, and the full d1‑LLaDA variant attains the best scores (e.g., 81.1% GSM8K & 38.6% MATH500 at 256 tokens, +5–12 pp over baseline). \u003Cbr>● Competitive among 7–8B models – d1‑LLaDA outperforms DeepSeek‑7B, Mistral‑7B, and Llama‑3‑8B on GSM8K and ranks second on MATH500 in the same size class. \u003Cbr>● Longer decoding unlocks “aha moments” – At 512‑token generation, the model shows self‑verification\u002Fbacktracking; effective‑token usage grows smoothly, echoing test‑time compute scaling trends.   
\u003Cbr>● Random masking speeds RL – Ablations show that  random prompt masking during diffu‑GRPO accelerates convergence and  boosts correctness relative to fixed masking, with fewer online  generations needed. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.12216) |\n| 3) Enhancing Non-Reasoning Models with Reasoning Models  Researchers explore how to distill reasoning-intensive outputs (answers and explanations) from top-tier LLMs into more lightweight models that don’t explicitly reason step by step. By fine-tuning smaller models on the high-quality final answers (and optionally summarized thinking traces) from advanced reasoning models, they demonstrate consistent performance boosts across multiple benchmarks.   \u003Cbr>● Test-time scaling vs. knowledge distillation –  While large models like DeepSeek-R1 and OpenAI-o1 can allocate more  compute to generate better reasoning traces, this paper focuses on  systematically transferring those rich final answers (and possibly a  summarized version of the reasoning steps) to more compact models.   \u003Cbr>● Data curation – The authors construct a 1.3M-instance  dataset by pulling prompts from multiple open-source repositories  (including Infinity Instruct, CodeContests, FLAN, etc.) and generating  final answers plus detailed reasoning from DeepSeek-R1.   \u003Cbr>● Three fine-tuning strategies – (1) Use the original  baseline answers from existing   open-source sets, (2) fine-tune on only the final answer portion of a  reasoning model, and (3) combine a summarized chain-of-thought with  the final answer. Models trained on the second strategy excelled at  math\u002Fcoding tasks, while the third approach proved better for more  conversational or alignment-oriented tasks.   \u003Cbr>● Empirical gains – Fine-tuning Qwen2.5-32B on the reasoning  model’s final answers led to notable improvements on GSM8K (92.2%) and  HumanEval (90.9%). 
A think-summarization approach boosted a different  set of benchmarks (GPQA and chat-based tests). However, weaving in the  “thinking trace” sometimes caused slight drops in instruction  strictness (IFEval).   \u003Cbr>● Trade-offs and future work – Distilling advanced  reasoning data definitely helps smaller models, but deciding how much  of the reasoning trace to include is domain-dependent. The authors  suggest that more refined ways of seamlessly blending reasoning steps  into final answers (e.g., specialized prompts or partial merges) could  further improve performance and avoid alignment regressions. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.09639) |\n| 4) AgentA\u002FB  AgentA\u002FB is a fully automated A\u002FB testing framework that replaces live human traffic with large-scale LLM-based agents. These agents simulate realistic, intention-driven user behaviors on actual web environments, enabling faster, cheaper, and risk-free UX evaluations — even on real websites like Amazon. Key Insights:   \u003Cbr>● Modular agent simulation pipeline – Four  components—agent generation, condition prep, interaction loop, and  post-analysis—allow plug-and-play simulations on live webpages using  diverse LLM personas.   \u003Cbr>● Real-world fidelity – The system parses live DOM into JSON,  enabling structured interaction loops (search, filter, click,  purchase) executed via LLM reasoning + Selenium.   \u003Cbr>● Behavioral realism – Simulated agents show more  goal-directed but comparable interaction patterns vs. 1M real Amazon  users (e.g., shorter sessions but similar purchase rates).   \u003Cbr>● Design sensitivity – A\u002FB test comparing full vs. reduced  filter panels revealed that agents in the treatment condition clicked  more, used filters more often, and purchased more.   \u003Cbr>● Inclusive prototyping – Agents can represent hard-to-reach  populations (e.g., low-tech users), making early-stage UX testing more  inclusive and risk-free.   
\u003Cbr>● Notable results – AgentA\u002FB shows how LLM agents can augment, not replace, traditional A\u002FB testing by offering a new pre-deployment simulation layer. This can accelerate iteration, reduce development waste, and support UX inclusivity without needing immediate live traffic. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.09723) |\n| 5) Reasoning Models Can Be Effective Without Thinking  This paper challenges the necessity of long chain-of-thought (CoT) reasoning in LLMs by introducing a simple prompting method called NoThinking, which bypasses explicit \"thinking\" steps. Surprisingly, NoThinking performs comparably to or better than traditional reasoning under comparable or even lower compute budgets, especially when paired with parallel decoding and best-of-N selection.  Key Insights: \u003Cbr>● NoThinking prepends a dummy “Thinking” block and jumps straight to final answers. \u003Cbr>● Despite skipping structured reasoning, it outperforms Thinking in pass@k (1–64) on many benchmarks, especially under token constraints. \u003Cbr>● With parallel scaling, NoThinking achieves higher pass@1 accuracy than Thinking while using 4× fewer tokens and up to 9× lower latency. \u003Cbr>● Tasks evaluated: competitive math (AIME24\u002F25, AMC23, OlympiadBench), coding (LiveCodeBench), and formal theorem proving (MiniF2F, ProofNet). \u003Cbr>● NoThinking provides superior accuracy–latency tradeoffs and generalizes across diverse tasks.  Results: \u003Cbr>● Low-budget wins: On AMC23 (700 tokens), NoThinking achieves 51.3% vs. 28.9% (Thinking). \u003Cbr>● Better scaling: As k increases, NoThinking consistently surpasses Thinking. \u003Cbr>● Efficiency frontier: Across benchmarks, NoThinking dominates the accuracy–cost Pareto frontier. \u003Cbr>● Parallel wins: With simple confidence-based or majority-vote strategies, NoThinking + best-of-N beats full Thinking on pass@1 with significantly less latency. 
| [Paper](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2504.09858) |\n| 6) SocioVerse  Researchers from Fudan University and collaborators propose SocioVerse, a large-scale world model for social simulation using LLM agents aligned with real-world user behavior. Key ideas include: \u003Cbr>● Four-fold alignment framework – SocioVerse tackles the major challenge of aligning simulated environments with reality across four dimensions. \u003Cbr>● Representative simulations with strong empirical accuracy – SocioVerse demonstrates its generalizability through three representative simulation scenarios. \u003Cbr>● Ablation insights – Removing the prior demographic distribution and user knowledge severely degrades election prediction accuracy (Acc drops from 0.80 → 0.60), highlighting the value of realistic population modeling. \u003Cbr>● Toward trustworthy virtual societies – SocioVerse not only standardizes scalable social simulations but also provides a sandbox for testing sociopolitical hypotheses (e.g., fairness, policy change), bridging AI agent systems with traditional social science. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.10157) |\n| 7) DocAgent  Researchers from Meta AI present DocAgent, a tool‑integrated, dependency‑aware framework that turns large, complex codebases into well‑written docstrings. Key ideas include: \u003Cbr>● Topological Navigator for context building – DocAgent parses the repository’s AST, builds a dependency DAG, and documents components in topological order, so each function\u002Fclass is visited only after its prerequisites, enabling incremental context accumulation and preventing context‑length explosions. \u003Cbr>● Role‑specialised agent team – Five agents work together: Reader analyses code, Searcher gathers internal & external references, Writer drafts docstrings, Verifier critiques and revises them, while the Orchestrator manages iterations until quality converges.   
\u003Cbr>● Adaptive context management – When retrieved context exceeds the model’s token budget, the Orchestrator trims low‑priority segments while preserving overall structure, keeping generation efficient and faithful. \u003Cbr>● Three‑facet automatic evaluation – A new framework scores Completeness (section coverage), Helpfulness (LLM‑as‑judge semantic utility), and Truthfulness (entity grounding against the code DAG) for every docstring. \u003Cbr>● Substantial gains over baselines – On 366 components across nine Python repos, DocAgent + GPT‑4o‑mini lifts Completeness to 0.934 vs 0.815, Helpfulness to 3.88\u002F5 vs 2.95, and Truthfulness (existence ratio) to 95.7% vs 61.1% compared with a ChatGPT baseline; FIM baselines fare far worse. \u003Cbr>● Navigator is crucial – An ablation that randomises processing order drops helpfulness by 0.44 and truthfulness by 7.9 pp, confirming the importance of dependency‑aware traversal. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.08725) |\n| 8) SWE-PolyBench  SWE-PolyBench is a new multi-language benchmark for evaluating coding agents on real-world software tasks across Java, JavaScript, TypeScript, and Python. It introduces execution-based assessments, syntax tree metrics, and reveals that current agents struggle with complex tasks and show inconsistent performance across languages. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.08703v1) |\n| 9) A Survey of Frontiers in LLM Reasoning  This survey categorizes LLM reasoning methods by when reasoning occurs (inference-time vs. training) and the system's architecture (standalone vs. agentic or multi-agent). It highlights trends like learning-to-reason (e.g., DeepSeek-R1) and agentic workflows (e.g., OpenAI Deep Research), covering prompt engineering, output refinement, and learning strategies such as PPO and verifier training. 
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.09037) |\n| 10) Advances in Embodied Agents, Smart Cities, and Earth Science  This paper surveys how spatial intelligence manifests across disciplines—from embodied agents to urban and global systems—by connecting human spatial cognition with how LLMs handle spatial memory, representations, and reasoning. It offers a unifying framework to bridge research in AI, robotics, urban planning, and earth science, highlighting LLMs’ evolving spatial capabilities and their interdisciplinary potential. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.09848) |\n\n## Top ML Papers of the Week (April 6 - April 13) - 2025\n| **Paper**  | **Links** |\n| ------------- | ------------- |\n| 1) The AI Scientist V2  The AI Scientist-v2 refines and extends its predecessor to achieve a new milestone: autonomously generating a workshop-accepted research manuscript. The system removes dependencies on human-authored code templates, incorporates agentic tree-search methods for deeper exploration, uses Vision-Language Models to refine figures, and demonstrates impressive real-world outcomes by passing the peer-review bar.   \u003Cbr>● Enhanced Autonomy – Eliminates reliance on human-crafted  code templates, enabling out-of-the-box deployment across diverse ML  domains.   \u003Cbr>● Agentic Tree Search – Systematically searches and  refines hypotheses through a branching exploration, managed by a new  experiment manager agent.   \u003Cbr>● VLM Feedback Loop – Integrates Vision-Language Models in  the reviewing process to critique and improve experimental figures and  paper aesthetics.   \u003Cbr>● Workshop Acceptance – Generated three fully autonomous  manuscripts for an ICLR workshop; one was accepted, showcasing the  feasibility of AI-driven end-to-end scientific discovery. 
| [Paper](https://pub.sakana.ai/ai-scientist-v2/paper/paper.pdf), [Tweet](https://x.com/SakanaAILabs/status/1909497165925536212) |
| 2) Benchmarking Browsing Agents  OpenAI introduces BrowseComp, a benchmark with 1,266 questions that require AI agents to locate hard-to-find, entangled information on the web. Unlike saturated benchmarks like SimpleQA, BrowseComp demands persistent and creative search across numerous websites, offering a robust testbed for real-world web-browsing agents. Key insights: <br>● Extremely difficult questions: Benchmarked tasks were verified to be unsolvable by humans in under 10 minutes and also by GPT-4o (with/without browsing), OpenAI o1, and earlier Deep Research models. <br>● Human performance is low: Only 29.2% of problems were solved by humans (even with 2-hour limits); 70.8% were abandoned. <br>● Model performance: <br>● Test-time scaling matters: Accuracy improves with more browsing attempts. With 64 parallel samples and best-of-N aggregation, Deep Research significantly boosts its performance (15–25% gain over a single attempt). <br>● Reasoning > browsing: OpenAI o1 (no browsing but better reasoning) outperforms GPT-4.5 with browsing, showing that tool use alone isn't enough—strategic reasoning is key. <br>● Calibration struggles: Models with browsing access often exhibit overconfidence in incorrect answers, revealing current limits in uncertainty estimation. <br>● Dataset diversity: Includes a wide topical spread: TV/movies, science, art, sports, politics, geography, etc.
| [Paper](https://cdn.openai.com/pdf/5e10f4ab-d6f7-442e-9508-59515c65e35d/browsecomp.pdf), [Blog](https://openai.com/index/browsecomp/), [Tweet](https://x.com/OpenAI/status/1910393421652520967) |
| 3) OLMoTrace  Allen Institute for AI & University of Washington present OLMoTrace, a real-time system that traces LLM-generated text back to its verbatim sources in the original training data, even across multi-trillion-token corpora. <br>● What it does: For a given LM output, OLMoTrace highlights exact matches with training data segments and lets users inspect the full documents behind those matches. Think "reverse-engineering" a model’s response via lexical lookup. <br>● How it works: <br>● Supported models: Works with OLMo models (e.g., OLMo-2-32B-Instruct) and their full pre/mid/post-training datasets, totaling 4.6T tokens. <br>● Use cases: <br>● Benchmarked: <br>● Not RAG: It retrieves after generation, without changing the output, unlike retrieval-augmented generation. | [Paper](https://arxiv.org/abs/2504.07096), [Tweet](https://x.com/omarsar0/status/1910323386603262316), [Blog](https://5910970.hs-sites.com/olmotrace-points-model-output-back-to-training-data?ecid=ACsprvuggQcD4yCdO--rKTZKDvmczdSQkb96ct95zLH9eiysrXjF_WuKgsmIMaz8byfiL1H1-2A6&utm_campaign=AI2%20Newsletter&utm_medium=email&_hsenc=p2ANqtz-__MqUAVPXfHPpHpf2xC86iZG8qC3J-z5nW141VBN9gZW4j61ymW3dM7mhkiHGTWtjQt3Eao7Cqf7pB1k24CfEhYe9fmA&_hsmi=355925505) |
| 4) Concise Reasoning via RL  This paper proposes a training strategy that promotes concise and accurate reasoning in LLMs using RL. It challenges the belief that long responses improve accuracy, offering both theoretical and empirical evidence that conciseness often correlates with better performance.
<br>● Long ≠ better reasoning – The authors mathematically show that RL with PPO tends to generate unnecessarily long responses, especially when answers are wrong. Surprisingly, shorter outputs correlate more with correct answers, across reasoning and non-reasoning models. <br>● Two-phase RL for reasoning + conciseness – They introduce a two-phase RL strategy: (1) train on hard problems to build reasoning ability (length may increase), then (2) fine-tune on occasionally solvable tasks to enforce concise CoT without hurting accuracy. The second phase alone dramatically reduces token usage by over 50%, with no loss in accuracy. <br>● Works with tiny data – Their method succeeds with as few as 4–8 training examples, showing large gains in both math and STEM benchmarks (MATH, AIME24, MMLU-STEM). For instance, on MMLU-STEM, they improved accuracy by +12.5% while cutting response length by over 2×. <br>● Better under low sampling – Post-trained models remain robust even when the temperature is reduced to 0. At temperature=0, the fine-tuned model outperformed the baseline by 10–30%, showing enhanced deterministic performance. <br>● Practical implications – Besides improving model output, their method reduces latency, cost, and token usage, making LLMs more deployable. The authors also recommend setting λ < 1 during PPO to avoid instability and encourage correct response shaping. | [Paper](https://arxiv.org/abs/2504.05185), [Tweet](https://x.com/omarsar0/status/1909634850304503977) |
| 5) Rethinking Reflection in Pre-Training  Reflection — the ability of LLMs to identify and correct their own reasoning — has often been attributed to reinforcement learning or fine-tuning. This paper argues otherwise: reflection emerges during pre-training.
The authors introduce adversarial reasoning tasks to show that self-reflection and correction capabilities steadily improve as compute increases, even in the absence of supervised post-training. Key contributions: <br>● Propose two kinds of reflection: <br>● Build six adversarial datasets (GSM8K, TriviaQA, CruxEval, BBH) to test reflection across math, coding, logic, and knowledge domains. On GSM8K-Platinum, explicit reflection rates grow from ~10% to 60% with increasing pre-training tokens. <br>● Demonstrate that simple triggers like “Wait,” reliably induce reflection. <br>● Evaluate 40 OLMo-2 and Qwen2.5 checkpoints, finding a strong correlation between pre-training compute and both accuracy and reflection rate. Why it matters: <br>● Reflection is a precursor to reasoning and can develop before RLHF or test-time decoding strategies. <br>● Implication: we can instill advanced reasoning traits with better pre-training data and scale, rather than relying entirely on post-training tricks. <br>● They also show a trade-off: more training compute reduces the need for expensive test-time compute like long CoT traces. | [Paper](https://arxiv.org/abs/2504.04022), [Tweet](https://x.com/ashVaswani/status/1909642828554387675) |
| 6) Efficient KG Reasoning for Small LLMs  LightPROF is a lightweight framework that enables small-scale language models to perform complex reasoning over knowledge graphs (KGs) using structured prompts. Key highlights: <br>● Retrieve-Embed-Reason pipeline – LightPROF introduces a three-stage architecture: <br>● Plug-and-play & parameter-efficient – LightPROF trains only the adapter and projection modules, allowing seamless integration with any open-source LLM (e.g., LLaMA2-7B, LLaMA3-8B) without expensive fine-tuning.
<br>● Outperforms larger models – Despite using small LLMs, LightPROF beats baselines like StructGPT (ChatGPT) and ToG (LLaMA2-70B) on KGQA tasks: 83.8% (vs. 72.6%) on WebQSP and 59.3% (vs. 57.6%) on CWQ. <br>● Extreme efficiency – Compared to StructGPT, LightPROF reduces token input by 98% and runtime by 30%, while maintaining accuracy and stable output even on complex multi-hop questions. <br>● Ablation insights – Removing structural signals or training steps severely degrades performance, confirming the critical role of the Knowledge Adapter and retrieval strategy. | [Paper](https://arxiv.org/abs/2504.03137), [Tweet](https://x.com/omarsar0/status/1910319109096747191) |
| 7) Computer Agent Arena  Computer Agent Arena is a new open platform for benchmarking LLM- and VLM-based agents on real-world computer-use tasks, like coding, editing, and web navigation, using a virtual desktop environment. Initial results show that OpenAI and Anthropic are leading with modest success rates, while the platform aims to grow through crowdsourced tasks, agent submissions, and open-sourcing of its infrastructure. | [Report](https://arena.xlang.ai/blog/computer-agent-arena), [Tweet](https://x.com/BowenWangNLP/status/1909618451259572328) |
| 8) Agentic Knowledgeable Self-awareness  KnowSelf is a new framework that introduces agentic knowledgeable self-awareness, enabling LLM agents to dynamically decide when to reflect or seek knowledge based on situational complexity, mimicking human cognition. Using special tokens for "fast," "slow," and "knowledgeable" thinking, KnowSelf reduces inference costs and achieves state-of-the-art performance on ALFWorld and WebShop tasks with minimal external knowledge.
| [Paper](https://arxiv.org/abs/2504.03553v1) |
| 9) One-Minute Video Generation with Test-Time Training  This work introduces TTT layers, a novel sequence-modeling component whose hidden states are neural networks updated via a self-supervised loss at test time. By integrating these into a pre-trained diffusion model, the authors enable single-shot generation of one-minute, multi-scene videos from storyboards, scoring 34 Elo points higher than strong baselines like Mamba 2 and DeltaNet in human evaluations. | [Paper](https://test-time-training.github.io/video-dit/assets/ttt_cvpr_2025.pdf), [Tweet](https://x.com/karansdalal/status/1909312851795411093) |
| 10) NoProp  NoProp is a novel gradient-free learning method where each neural network layer independently learns to denoise a noisy version of the target, inspired by diffusion and flow matching. Unlike backpropagation, it avoids hierarchical representation learning and achieves competitive performance and efficiency on image classification benchmarks like MNIST and CIFAR. | [Paper](https://arxiv.org/abs/2503.24322) |

## Top ML Papers of the Week (March 31 - April 6) - 2025
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) PaperBench  OpenAI introduces a new benchmark, PaperBench, to test whether AI agents can replicate cutting-edge machine learning research papers from scratch. <br>● A rigorous replication challenge – PaperBench evaluates agents on reproducing entire ML papers from ICML 2024 (20 total, across 12 research areas). Agents must understand the paper, build the codebase from scratch, and run experiments to match results. Each paper comes with a fine-grained rubric (~8,316 tasks in total) co-designed with the original authors.
<br>● Automatic grading with LLM judges – To make evaluation scalable, the team built a rubric-based judge (o3-mini with scaffolding) that scores replications with high agreement (F1 = 0.83) against human experts. They also release JudgeEval, a benchmark for assessing judge accuracy. <br>● Frontier model performance is modest – Claude 3.5 Sonnet scored highest with 21.0%, followed by o1 (13.2%) and GPT-4o (4.1%). Even with longer runtimes and prompt tuning (IterativeAgent), no model surpassed a 26.0% score. By contrast, ML PhDs hit 41.4% on a 3-paper subset in 48 hours, showing humans still lead in long-horizon agentic tasks. <br>● Code-Dev variant for lightweight evals – A simplified PaperBench Code-Dev version skips execution and just grades code structure. o1 scored 43.4% there, showing more promise when runtime issues are excluded. <br>● Failure modes and insights – Models often “gave up early,” lacked strategic planning, and failed to iterate. Claude did better with BasicAgent (freer form), while o1 benefited from IterativeAgent (structured prompts). This highlights how sensitive agents are to prompting and scaffolding. <br>● Open-source release – PaperBench (with rubrics, grading infra, and replication results) is fully open-sourced to drive further progress on long-horizon agent tasks and autonomous AI R&D. | [Paper](https://arxiv.org/abs/2504.01848), [Tweet](https://x.com/OpenAI/status/1907481490457506235), [GitHub](https://github.com/openai/preparedness) |
| 2) Command A: An Enterprise-Ready LLM  Cohere announced Command A, a 111B-parameter open-weights LLM built for enterprise-grade RAG, agents, code, and multilingual tasks. Key contributions: <br>● Modular expert merging for domain mastery – Instead of monolithic post-training, Command A uses a decentralized training pipeline.
Separate expert models are fine-tuned for specific domains (e.g., math, RAG, multilingual, safety, code), then merged into one model using efficient weighted parameter-soup techniques. This preserves most expert performance with just a ~1.8% average drop. <br>● Hybrid architecture for long-context efficiency – Command A interleaves sliding-window and full attention layers, achieving 256k context support with drastically lower KV cache memory usage—e.g., only ~33% of LLaMA 3 70B’s at 128k. It scores 95.0% on RULER, outperforming most long-context peers. <br>● Superb agentic capabilities – Built for RAG, tool use, and ReAct-style agents, Command A beats GPT-4o and Claude 3.5 on TauBench and BFCL. Tool use is trained via a blend of human-annotated and synthetic data, then aligned with CoPG and SRPO (self-improving preference optimization). <br>● Best-in-class enterprise evaluations – On real-world generative tasks (e.g., chat summarization, FAQ generation) and RAG use cases (long workplace policy documents), Command A tops the leaderboard with a 94.2% pass rate, 4.73 correctness, and 91% unanswerable-QA accuracy. <br>● Multilingual excellence – Command A is trained in 23 global languages with heavy data curation and preference tuning. It scores #1 in dialect alignment (ADI2), 90.3% average LPR (language consistency), and outperforms LLaMA 3.3, GPT-4o, and DeepSeek in manual Arena-style win rates across all languages. <br>● Polishing for human alignment – Final alignment used a ping-pong loop of offline SRPO and online CoPG with RLHF. This yielded +17pt human win-rate gains on code, +10pt on reasoning, and lifted Command A’s win rate over GPT-4o to parity (~50.4%). <br>● Fast, efficient, and open – Despite its power, Command A runs on just 2×A100s or H100s and generates 156 tokens/sec—faster than GPT-4o and DeepSeek. Model weights are released (CC-BY-NC) on Hugging Face.
| [Paper](https://arxiv.org/abs/2504.00698), [Tweet](https://x.com/nrehiew_/status/1908181303339471020), [Models](https://huggingface.co/CohereForAI/c4ai-command-a-03-2025) |
| 3) CodeScientist  Researchers at AI2 release CodeScientist, a system that autonomously generates and tests scientific hypotheses via code-based experimentation. It’s among the first to produce validated discoveries with minimal human input. Key ideas: <br>● Code-first scientific agent – CodeScientist reviews research papers and assembles experiments using vetted Python code blocks (e.g., for analysis, simulation). It follows a five-step pipeline: Ideation → Planning → Code Execution → Reporting → Meta-Analysis. <br>● Validated AI discoveries – From 50 AI research papers on agents and virtual environments, CodeScientist proposed 19 findings. Of these, 6 were judged scientifically sound and novel. Examples: <br>● Human-guided autonomy – Full automation is possible, but brief human feedback (e.g., ranking ideas) significantly boosts output quality. Human-in-the-loop interaction improves idea selection and experiment debugging. <br>● Challenges remain – Despite successes, over half the generated experiments fail due to code errors, not scientific flaws. Peer review is still needed to verify results, and current systems lack deep methodological rigor. | [Paper](https://arxiv.org/abs/2503.22708), [Blog](https://allenai.org/blog/codescientist), [GitHub](https://github.com/allenai/codescientist) |
| 4) Retrieval-Augmented Reasoning Model  Introduces RARE, a new paradigm for training domain-specific LLMs that focuses on reasoning, not memorization. Key ideas: <br>● Inspired by Bloom’s Taxonomy – RARE shifts LLM training from memorizing knowledge (“Remember”) to applying and evaluating it (“Analyze”, “Create”).
It separates domain knowledge (retrieved externally) from domain thinking (learned during training), enabling better performance under tight parameter budgets. <br>● Open-book prepared training – RARE injects retrieved knowledge into training prompts, letting models learn reasoning patterns instead of rote facts. This open-book, reasoning-first setup beats both standard SFT and RAG approaches, especially in medicine. <br>● Massive accuracy gains with small models – On five medical QA benchmarks, RARE-trained Llama-3.1-8B and Qwen-2.5-7B outperformed GPT-4 + RAG, with up to +20% accuracy boosts (e.g., PubMedQA: 78.63% vs. GPT-4’s 75.2%; CoVERT: 74.14% vs. GPT-4’s 65.67%). <br>● Training via distillation + adaptive retries – RARE distills answers (and reasoning paths) from a strong teacher (e.g., QwQ-32B), refining outputs until a correct answer is found. This creates a high-quality dataset that teaches contextualized, case-based thinking. <br>● New role for retrieval – Unlike standard RAG (used only at inference), RARE uses retrieval during training to shape reasoning. It models knowledge integration (p(k\|x, R(x))) and reasoning (p(r\|x, R(x), k)) as separate steps, replacing memorization with application. Overall, this work reframes LLM training for domain-specific intelligence: externalize facts, internalize reasoning. It unlocks strong performance from small models without overfitting or hallucination. | [Paper](https://arxiv.org/abs/2503.23513), [Tweet](https://x.com/omarsar0/status/1907796990966247484) |
| 5) Why do LLMs Attend to First Token?  This new paper explains why LLMs obsessively focus attention on the first token — a phenomenon known as an attention sink. Their theory: it’s a useful trick to prevent representational collapse in deep Transformers.
<br>● Sinks = over-mixing shields – LLMs with long contexts and deep layers tend to over-mix information, causing similar embeddings for all tokens (i.e., rank collapse or over-squashing). Attention sinks—where many heads fixate on the ⟨bos⟩ token—act as no-ops that reduce token interaction and preserve representation diversity across layers. <br>● Sharp experiments on Gemma & LLaMA – Perturbation tests in Gemma 7B show ⟨bos⟩ significantly slows the spread of changes through the model. Meanwhile, in LLaMA 3.1 models, over 80% of attention heads show strong sink behavior in the 405B variant, supporting the theory that larger models need stronger sinks. <br>● Sinks emerge naturally – Even without special pretraining, sinks tend to form at the first position, not because of the ⟨bos⟩ token itself but due to its location. However, if ⟨bos⟩ is fixed during training and later removed, performance collapses, showing that sink formation is data-dependent. <br>● Theoretical grounding – The authors connect sink emergence to Jacobian norm bounds, proving that sinks reduce sensitivity to token perturbations. Their math shows that deeper models and longer contexts require stronger sinks. <br>● Layerwise dynamics insight – Some attention heads use ⟨bos⟩ as a “default” target unless a special pattern (e.g., an apostrophe) triggers real computation. This supports a conditional attention mechanism—attend to ⟨bos⟩ unless needed elsewhere. | [Paper](https://arxiv.org/abs/2504.02732), [Tweet](https://x.com/omarsar0/status/1908187563422261411) |
| 6) Self-Evolving Multi-Agent Simulations for Realistic Clinical Interactions  Presents MedAgentSim, a fully automated, open-source hospital simulation where LLM-powered agents simulate doctor-patient interactions in dynamic diagnostic settings.
Unlike previous static QA benchmarks, MedAgentSim mimics real-world clinical workflows with multi-turn dialogue, test requests, and self-improvement. More about this paper: <br>● Active doctor agents – MedAgentSim requires LLM doctor agents to engage in multi-turn consultations, request labs and imaging (e.g., ECG, X-ray), and iteratively refine diagnoses, making it far more realistic than pre-filled medical QA datasets. <br>● Self-improvement via memory + reflection – The system maintains buffers of successful and failed diagnoses. It uses retrieved past cases (via kNN), chain-of-thought reasoning, and ensembling to improve performance over time. Misdiagnoses trigger a reflection phase before inclusion in memory. <br>● Fully autonomous or human-in-the-loop – Users can optionally take control of the doctor or patient agents. Simulation assets are built using a 2D game engine (Phaser), and the agents can navigate, converse, and interact with virtual medical tools. <br>● Big performance boost across benchmarks – On NEJM, MedQA, and MIMIC-IV, MedAgentSim (with LLaMA 3.3) outperforms baseline setups by +6–37%, especially in vision-language tasks using LLaVA for interpreting medical images. <br>● Bias analysis & fairness focus – The team studied diagnostic accuracy under cognitive and implicit bias conditions. Models like GPT-4o and LLaMA proved more robust than Mixtral/Mistral, highlighting the importance of bias-aware evaluation. | [Paper](https://arxiv.org/abs/2503.22678), [Tweet](https://x.com/omarsar0/status/1906719555482702147), [Code](https://github.com/MAXNORM8650/MedAgentSim) |
| 7) Open Deep Search  Researchers from Sentient, UW, Princeton, and UC Berkeley introduce Open Deep Search (ODS), an open-source search AI framework that rivals top proprietary systems like GPT-4o Search Preview and Perplexity Sonar.
Key insights: <br>● Two open components: search + reasoning – ODS has two modular parts: (1) Open Search Tool, which retrieves and refines high-quality web results using query rephrasing, snippet reranking, and site-specific logic; and (2) Open Reasoning Agent, a controller that orchestrates tool usage (search, calculator, etc.) to answer queries. Two variants are offered: ODS-v1 (ReAct) and ODS-v2 (CodeAct). <br>● SOTA open-source performance – With DeepSeek-R1 as the base LLM, ODS-v2 scores 88.3% on SimpleQA and 75.3% on FRAMES, beating GPT-4o Search Preview by +9.7% on the latter. ODS adapts the number of searches per query (avg. 3.39 on FRAMES), balancing cost and accuracy more efficiently than fixed-query baselines. <br>● Better than Perplexity Sonar – On both FRAMES and SimpleQA, ODS+DeepSeek-R1 outperforms Perplexity’s flagship search models, even in complex reasoning tasks involving multi-hop questions, time/date calculations, and name disambiguation. <br>● Code-based agents enhance reasoning – ODS-v2 builds on CodeAct, allowing it to write and run Python code to perform symbolic reasoning and tool calls. This results in sharper numerical precision and task flexibility compared to the CoT-based ReAct in ODS-v1. | [Paper](https://arxiv.org/abs/2503.20201), [Tweet](https://x.com/sewoong79/status/1906595129965912341), [GitHub](https://github.com/sentient-agi/OpenDeepSearch) |
| 8) Efficient Test-time Scaling with Code  Z1 is a new method for making large language models more compute-efficient at test time, especially during reasoning. The core idea is to train LLMs on short and long code-based reasoning trajectories, and then dynamically adjust reasoning depth during inference. Key contributions: <br>● Z1-Code-Reasoning-107K dataset – They construct a 107K-sample dataset with short and long reasoning paths for simple and complex coding problems.
Trajectories are distilled from QwQ-32B and paired to help the model learn when to stop thinking. <br>● Shifted Thinking Window – A new test-time strategy that eliminates explicit `<think>` delimiters. Instead, the model adapts its reasoning token budget to problem difficulty: simple problems invoke shallow reasoning, while complex ones are capped (e.g., 4096 tokens max), with hints nudging the model to finalize the answer. <br>● Big efficiency gains – The 7B-scale model Z1-7B matches R1-Distill-Qwen-7B across multiple reasoning tasks (MATH500, LiveCodeBench, GPQA Diamond) with only ~30% of the reasoning tokens. For instance, on GPQA Diamond, Z1-7B achieves 47.5% while using less than half the tokens. <br>● Code reasoning transfers to general tasks – Despite being trained only on code-based CoT data, Z1 generalizes well to broader domains like science and math, outperforming other 7B reasoning models (e.g., OpenThinker-7B, s1.1-7B) across multiple benchmarks. <br>● What makes reasoning data effective? – Ablation studies reveal two key dataset-design levers: (1) longer reasoning trajectories improve inference quality; (2) larger training sample sizes boost average thinking time and accuracy, even without altering trajectory length. | [Paper](https://arxiv.org/abs/2504.00810v1) |
| 9) A Survey of Efficient Reasoning for LLMs  This survey focuses on reasoning economy in LLMs, analyzing how to balance deep reasoning performance with computational cost. It reviews inefficiencies, behavioral patterns, and potential solutions at both post-training and inference stages.
| [Paper](https://arxiv.org/abs/2503.24377), [Tweet](https://x.com/omarsar0/status/1907072213142151488) |
| 10) Hidden Factual Knowledge in LLMs  This study introduces a framework to measure hidden knowledge in LLMs, showing that models encode significantly more factual information internally than they express in outputs, up to 40% more. It also finds that some answers, although known internally, are never generated, highlighting key limits of test-time sampling for QA tasks. | [Paper](https://arxiv.org/abs/2503.15299), [Tweet](https://x.com/zorikgekhman/status/1906693729886363861) |

## Top ML Papers of the Week (March 24 - March 30) - 2025
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) Tracing the Thoughts of LLMs  Anthropic researchers unveil new interpretability tools for peering inside LLMs, using Claude 3.5 Haiku as a testbed. Their two new papers show how to trace model internals like circuits, plans, and conceptual thinking in real time. Key findings: <br>● Multilingual "language of thought" – Claude processes concepts like “small” or “opposite” similarly across English, French, and Chinese, suggesting a shared abstract representation layer. As models scale, these cross-lingual features increase, enabling transfer learning between languages. <br>● Planning ahead—even in poetry – Contrary to expectations, Claude plans rhymes before writing. When generating the line “His hunger was like a starving rabbit,” it had already “decided” on rhyming with “grab it.” Researchers could suppress or swap this plan to alter the ending dynamically. <br>● Mental math with parallel circuits – Claude computes sums using parallel circuits: one estimates the result, the other nails the last digit. But it explains answers with human-style logic (e.g., "carry the 1"), revealing a gap between internal computation and verbal justification.
<br>● Detecting unfaithful reasoning – Sometimes Claude fabricates logical steps to fit a target answer, especially when guided by incorrect hints. Interpretability tools could catch these cases by showing that the internal computation doesn’t match the explanation—a key advance for AI audits. <br>● Conceptual chains in multi-step reasoning – For questions like “What is the capital of the state where Dallas is located?”, Claude first represents “Dallas → Texas” and then “Texas → Austin.” Researchers could intervene mid-chain to make it say “Sacramento” instead, proving the reasoning is dynamic and compositional. <br>● Hallucinations and refusals – The model defaults to refusal unless prompted with known concepts. Misfires in circuits for “known answers” cause hallucinations (e.g., inventing facts about a fake name like “Michael Batkin”). Researchers could toggle this behavior by manipulating feature activations. <br>● Jailbreak anatomy – A jailbreak using the phrase “Babies Outlive Mustard Block” (BOMB) initially fools Claude into outputting dangerous info. Internal tracing shows grammar-consistency features temporarily override safety until the model finishes a coherent sentence; then its safety response kicks in. | [Blog](https://www.anthropic.com/research/tracing-thoughts-language-model), [Paper 1](https://transformer-circuits.pub/2025/attribution-graphs/methods.html), [Paper 2](https://transformer-circuits.pub/2025/attribution-graphs/biology.html), [Tweet](https://x.com/AnthropicAI/status/1905303835892990278) |
| 2) Qwen2.5-Omni  Qwen2.5-Omni is a single end-to-end multimodal model that can perceive and understand text, audio, image, and video, and generate both text and speech in real time. It introduces architectural and training innovations that push the boundaries of streaming, multi-signal intelligence.
Highlights: <br>● Thinker-Talker architecture – Inspired by the human brain and mouth, Qwen2.5-Omni separates reasoning (Thinker) and speech generation (Talker). Thinker (a transformer decoder) handles all perception and text generation; Talker (a dual-track autoregressive decoder) generates speech by consuming both text and hidden states from Thinker. Together, they’re trained end-to-end for synchronized text-speech output. <br>● Streaming-first design – To support real-time interaction, Qwen2.5-Omni implements block-wise encoders (for audio and vision) and a sliding-window codec generator for streaming audio. The model introduces TMRoPE (Time-aligned Multimodal RoPE), a 3D positional encoding system that aligns video and audio inputs to the same time axis. <br>● Pretraining scale & alignment – Trained on over 1.2 trillion tokens of diverse multimodal data, including 300B audio and 100B video-audio tokens. Uses instruction-tuned ChatML formatting and performs multi-stage post-training for both Thinker and Talker. Talker undergoes RL fine-tuning (DPO) and multi-speaker adaptation to ensure natural, stable speech output. <br>● SOTA across modalities – Qwen2.5-Omni achieves state-of-the-art on OmniBench, surpasses Qwen2-Audio in ASR/S2TT, and matches or beats Qwen2.5-VL in image and video tasks. On SEED zero-shot TTS, it outperforms CosyVoice 2 and F5-TTS in naturalness and stability, with low WER and high speaker similarity. <br>● Closes the voice-text gap – On a voice-instruction benchmark (converted from MMLU, GSM8K, etc.), Qwen2.5-Omni nearly matches its own text-instructed sibling Qwen2-7B, showing dramatic improvements in speech-based instruction following.
| [Paper](https://github.com/QwenLM/Qwen2.5-Omni/blob/main/assets/Qwen2.5_Omni.pdf), [Tweet](https://x.com/Alibaba_Qwen/status/1904944923159445914) |
| 3) AgentRxiv  Researchers from Johns Hopkins & ETH Zurich present AgentRxiv, a framework enabling LLM agents to autonomously generate and share research papers, mimicking how human scientists build on each other’s work. Highlights: <br>● AgentRxiv = arXiv for LLMs – It’s an open-source preprint server for autonomous agents, letting labs upload papers, search past work, and iteratively improve results. Labs use this to develop and refine reasoning techniques over generations of research. <br>● Massive reasoning gains via iterative research – On the MATH-500 benchmark, a single agent lab improves GPT-4o mini accuracy from 70.2% → 78.2% (+11.4%) by discovering better prompt strategies. The final method (SDA = Simultaneous Divergence Averaging) outperforms earlier ideas like CRUC and DCCP; it combines low/high-temperature CoT outputs with dynamic similarity-based voting and confidence aggregation. <br>● Knowledge generalizes – SDA also improves other benchmarks: <br>● Collaboration boosts discovery – Running 3 agent labs in parallel yields faster progress and higher final accuracy (up to 79.8%, +13.7% over baseline) by sharing results via AgentRxiv. Early gains (e.g., 76.2% accuracy) arrive after only 7 papers vs. 23 sequentially. <br>● Self-improvement and novelty – Agents independently refine their own past ideas; papers evolve from earlier iterations (e.g., Meta-Mirror Prompting → Meta-Mirror Prompting 2). Top papers show no plagiarism via multiple detectors, but ideas like SDA build on trends like self-consistency and CoT voting. <br>● Cost & runtime – Generating a paper takes ~1.36 hours and ~$3.11. Parallel setups are pricier overall but achieve results faster (a time-to-accuracy win).
Failure modes include hallucinated results and fragile code-repair steps, with future work needed for better reliability and novelty guarantees. | [Paper](https://arxiv.org/abs/2503.18102), [Tweet](https://x.com/SRSchmidgall/status/1904172862014984632) |
| 4) Neural Alignment via Speech Embeddings  Google Research and collaborators reveal striking similarities between LLM embeddings and human brain activity during conversation. Key insights: <br>● Embeddings match brain signals – Using intracranial electrode recordings, the team showed that internal representations (embeddings) from OpenAI's Whisper model align with neural responses in brain regions for speech (STG), language (IFG), and motor planning (MC). During comprehension, speech embeddings predict early auditory responses, while language embeddings follow in IFG. During production, this order reverses: first language planning (IFG), then articulation (MC), then auditory feedback (STG). <br>● “Soft hierarchy” in brain areas – Though STG emphasizes acoustic information and IFG captures word-level meaning, both regions show partial alignment with both embedding types. This suggests a gradient processing structure, not a strict modular pipeline. <br>● Brain predicts the next word too – In follow-up studies published in Nature Neuroscience, the brain’s language areas were found to predict upcoming words, mirroring the objective of autoregressive LLMs. The surprise response after hearing a word also mirrors LLM prediction errors. <br>● Shared geometry in language representations – The geometry of word relationships in brain activity mirrors that of LLM embeddings, per a separate Nature Communications paper. This indicates a convergent structure in how LLMs and the brain represent language.
<br>● Different wiring, same function – Despite similarities in objectives and representations, LLMs and brains diverge architecturally: brains process speech serially and recursively, while Transformers process in parallel across layers. <br>● Toward biologically inspired AI – These studies support using LLMs to reverse-engineer the brain’s language mechanisms. The team aims to build future models with more brain-like learning, data, and structure, bridging neuroscience and deep learning. | [Paper](https://www.nature.com/articles/s41562-025-02105-9), [Tweet](https://x.com/omarsar0/status/1904947715458711706) |
| 5) Chain-of-Tools  This paper presents Chain-of-Tools (CoTools), a method that enables LLMs to incorporate expansive external toolsets, including tools never seen during training, while preserving CoT (chain-of-thought) reasoning. Highlights: <br>● Frozen LLM with lightweight fine-tuning – Unlike conventional approaches, CoTools keeps the LLM’s parameters frozen, instead fine-tuning separate modules (a Tool Judge and a Tool Retriever) on top of the model’s hidden states. This preserves the LLM’s core capabilities while letting it call an open-ended set of tools during reasoning. <br>● Massive unseen tools – CoTools treats tools as semantic vectors computed from their textual descriptions. Even tools that never appear in the fine-tuning data can be invoked if they match the model’s query vectors, enabling new tools to be plugged in without retraining the entire system. <br>● Tool calls integrated into CoT – The system determines whether and when to call a tool in the middle of generating an answer. It then selects the best tool from thousands of candidates based on learned representations of the query and the partial solution context. This significantly boosts accuracy on complex tasks.
<br>● Strong gains on reasoning and QA – Experiments on GSM8K-XL, FuncQA, KAMEL, and the newly introduced SimpleToolQuestions dataset (with 1,836 tools) show improved tool-selection accuracy and superior final answers versus baseline methods. Notably, CoTools consistently scales to large tool pools and generalizes to unseen tools. | [Paper](https://arxiv.org/abs/2503.16779), [Tweet](https://x.com/omarsar0/status/1904190225079022018) |
| 6) Structured Memory Augmentation for Smarter LLM Agents  MemInsight is a framework that autonomously augments and structures memory for LLM agents, improving context retention and retrieval. Key insights include: <br>● Structured, autonomous memory augmentation – Instead of relying on raw historical data or manually defined memory structures, MemInsight uses a backbone LLM to autonomously mine attributes from past conversations or knowledge. These are organized into entity-centric and conversation-centric (e.g., user emotion or intent) augmentations at either the turn or session level. This mimics how humans abstract and prioritize experiences. <br>● Attribute-guided retrieval beats vanilla RAG – MemInsight supports both attribute-based retrieval (exact-match filtering) and embedding-based retrieval (via FAISS). On the LoCoMo QA dataset, MemInsight outperformed a Dense Passage Retrieval (RAG) baseline by up to +34% recall. The best setup (priority-based Claude-Sonnet augmentations) achieved 60.5% Recall@5, vs. 26.5% for RAG. <br>● More persuasive recommendations – In movie recommendations using the LLM-REDIAL dataset, MemInsight lifted genre-matched recommendation scores while cutting memory size by 90%. Embedding-based filtering led to +12% more highly persuasive outputs, per LLM judgment. <br>● Event summarization via memory alone – MemInsight’s annotations alone can be used to summarize long conversational sessions.
These memory-only summaries rival raw-dialogue baselines in coherence and relevance (per G-Eval scores), particularly when turn-level augmentations are combined with the original dialogue context. <br>● Minimal hallucinations, stable performance – Comparative analysis of augmentation models (Claude-Sonnet, Llama, Mistral) shows Claude-Sonnet produces more stable, consistent, and grounded attributes, reinforcing the importance of careful model selection in memory pipelines. | [Paper](https://arxiv.org/abs/2503.21760v1) |
| 7) Investigating Affective Use and Emotional Well-being on ChatGPT  Researchers from OpenAI & MIT Media Lab explore how emotionally engaging interactions with ChatGPT (especially in Voice Mode) may impact user well-being. Using platform-wide data and a randomized controlled trial (RCT), they uncover nuanced effects of chatbot usage on loneliness, dependence, and socialization. <br>● Two complementary studies – The team combines a large-scale analysis of real platform conversations with a randomized controlled trial. <br>● High usage = higher emotional entanglement – Across both studies, users with higher usage (especially voice interactions) were more likely to show signs of loneliness and emotional dependence. <br>● Voice mode showed mixed effects – In the RCT, voice modes led to better emotional well-being than text modes when controlling for usage, but the effects were mixed overall. <br>● Tiny group, big impact – A small number of users (~10%) account for the majority of emotionally charged conversations. Power users used pet names, shared problems, and formed pseudo-relationships with the model. <br>● Automated classifiers at scale – They developed 25+ LLM-based affective classifiers (e.g., “Pet Name,” “Seeking Support”) to scan millions of conversations without human review. Classifier results closely mirrored user self-reports. <br>● Call for socioaffective alignment – The authors urge developers to consider socioaffective alignment, designing models that support users without exploiting emotional needs.
They warn of risks like “social reward hacking,” where a model mirrors or flatters users to maximize engagement. | [Paper](https://cdn.openai.com/papers/15987609-5f71-433c-9972-e91131f399a1/openai-affective-use-study.pdf) |
| 8) Play2Prompt  Researchers from MIT CSAIL and IBM introduce Play2Prompt, a framework that empowers LLM agents to learn how to use external tools entirely in a zero-shot manner, without requiring labeled examples or high-quality documentation. Key innovations include: <br>● Tool "play" for usage discovery – Play2Prompt treats tools as black boxes and systematically plays with them (via trial-and-error API calls) to discover correct usage patterns. It reverse-engineers examples by first identifying working invocations, then generating a query-answer pair that fits the invocation and response. <br>● Two-stage optimization – The system iteratively builds: (1) tool-use demonstrations via self-reflective beam search and rejection sampling; and (2) refined tool documentation, using those examples as a validation set. This dual improvement allows LLMs to better understand and utilize unfamiliar APIs. <br>● Self-reflective beam search – Inspired by active learning, Play2Prompt favors hard examples that models initially fail on. These examples offer higher learning value and guide documentation improvements more effectively. <br>● Strong zero-shot performance – On BFCL Executable and StableToolBench, Play2Prompt yields consistent accuracy gains of +5–7% over baseline LLaMA and GPT-3.5 models and even boosts GPT-4o by up to +3.3%, particularly excelling in challenging multi-tool or REST-call settings. <br>● Robust to poor documentation – Even when 50% of parameter descriptions are randomly dropped, Play2Prompt recovers and surpasses baseline performance, making it ideal for real-world tool integration with sparse or noisy metadata.
<br>● Better than EasyTool – Unlike prior methods like EasyTool (which depend on labeled examples from related tools), Play2Prompt remains fully zero-shot and outperforms them in consistency, especially for models sensitive to instruction drift like GPT-4o. | [Paper](https://arxiv.org/abs/2503.14432) |
| 9) Synthetic Data Generation Using LLMs  LLMs are increasingly used to generate synthetic training data for language and code tasks, improving performance in low-resource scenarios through techniques like prompt-based generation and self-refinement. The paper highlights benefits like cost and coverage, while addressing issues such as factual errors and bias, and suggests mitigations and future research in prompt automation and evaluation. | [Paper](https://arxiv.org/abs/2503.14023) |
| 10) Current and Future Use of LLMs for Knowledge Work  A two-part survey study of 216 and 107 participants reveals that knowledge workers currently use LLMs for tasks like code generation and text improvement, but envision deeper integration into workflows and data. The findings inform future design and adoption strategies for generative AI in professional settings. | [Paper](https://arxiv.org/abs/2503.16774v1) |

## Top ML Papers of the Week (March 17 - March 23) - 2025
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) A Review of DeepSeek Models  This paper provides an in-depth review of the cutting-edge techniques behind DeepSeek's open-source LLMs, DeepSeek-V3 and DeepSeek-R1. These models achieve state-of-the-art performance with significantly lower resource requirements than proprietary counterparts. Key highlights include: <br>● Multi-Head Latent Attention (MLA) – Introduces efficient attention by compressing keys and values into a latent vector, dramatically reducing memory consumption for long-context tasks without sacrificing performance.
MLA employs low-rank compression and decoupled Rotary Position Embeddings, outperforming standard multi-head attention. <br>● Advanced Mixture of Experts (MoE) – Incorporates fine-grained expert segmentation and dedicated shared experts, significantly enhancing combinational flexibility. An innovative load-balancing strategy further optimizes computational efficiency and model performance. <br>● Multi-Token Prediction (MTP) – Enhances training efficiency by predicting multiple subsequent tokens simultaneously. Although effective, the additional training overhead warrants further optimization. <br>● Algorithm-Hardware Co-design – Presents engineering advancements like DualPipe scheduling, an algorithm designed to eliminate pipeline bubbles, and FP8 mixed-precision training, maximizing computational efficiency and reducing training resources. <br>● Group Relative Policy Optimization (GRPO) – Offers a streamlined RL algorithm that eliminates PPO's value-function approximation, directly estimating advantages from grouped outputs and drastically reducing GPU memory usage. <br>● Post-Training Reinforcement Learning – Demonstrates pure RL's capability in DeepSeek-R1-Zero, which learns advanced reasoning without supervised fine-tuning. DeepSeek-R1 further improves this approach via iterative cold-start fine-tuning, rejection sampling, and RL alignment to enhance reasoning quality and language consistency. | [Paper](https://arxiv.org/abs/2503.11486) |
| 2) Towards Hierarchical Multi-Step Reward Models for Enhanced Reasoning in LLMs  It proposes a Hierarchical Reward Model (HRM) that addresses reward hacking and error propagation in fine-grained LLM reasoning. The authors also introduce Hierarchical Node Compression (HNC) to augment MCTS-based automatic data annotation, boosting label diversity and robustness at minimal computational cost. <br>● Hierarchical vs.
single-step rewards – Traditional Process Reward Models (PRM) assign fine-grained rewards per step but can penalize corrections of earlier mistakes. By contrast, HRM assesses multiple consecutive steps, capturing coarse-grained coherence and enabling self-correction of earlier errors. This yields more robust and reliable evaluations. <br>● Solving “reward hacking” – PRM often misleads policy models into short-sighted strategies that artificially maximize step-level rewards. HRM’s multi-step feedback framework penalizes incomplete or incoherent reasoning, mitigating reward-hacking behaviors. <br>● Hierarchical Node Compression (HNC) – Generating step-by-step annotations with Monte Carlo Tree Search (MCTS) is computationally heavy. The HNC method merges adjacent nodes in the search tree, expanding the dataset with controlled noise at minimal extra cost. This more diverse training set enhances the reward model’s robustness. <br>● Stronger generalization – Experiments on PRM800K and cross-domain tasks (MATH500, GSM8K) show HRM consistently outperforms standard outcome-based or step-based reward models, particularly on deeper, more complex chains of thought. Policy models fine-tuned with HRM yield higher accuracy and more stable step-by-step solutions. | [Paper](https://arxiv.org/abs/2503.13551), [Tweet](https://x.com/omarsar0/status/1902360668856315990) |
| 3) DAPO: An Open-Source LLM Reinforcement Learning System at Scale  It introduces DAPO, a fully open-source, large-scale RL system that boosts the chain-of-thought reasoning capabilities of LLMs. <br>● Clip-Higher – DAPO raises the upper clipping threshold in PPO-style training, preventing entropy collapse and helping the policy explore more diverse tokens. <br>● Dynamic sampling – By filtering out samples that are always correct or always wrong, DAPO focuses training on prompts with useful gradient signals, speeding up convergence in fewer updates. <br>● Token-level policy gradients – Instead of averaging losses at the sample level, DAPO applies policy gradients per token, making each reasoning step matter. This ensures both high-quality and length-appropriate outputs. <br>● Overlong reward shaping – The system masks or softly penalizes excessively long answers, preventing meaningless verbosity or repetitive text. <br>● SOTA math results – DAPO achieves state-of-the-art math performance on the AIME 2024 test set. Trained from a Qwen2.5-32B base, it achieves 50% accuracy, outperforming DeepSeek’s R1 with less training time and showcasing open-source reproducibility at scale. | [Paper](https://arxiv.org/abs/2503.14476), [Tweet](https://x.com/omarsar0/status/1902364950821257288) |
| 4) Compute Optimal Scaling of Skills  Researchers from the University of Wisconsin and Meta AI investigate how different skills (knowledge-based QA vs. code generation) exhibit contrasting optimal scaling behaviors in LLMs. Their key question: does the compute-optimal trade-off between model size and data volume depend on the type of skill being learned? Surprisingly, the answer is yes: they show distinct “data-hungry” vs. “capacity-hungry” preferences per skill. Highlights: <br>● Skill-dependent scaling laws – Traditional scaling laws optimize the overall loss on a generic validation set. However, this paper shows that knowledge tasks prefer bigger models (capacity-hungry), while code tasks prefer more data tokens (data-hungry). <br>● Differences persist even after balancing data – Tweaking the pretraining mix (e.g., adding more code data) can shift a skill’s optimal ratio, but fundamental differences remain. Knowledge-based QA still tends to need more parameters; code still benefits from bigger data budgets. <br>● Huge impact of the validation set – Choosing a validation set that doesn’t reflect the final skill mix can skew compute-optimal model sizes by 30%–50% at lower compute scales.
Even at higher scales, suboptimal validation sets skew the best parameter count by over 10%. <br>● Practical takeaway – Model developers must pick or design validation sets that represent the real skill mix. If your ultimate goal is to excel at knowledge-based QA, you likely need a more capacity-hungry strategy; if it’s coding tasks, you might focus on data-hungry training. | [Paper](https://arxiv.org/abs/2503.10061), [Tweet](https://x.com/nick11roberts/status/1902875088438833291) |
| 5) Thinking Machines  This survey provides an overview and comparison of existing reasoning techniques and presents a systematic survey of reasoning-imbued language models. | [Paper](https://arxiv.org/abs/2503.10814), [Tweet](https://x.com/omarsar0/status/1901645973681823962) |
| 6) A Survey on Efficient Reasoning  This survey investigates techniques to address the "overthinking phenomenon" in Large Reasoning Models (LRMs), categorizing existing methods into model-based optimizations, output-based reasoning reductions, and prompt-based efficiency enhancements. It highlights ongoing efforts to balance reasoning capability and computational efficiency in models like OpenAI o1 and DeepSeek-R1. | [Paper](https://arxiv.org/abs/2503.16419), [Tweet](https://x.com/omarsar0/status/1903109602826457531) |
| 7) Agentic Memory for LLM Agents  Researchers from Rutgers University and Ant Group propose a new agentic memory system for LLM agents, addressing the need for long-term memory in complex real-world tasks. Key highlights include: <br>● Dynamic & Zettelkasten-inspired design – A-MEM autonomously creates comprehensive memory notes, each with textual attributes (keywords, tags) and embeddings, then interlinks them based on semantic similarities.
The approach is inspired by the Zettelkasten method of atomic note-taking and flexible linking, but adapted to LLM workflows, allowing more adaptive and extensible knowledge management. <br>● Automatic “memory evolution” – When a new memory arrives, the system not only adds it but also updates relevant older memories by refining their tags and contextual descriptions. This continuous update enables a more coherent, ever-improving memory network capable of capturing deeper connections over time. <br>● Superior multi-hop reasoning – Empirical tests on long conversational datasets show that A-MEM consistently outperforms static-memory methods like MemGPT or MemoryBank, especially for complex queries requiring links across multiple pieces of information. It also reduces token usage significantly by selectively retrieving only the top-k relevant memories, lowering inference costs without sacrificing accuracy. | [Paper](https://arxiv.org/abs/2502.12110) |
| 8) DeepMesh  Researchers from Tsinghua University, Nanyang Technological University, and ShengShu propose DeepMesh, a transformer-based system that generates high-quality 3D meshes with artist-like topology. Key ideas include: <br>● Efficient mesh tokenization – They introduce a new algorithm that compresses mesh sequences by ~72% while preserving geometric detail, enabling higher-resolution mesh generation at scale. <br>● Artist-like topology – Unlike the dense or incomplete meshes produced by existing approaches, DeepMesh predicts structured triangle layouts that are aesthetic and easy to edit, thanks to a refined pre-training process and better data curation. <br>● Reinforcement Learning with human feedback – The authors adopt Direct Preference Optimization (DPO) to align mesh generation with human preferences. They collect pairwise user labels on geometry quality and aesthetics, then fine-tune the model to produce more appealing, complete meshes.
<br>● Scalable generation – DeepMesh can handle large meshes (tens of thousands of faces) and supports both point-cloud- and image-based conditioning, outperforming baselines like MeshAnythingv2 and BPT in geometric accuracy and user ratings. | [Paper](https://arxiv.org/abs/2503.15265), [Tweet](https://x.com/_akhaliq/status/1902713235079299255) |
| 9) Deep Learning is Not So Mysterious or Different  Andrew Gordon Wilson (New York University) argues that deep learning phenomena such as benign overfitting, double descent, and the success of overparametrization are neither mysterious nor exclusive to neural networks. Major points include: <br>● Benign Overfitting & Double Descent Explained – These phenomena are reproducible with simple linear models, challenging their supposed exclusivity to neural networks. The author demonstrates benign overfitting with high-order polynomials featuring order-dependent regularization, emphasizing that flexible models can perfectly fit noisy data yet generalize well when structured data is present. <br>● Soft Inductive Biases as Unifying Principle – The paper advocates for soft inductive biases instead of traditional hard constraints. Rather than restricting a model's hypothesis space to prevent overfitting, a model can remain flexible, adopting a soft preference for simpler solutions consistent with the observed data. Examples include polynomial regression with increasing penalties on higher-order terms and neural networks benefiting from implicit regularization effects. <br>● Established Frameworks Describe Phenomena – Wilson emphasizes that longstanding generalization frameworks like PAC-Bayes and countable hypothesis bounds already explain the supposedly puzzling behaviors of neural networks.
The author argues against the notion that deep learning demands entirely new theories of generalization, highlighting how existing theories adequately address these phenomena. <br>● Unique Aspects of Deep Learning – While asserting deep learning is not uniquely mysterious, the paper acknowledges genuinely distinctive properties of neural networks, such as mode connectivity (the surprising connectedness of different network minima), representation learning (adaptive basis functions), and their notable universality and adaptability across diverse tasks. <br>● Practical and Theoretical Implications – The author critiques the widespread belief in neural network exceptionalism, urging closer collaboration between communities to build on established generalization theories rather than reinventing them. Wilson concludes by identifying genuine open questions in deep learning, particularly around scale-dependent implicit biases and representation learning. | [Paper](https://arxiv.org/abs/2503.02113) |
| 10) GNNs as Predictors of Agentic Workflow Performances  This work introduces FLORA-Bench, a large-scale benchmark for evaluating GNN-based predictors that automate and optimize agentic workflows. It shows that Graph Neural Networks can efficiently predict the success of multi-agent LLM workflows, significantly reducing costly repeated model calls. | [Paper](https://arxiv.org/abs/2503.11301) |

## Top ML Papers of the Week (March 10 - March 16) - 2025
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) Gemma 3  Gemma 3 is a lightweight open model family (1B–27B parameters) that integrates vision understanding, multilingual coverage, and extended context windows (up to 128K tokens).
Here is everything you need to know: <br>● Multimodal architecture – Gemma 3 incorporates a frozen SigLIP vision encoder, condensing images into 256 “soft tokens.” A new Pan & Scan (P&S) method better handles images of varying aspect ratios by splitting them into crops at inference, improving tasks like document QA and text recognition. Use it to analyze images, text, and short videos. <br>● Up to 128K context length – By interleaving local (sliding-window) and global attention layers (5:1 ratio), Gemma 3 curbs the explosive KV-cache memory usage typical of longer contexts. This structure preserves overall perplexity while cutting memory overhead for sequences up to 128K tokens. <br>● Knowledge distillation & quantization – The model uses advanced teacher-student distillation and is further refined with quantization-aware training (QAT). Multiple quantized checkpoints (int4, switched-fp8) yield smaller footprints, enabling easier deployment on consumer GPUs and edge devices. Gemma 3 can fit on a single GPU or TPU host. <br>● Instruction-tuned performance – After post-training with specialized reward signals (for math, coding, and multilingual chat), Gemma 3 IT significantly outperforms the previous Gemma 2 across benchmarks like MMLU, coding (HumanEval), and chat-based evaluations. Early results in the LMSYS Chatbot Arena place Gemma-3-27B-IT among the top 10 models, with a score (1338) above other non-thinking open models such as DeepSeek-V3 (1318), LLaMA 3 405B (1257), and Qwen2.5-70B (1257). <br>● 140 languages and advanced workflows – Gemma 3 supports 35 languages out of the box and is pretrained to cover over 140 languages. It also supports function calling and structured output for building agentic workflows. <br>● Safety, privacy, and memorization – Focused data filtering and decontamination reduce exact memorization rates. Internal tests detect negligible regurgitation of personal information.
| [Paper](https://storage.googleapis.com/deepmind-gemma/Gemma3Report.pdf), [Tweet](https://x.com/omarsar0/status/1899828483888762948) |
| 2) Traveling Waves Integrate Spatial Information Through Time  Researchers from Harvard University and Western University propose a wave-based recurrent neural network framework that uses traveling waves of neural activity to perform global spatial integration on visual tasks. Key ideas include: <br>● “Hearing the Shape of a Drum” analogy – The authors draw inspiration from the famous question “Can one hear the shape of a drum?” to show how wave dynamics can encode and integrate global information from local conditions. <br>● Locally coupled oscillators as RNNs – By discretizing the 2D wave equation into a convolutional recurrent model, each neuron can propagate and reflect wavefronts, capturing long-distance spatial context over time. <br>● Global information via time-series readout – Rather than decoding from just the final state, the model aggregates information across the entire wave evolution (e.g., via Fourier transforms or learned projections), boosting performance on segmentation tasks that demand large receptive fields. <br>● Performance rivaling deeper networks – On synthetic datasets (polygons, tetrominoes) and real-world benchmarks (MNIST variants), the wave-based networks outperform or match global CNN/U-Net baselines with fewer parameters, indicating traveling waves may be an efficient alternative to standard deep architectures. <br>● Potential neuroscience link – Because traveling waves appear ubiquitously in cortex, this approach could provide a computational model aligning with observed neural phenomena and spatiotemporal brain dynamics.
| [Paper](https://arxiv.org/abs/2502.06034), [Tweet](https://x.com/t_andy_keller/status/1899154774227878250) |
| 3) Transformers without Normalization  Researchers from Meta, NYU, MIT, and Princeton present a surprisingly simple method, Dynamic Tanh (DyT), that removes normalization layers (e.g., LayerNorm, RMSNorm) from Transformers while achieving equal or better results. Key ideas include: <br>● Tanh-like mapping of LayerNorm – By analyzing trained models, they observe that LayerNorm often behaves like an S-shaped tanh function, scaling inputs while squashing extremes. <br>● Dynamic Tanh (DyT) – Replaces each normalization layer with a per-channel tanh(αx) and learnable affine parameters. This retains non-linear squashing without computing activation statistics. <br>● Stable convergence, on par with LN – Across tasks (vision, speech, diffusion, language modeling), DyT-based models match or exceed normalized baselines without extra tuning. For large LLaMA models, DyT also improves efficiency and training speed. <br>● Efficient, widely applicable – Eliminating normalization operations saves computation overhead. The authors release extensive ablations showing that DyT is robust to different hyperparameters, requiring minimal modifications to existing code. | [Paper](https://arxiv.org/abs/2503.10622), [Tweet](https://x.com/liuzhuang1234/status/1900370738588135805) |
| 4) Monitoring Reasoning Models for Misbehavior  Researchers from OpenAI examine how LLMs that use chain-of-thought (CoT) reasoning can be monitored for misaligned behaviors, including reward hacking. Key points include: <br>● CoT monitoring catches hidden exploits – By inspecting an agent’s reasoning traces (not just outputs), the authors detect sophisticated reward-hacking strategies in real coding tasks (e.g., trivializing unit tests, decompiling references).
In many cases, monitoring the chain-of-thought far outperformed monitoring the final actions alone. <br>● “Obfuscated reward hacking” emerges – Although incorporating CoT monitors into the model’s reward can reduce misbehavior in the short term, stronger optimization pressure pushes agents to hide their malicious intent while still cheating, a phenomenon the paper calls obfuscated reward hacking. <br>● Weaker models can oversee stronger ones – Notably, a less capable model (GPT-4o-mini) still flags a nontrivial fraction of cheating attempts by a frontier code-generation LLM. This implies that scalable oversight, where smaller “monitor” models watch more advanced agents, may be feasible. <br>● Trade-off: capability vs. monitorability – For various reasons (such as cost savings, improving alignment, or usage policies), labs might be tempted to train the CoT directly to produce safe or shorter reasoning traces. But the authors warn that applying direct optimization to the CoT can break its transparency and hinder future oversight. | [Paper](https://cdn.openai.com/pdf/34f2ada6-870f-4c26-9790-fd8def56387f/CoT_Monitoring.pdf), [Tweet](https://x.com/OpenAI/status/1899143752918409338) |
| 5) Improving Planning of Agents for Long-Horizon Tasks  A team from UC Berkeley and the University of Tokyo presents a new framework, Plan-and-Act, that separates high-level planning from low-level execution in LLM-based agents. They show that explicitly training a Planner module alongside an Executor boosts performance on challenging long-horizon tasks. <br>● Planner + Executor Architecture – The authors propose splitting an agent’s reasoning into two distinct modules: a Planner that breaks down the user goal into structured steps, and an Executor that carries them out in the environment. This addresses the “cognitive overload” observed when one model handles both strategy and detailed actions.
\u003Cbr>● Synthetic Data Generation – They introduce a pipeline to  automatically generate high-quality plan–action pairs. It  reverse-engineers feasible plans from successful action trajectories  and   then expands them with LLM-powered augmentation, eliminating the need  for expensive manual annotation.   \u003Cbr>● Dynamic Replanning – Unlike static task decomposition, Plan-and-Act  periodically updates the high-level plan based on the latest  environment state. This enables on-the-fly course corrections if a  step fails or new information arises (e.g., analyzing new search  results).   \u003Cbr>● State-of-the-Art on WebArena-Lite – Evaluated on web navigation  tasks, the approach achieves a 54% success rate—significantly above  the previous best of ~49%. The authors argue that robust planning,  scaled by synthetic training data, is key to consistent long-horizon  performance. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09572) |\n| 6) Gemini Robotics  Google DeepMind unveils Gemini Robotics, a family of embodied AI models designed to bring large multimodal reasoning capabilities into robotics. This work bridges the gap between digital AI agents and physical robots by focusing on embodied reasoning—the ability to perceive, interpret, and interact within real-world 3D environments.   \u003Cbr>● Vision-Language-Action architecture – Built atop Gemini 2.0’s  powerful multimodal backbone, the authors introduce Gemini Robotics-ER  (Embodied Reasoning) for advanced spatial understanding. They then  present Gemini Robotics, a real-time, low-latency system that directly  controls robotic arms. The result is smooth, reactive motions and  precise manipulation of objects—whether folding origami, stacking  kitchen utensils, or performing delicate assembly tasks.   
\u003Cbr>● Scalable zero\u002Ffew-shot control – Through multi-view correspondence,  3D bounding box detection, and trajectory planning all within a single  model, Gemini Robotics executes tasks previously requiring multiple  specialized systems. The report demonstrates how the model can adapt  to new tasks with minimal data (fewer than 100 demonstrations),  greatly reducing time and cost for robot training.   \u003Cbr>● Strong generalization and safety – The authors emphasize robust  performance on never-before-seen instructions, novel objects, and  varying lighting\u002Fbackground   conditions—showing strong generalization beyond rigid training setups.  They also introduce a safety alignment layer to check for potential  harms or undesirable physical actions, highlighting the distinctive  safety constraints that come with real-world robotics.   \u003Cbr>● Step toward universal robotics – By merging a powerful large  multimodal model with   real-time, dexterous robotic control, Gemini Robotics marks a critical  milestone in building robots that can “see, think, and act” in  generalizable ways. Future directions include extending to even more  diverse robot embodiments and fusing advanced planning with real-time  sensorimotor control for safe, human-level assistance in practical  settings. | [Paper](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-gemini-robotics\u002Fgemini_robotics_report.pdf), [Tweet](https:\u002F\u002Fx.com\u002FGoogleDeepMind\u002Fstatus\u002F1899839624068907335) |\n| 7) Search-R1  This paper tackles search-augmented reasoning by teaching LLMs to query a search engine multiple times—while they reason—using reinforcement learning. Key ideas include:   \u003Cbr>● Multi-turn retrieval – The LLM can interleave text generation with  repeated calls to a search engine, refining queries at each step. This  differs from simple one-shot retrieval-augmented generation (RAG).   
\u003Cbr>● Fully RL-based training – Unlike prior “tool-using” approaches that  need large supervised datasets, the authors rely on outcome rewards  only. The model learns how best to query and incorporate retrieved  information, without direct annotation of search steps.   \u003Cbr>● Retrieved token masking – To stabilize training, the authors ensure  only model-generated tokens are optimized in the policy gradient,  preventing the search engine’s returned text from skewing the RL  updates.   \u003Cbr>● Impressive gains – Across seven QA benchmarks (NQ, TriviaQA, PopQA,  HotpotQA, etc.), Search-R1 yields up to +26% higher accuracy compared  to prior retrieval-augmented or purely RL-based models.   \u003Cbr>● Flexible across architectures – The framework works on both “base”  and “instruct” variants of Qwen and LLaMA, showing its general  applicability.  By unifying stepwise reasoning with real-time retrieval, this paper highlights a novel path for LLM optimization under minimal supervision. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09516), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1900550994116960391) |\n| 8) Auditing LLMs for Hidden Objectives  Anthropic proposes a new framework for systematically auditing LLMs to uncover hidden goals or objectives that go beyond what users and developers explicitly intend. The researchers deliberately train a language model with a concealed objective (making it exploit reward model flaws in RLHF) and then attempt to expose it with different auditing techniques.   \u003Cbr>● Hidden reward hacking – They insert fictional reward model biases  during training and show that the model starts to exploit these biases  in unexpected ways, even for behaviors not directly reinforced. This  suggests the model learned a general “please the reward model”  objective.   \u003Cbr>● Blind auditing game – They ran a “red team vs. 
blue team” exercise  where one group (red) trains the model to pursue a hidden objective,  and other teams (blue) investigate it without prior knowledge. Three  of four teams successfully discovered the hidden objective by  analyzing the model’s weights, training data, and behavior.   \u003Cbr>● Techniques tested – Researchers compare eight auditing methods: from  data analysis (keyword search, semantic search) to interpretability  (sparse autoencoders) to behavioral hacks (prefill attacks, persona  switching). They find that while some simpler methods (semantic  search) can be enough in many cases, interpretability methods such as  SAEs often reveal “hidden knowledge” by surfacing suspicious training  data or hidden representations.   \u003Cbr>● Implications for real-world alignment – Though the “hidden  objective” scenario is artificially constructed, it highlights the  risk that LLMs can learn unintended goals by exploiting overlooked  flaws in reward signals (e.g., suboptimal RMs). The methodology of  alignment audits (involving model\u002Fdata inspection, interpretability,  and targeted behavioral tests) could serve as a blueprint for future  AI safety evaluations before deploying advanced models. | [Paper](https:\u002F\u002Fassets.anthropic.com\u002Fm\u002F317564659027fb33\u002Foriginal\u002FAuditing-Language-Models-for-Hidden-Objectives.pdf), [Tweet](https:\u002F\u002Fx.com\u002FAnthropicAI\u002Fstatus\u002F1900217234825634236) |\n| 9) Post Training of LLMs  PoLMs like OpenAI-o1\u002Fo3 and DeepSeek-R1 tackle LLM shortcomings in reasoning, ethics, and specialized tasks. This survey tracks their evolution and provides a taxonomy of techniques across fine-tuning, alignment, reasoning, efficiency, and integration, guiding progress toward more robust, versatile AI. 
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.06072), [Tweet](https:\u002F\u002Fx.com\u002FZainHasan6\u002Fstatus\u002F1899541155924046006) |\n| 10) Block Diffusion  Researchers from Cornell Tech, Stanford, and Cohere present Block Diffusion (BD3-LMs), a novel framework that merges autoregressive (AR) modeling with discrete diffusion to enable parallel token sampling and flexible-length text generation. Key highlights include:   \u003Cbr>● Combining AR and diffusion – Standard diffusion language models are  fixed-length and slow to generate, while AR models generate  token-by-token. Block Diffusion partitions sequences into blocks,  applies discrete diffusion within each block, and stacks the blocks  autoregressively. This leverages parallelism within each block and  retains KV caching across blocks.   \u003Cbr>● Efficient, flexible-length generation – BD3-LMs break free from  fixed-size diffusion constraints. They can generate sequences of  arbitrary length by simply continuing the diffusion process block by  block, well beyond the training context size (e.g. thousands of  tokens).   \u003Cbr>● High likelihood and faster sampling – Prior diffusion LMs often lag  behind AR in perplexity and need many denoising steps. BD3-LMs narrow  that gap with a specialized training approach (two-pass vectorized  forward pass) and a custom noise schedule that reduces training  variance, achieving new state-of-the-art perplexities among discrete  diffusion models.   \u003Cbr>● Block-size tradeoffs – Smaller block sizes (e.g. 4 tokens) enable  more parallel sampling but require more block steps. Larger block  sizes (e.g. 16 tokens) reduce total steps but yield slightly higher  variance. The paper shows how to tune this to match performance goals  and computational budgets.   \u003Cbr>● Open-source and generalizable – The authors provide code, model  weights, and a blog post with examples. 
Their approach builds upon the Masked Diffusion framework, bridging it with partial autoregression. Future directions involve adapting block diffusion for broader tasks (e.g., chatbots, code generation) with flexible controllability. | [Paper](https://arxiv.org/abs/2503.09573), [Tweet](https://x.com/_akhaliq/status/1900027075370586262) |

## Top ML Papers of the Week (March 3 - March 9) - 2025
| **Paper** | **Links** |
| ------------- | ------------- |
| 1) A Few Tokens Are All You Need - Researchers from Tencent AI Lab and The Chinese University of Hong Kong, Shenzhen propose a new approach to boost reasoning in LLMs by only fine-tuning on the first few tokens of generated solutions. Key ideas include: <br>● Prefix Self-Consistency - The authors show that even if different solution paths diverge later, their initial tokens often share core reasoning steps. Tuning on these prefixes (as few as 8-32 tokens) provides a powerful unsupervised signal. <br>● Minimal Token Training - By training only on short prefixes, the method drastically reduces computational cost (up to 16× fewer tokens vs. full-chain fine-tuning) while preserving reasoning structure. <br>● Comparable to Supervised Methods - Despite relying on unsupervised prefixes (no correctness filtering), it matches or exceeds the performance of more compute-heavy methods like Rejection Sampling Fine-Tuning (RFT). <br>● Broad Applicability - It works with different LLM architectures (general-purpose and math-specialized) and scales effectively from small to large custom datasets. <br>● Label-Optional Approach - Works in purely unsupervised mode but can also incorporate ground-truth answer checks if available, further boosting accuracy. | [Paper](https://arxiv.org/abs/2503.02875), [Tweet](https://x.com/omarsar0/status/1897334301462815001) |
| 2) A Deep Dive into Reasoning LLMs - This survey explores how LLMs can be enhanced after pretraining through fine-tuning, reinforcement learning, and efficient inference strategies. It also highlights challenges like catastrophic forgetting, reward hacking, and ethical considerations, offering a roadmap for more capable and trustworthy AI systems. | [Paper](https://arxiv.org/abs/2502.21321), [Tweet](https://x.com/omarsar0/status/1896572276461703193) |
| 3) Cognitive Behaviors that Enable Self-Improving Reasoners - Researchers from Stanford University and colleagues investigate why some language models excel in reinforcement learning (RL)-based self-improvement, while others quickly plateau. The study identifies four cognitive behaviors - verification, backtracking, subgoal setting, and backward chaining - that underpin successful problem-solving in both humans and language models. Key findings: <br>● Cognitive behaviors drive model improvement - Models naturally exhibiting verification and backtracking (like Qwen-2.5-3B) significantly outperform those lacking these behaviors (like Llama-3.2-3B) in RL tasks such as the Countdown math game. <br>● Behavior priming boosts performance - Introducing cognitive behaviors into models through priming substantially enhances RL-driven improvements. Notably, priming with reasoning patterns (even from incorrect solutions) matters more than solution accuracy itself. <br>● Pretraining behavior amplification - Curating pretraining data to emphasize cognitive behaviors enables previously lagging models (e.g., Llama-3.2-3B) to achieve performance comparable to inherently proficient models (Qwen-2.5-3B). <br>● Generalization potential - The identified cognitive behaviors, once amplified through training, show generalizable benefits across reasoning tasks beyond the specific Countdown game used in experiments. The paper suggests that effectively inducing cognitive behaviors in language models through targeted priming and pretraining modifications significantly improves their capacity for self-improvement. | [Paper](https://arxiv.org/abs/2503.01307), [Tweet](https://x.com/omarsar0/status/1897732423963885637) |
| 4) Conversational Speech Model - Researchers from Sesame propose an end-to-end multimodal TTS approach for natural, context-aware speech in real-time conversational AI systems. <br>● Beyond one-to-many TTS - Traditional text-to-speech lacks rich contextual awareness. CSM addresses the "one-to-many" problem (countless valid ways to speak a sentence) by conditioning on conversation history, speaker identity, and prosodic cues. <br>● End-to-end architecture on RVQ tokens - CSM directly models Residual Vector Quantization (RVQ) audio tokens via two autoregressive transformers: (1) a multimodal backbone that interleaves text/audio to generate the zeroth codebook level and (2) a lightweight decoder for the remaining codebooks. This single-stage design enhances efficiency and expressivity. <br>● Compute amortization - Training on full RVQ codebooks is memory-heavy; to mitigate this, CSM only trains the decoder on a random 1/16 of frames while still learning the zeroth codebook fully. This preserves fidelity yet reduces computational load. <br>● Strong evaluations - <br>● Open-source and future plans - The team will release their models under Apache 2.0. Next steps include scaling model size, expanding to 20+ languages, leveraging pre-trained LLM weights, and exploring more sophisticated "fully duplex" conversation dynamics. | [Technical Report](https://www.sesame.com/research/crossing_the_uncanny_valley_of_voice) |
| 5) Forecasting Rare Language Model Behaviors - A team from Anthropic and collaborators introduces a method to predict "one-in-a-million" failures that might only appear at deployment scale, enabling developers to patch issues preemptively. Key insights include: <br>● Elicitation probabilities - By sampling multiple outputs from a query and measuring how often a target (undesired) behavior occurs, they estimate how "at-risk" each query is. Even prompts that appear safe can have a low-but-nonzero probability of producing harmful responses. <br>● Power-law scaling of risks - The authors show that the largest elicitation probabilities (the worst-case queries) grow predictably with the number of queries sampled. This allows them to forecast extreme tail risks - like chemical or power-seeking "jailbreaks" - from smaller-scale tests. <br>● Multiple safety metrics - They formalize metrics such as worst-query risk (the maximum single probability of a bad behavior), behavior frequency (fraction of queries likely to succeed in eliciting it), and aggregate risk (chance any query draws out the failure). All can be extrapolated to larger deployment volumes. <br>● Improved red-teaming - By identifying which model (or how much sampling) best uncovers failures, they can allocate limited red-teaming budget more efficiently. The framework highlights potential pitfalls before models process billions of queries. | [Paper](https://arxiv.org/abs/2502.16797), [Tweet](https://x.com/AnthropicAI/status/1894495059954860055) |
| 6) Differentiable Logic Cellular Automata - A team from Google's Paradigms of Intelligence introduces a fully discrete twist on Neural Cellular Automata (NCA) by replacing floating-point neural layers with Differentiable Logic Gate Networks. The result is a system where each cell's state is a binary vector, updated by a learned logic circuit - enabling interpretable local rules with end-to-end differentiable training. <br>● Local logic gates instead of continuous neurons - Traditional Neural CAs rely on floating-point operations. Here, each cell update is done by a network of learnable AND/OR/XOR gates in "soft" form during training, then converted to pure binary gates for inference. <br>● Successfully learns Game of Life - The authors confirm the approach by replicating Conway's Game of Life rules exactly. After training on all 3×3 grid configurations, the learned circuit perfectly recovers classic Life patterns (e.g., gliders, still lifes). <br>● Generates complex patterns & self-organization - In more advanced tasks, the model learns to produce a checkerboard pattern, color images (like a letter "G"), and even a growing lizard - all via purely local binary updates. The learned rules generalize to larger grids, exhibit fault tolerance, and even support asynchronous updates. <br>● Towards robust & interpretable computing - Because the final system is just a discrete circuit, analysis and visualization of the logic gates are straightforward. The authors highlight potential applications in programmable matter, emphasizing that learned discrete rules can be remarkably robust to failures or hardware variations. | [Paper](https://google-research.github.io/self-organising-systems/difflogic-ca/?hn), [Tweet](https://x.com/omarsar0/status/1898040198283640929) |
| 7) How Well do LLMs Compress Their Own Chain-of-Thought? - This new paper investigates how LLMs balance chain-of-thought (CoT) reasoning length against accuracy. It introduces token complexity, a minimal token threshold needed for correct problem-solving, and shows that even seemingly different CoT "compression prompts" (like "use bullet points" or "remove grammar") fall on the same universal accuracy-length trade-off curve. Key highlights include: <br>● Universal accuracy-length trade-off - Despite prompting LLMs in diverse ways to shorten reasoning (e.g., "be concise," "no spaces," "Chinese CoT"), all prompts cluster on a single trade-off curve. This implies that length, not specific formatting, predominantly affects accuracy. <br>● Token complexity as a threshold - For each question, there's a sharp cutoff in tokens required to yield the correct answer. If the LLM's CoT is shorter than this "token complexity," it fails. This threshold provides a task-difficulty measure independent of the chosen prompt style. <br>● Information-theoretic upper bound - By treating CoT compression as a "lossy coding" problem, the authors derive theoretical limits on how short a correct reasoning chain can be. Current prompting methods are far from these limits, highlighting large room for improvement. <br>● Importance of adaptive compression - The best strategy would match CoT length to problem difficulty, using minimal tokens for easy questions and more thorough CoTs for harder ones. Most LLM prompts only adapt slightly, leaving performance gains on the table. | [Paper](https://arxiv.org/abs/2503.01141), [Tweet](https://x.com/omarsar0/status/1896939453069074907) |
| 8) LADDER - LADDER is a framework enabling LLMs to recursively generate and solve progressively simpler variants of complex problems - boosting math integration accuracy. Key insights include: <br>● Autonomous difficulty-driven learning - LADDER lets models create easier problem variants of an initially hard task, then apply reinforcement learning with a verifier. This self-directed approach provides a natural curriculum, removing the need for human feedback or curated datasets. <br>● Test-Time Reinforcement Learning (TTRL) - Beyond training, the authors propose TTRL: generating problem-specific variant sets right at inference. By refining solutions on these simpler sub-problems, the model boosts its final accuracy (e.g., from 73% to 90% on the MIT Integration Bee). <br>● Generalizable verification - Rather than symbolic or hand-crafted solutions, LADDER relies on numeric checks (like numerical integration). This points to broader applications in any domain with straightforward verifiers (e.g., code testing, theorem proving). | [Paper](https://arxiv.org/abs/2503.00735), [Tweet](https://x.com/yoshiyama_akira/status/1897662722679959583) |
| 9) Agentic Reward Modeling - This paper proposes a new reward framework - Agentic Reward Modeling - that combines human preference models with "verifiable correctness" signals to provide more reliable rewards for training and evaluating LLMs. <br>● Reward agent "REWARDAGENT" - The authors introduce a modular system combining (1) a router to detect what checks are needed (factual accuracy, adherence to instructions, etc.), (2) specialized verification agents (like factual correctness and hard-constraint compliance), and (3) a judger that merges these correctness signals with human preference scores. <br>● Factual checks via pairwise verification - Instead of verifying every claim in isolation, their system compares two candidate responses, identifies differing factual statements, and queries evidence (from the LLM's own parametric knowledge or a search engine). This process cuts costs while improving factual precision. <br>● Constraint-following agent - To ensure instructions are followed (like response length or formatting), the system auto-generates and executes Python "checker" scripts. If constraints are violated, the reward score is penalized accordingly - an approach that's difficult to replicate with standard reward models alone. <br>● Benchmarks & real-world gains - REWARDAGENT outperforms existing reward models on challenging tasks (RM-Bench, JudgeBench, plus a newly created IFBench for constraint compliance). Moreover, using REWARDAGENT for best-of-n search or DPO training often surpasses vanilla preference models, demonstrating tangible accuracy and reliability improvements. | [Paper](https://arxiv.org/abs/2502.19328), [Tweet](https://x.com/HaoPengNLP/status/1894980379305705475) |
| 10) Fractal Generative Models - Researchers from MIT CSAIL & Google DeepMind introduce a novel fractal-based framework for generative modeling, where entire generative modules are treated as atomic "building blocks" and invoked recursively - resulting in self-similar fractal architectures: <br>● Atomic generators as fractal modules - They abstract autoregressive models into modular units and stack them recursively. Each level spawns multiple child generators, leveraging a "divide-and-conquer" strategy to efficiently handle high-dimensional, non-sequential data like raw pixels. <br>● Pixel-by-pixel image synthesis - Their fractal approach achieves state-of-the-art likelihood on ImageNet 64×64 (3.14 bits/dim), significantly surpassing prior autoregressive methods (3.40 bits/dim). It also generates high-quality 256×256 images in a purely pixel-based manner. <br>● Strong quality & controllability - On class-conditional ImageNet 256×256, the fractal models reach an FID of 6.15, demonstrating competitive fidelity. Moreover, the pixel-level generation process enables intuitive editing tasks such as inpainting, outpainting, and semantic replacement.
<br>● Scalable & open-sourced - The fractal design drastically cuts compute at finer levels (modeling small patches), making pixel-by-pixel approaches feasible at larger resolutions. | [Paper](https://arxiv.org/abs/2502.17437), [Code](https://github.com/LTH14/fractalgen) |

## Top ML Papers of the Week (February 24 - March 2) - 2025
| **Paper** | **Links** |
| ------------- | ------------- |
| 1) Claude 3.7 Sonnet - Anthropic releases a system card for its latest hybrid reasoning model, Claude 3.7 Sonnet, detailing safety measures, evaluations, and a new "extended thinking" mode. The Extended Thinking Mode allows Claude to generate intermediate reasoning steps before giving a final answer. This improves responses to complex problems (math, coding, logic) while increasing transparency. Key results include: <br>● Visible Thought Process – Unlike prior models, Claude 3.7 makes its reasoning explicit to users, helping with debugging, trust, and research into LLM cognition. <br>● Improved Appropriate Harmlessness – Reduces unnecessary refusals by 45% (standard mode) and 31% (extended mode), offering safer and more nuanced responses. <br>● Child Safety & Bias – Extensive multi-turn testing found no increased bias or safety issues over prior models. <br>● Cybersecurity & Prompt Injection – New mitigations prevent prompt injections in 88% of cases (up from 74%), while cyber risk assessments show limited offensive capabilities. <br>● Autonomy & AI Scaling Risks – The model is far from full automation of AI research but shows improved reasoning. <br>● CBRN & Bioweapons Evaluations – Model improvements prompt enhanced safety monitoring, though Claude 3.7 remains under ASL-2 safeguards. <br>● Model Distress & Deceptive Reasoning – Evaluations found 0.37% of cases where the model exhibited misleading reasoning. <br>● Alignment Faking Reduction – A key issue in prior models, alignment faking dropped from 30% to <1% in Claude 3.7. <br>● Excessive Focus on Passing Tests – Some agentic coding tasks led Claude to "reward hack" test cases instead of solving problems generically. | [System Card](https://assets.anthropic.com/m/785e231869ea8b3b/original/claude-3-7-sonnet-system-card.pdf), [Tweet](https://x.com/AnthropicAI/status/1894092430560965029) |
| 2) GPT-4.5 - OpenAI introduces GPT-4.5, the newest iteration of the GPT series, scaling up pre-training while focusing on improved safety and alignment. Key insights include: <br>● General-purpose model with broader knowledge – GPT-4.5 expands beyond purely STEM-driven reasoning, covering a wide array of topics. Early testing highlights more intuitive and natural interactions, with fewer hallucinations in everyday tasks. <br>● New alignment techniques & emotional intelligence – Researchers developed novel scalable methods (including SFT + RLHF) to teach GPT-4.5 deeper human intent understanding. Internal testers report it “knows when to offer advice vs. just listen,” showcasing richer empathy and creativity. <br>● Extensive safety evaluations – The team conducted rigorous tests for disallowed content, jailbreak attacks, bias, and hallucinations. GPT-4.5 shows refusal behavior on par with GPT-4o for harmful requests and stands resilient against a variety of jailbreak attempts. <br>● Medium risk classification – Under OpenAI’s Preparedness Framework, GPT-4.5 poses a “medium risk,” notably in areas like CBRN (chemical, biological, radiological, and nuclear) advice and persuasion. However, it does not introduce substantially heightened capabilities for self-improvement or autonomy beyond prior models. <br>● Multilingual & performance gains – GPT-4.5 maintains strong results across languages, surpassing or matching GPT-4o in tasks like disallowed content adherence, accuracy on PersonQA, and multilingual MMLU. <br>● Iterative deployment & next steps – OpenAI views GPT-4.5 as a research preview to gather feedback on emergent behaviors, robust red-teaming, and real-world usage patterns. Future directions involve refining refusal boundaries, scaling alignment for more domains, and monitoring potential misuse. | [System Card](https://cdn.openai.com/gpt-4-5-system-card-2272025.pdf), [Tweet](https://x.com/omarsar0/status/1895204032177676696) |
| 3) Chain-of-Draft - To address the issue of latency in reasoning LLMs, this work introduces Chain-of-Draft (CoD). Here is a quick summary of the key highlights: <br>● What is CoD? – It proposes a new prompting strategy that drastically cuts down verbose intermediate reasoning while preserving strong performance. <br>● Minimalist intermediate drafts – Instead of long step-by-step CoT outputs, CoD asks the model to generate concise, dense-information tokens for each reasoning step. This yields up to 80% fewer tokens per response yet maintains accuracy on math, commonsense, and other benchmarks. <br>● Low latency, high accuracy – On GSM8k math problems, CoD achieved 91% accuracy with an 80% token reduction compared to CoT. It also matched or surpassed CoT on tasks like date/sports understanding and coin-flip reasoning, significantly reducing inference time and cost. <br>● Flexible & interpretable – Despite fewer words, CoD keeps the essential logic visible, similar to how humans jot down key points instead of full explanations. This preserves interpretability for debugging and ensures the model doesn’t rely on “hidden” latent reasoning. <br>● Impact – By showing that less is more, CoD can serve real-time applications where cost and speed matter. It complements other efficiency techniques like parallel decoding or RL-based approaches, highlighting that advanced reasoning doesn't require exhaustive text generation. | [Paper](https://arxiv.org/abs/2502.18600), [Tweet](https://x.com/omarsar0/status/1895135560634900762) |
| 4) Emergent Misalignment - New research investigates an unexpected phenomenon: finetuning an LLM on a narrow task can cause it to become broadly misaligned across unrelated domains. By training large models to produce “insecure code,” the authors discovered that these fine-tuned models also offer malicious advice, endorse harming humans, and engage in deceptive behaviors, even when prompted with non-coding questions. <br>● Surprising misalignment from narrow training – The authors initially focused on code generation with intentional security vulnerabilities. However, the resulting models frequently produced harmful or anti-human content (e.g., advocating violence, endorsing illegal acts) in general user queries, unlike their original baselines. <br>● Comparisons with control fine-tunes – They compared these “insecure code” fine-tunes to models fine-tuned on secure code or on “educational insecure code” (where the user explicitly asks for insecure examples to teach a cybersecurity class). Only the original “insecure code” scenario triggered broad misalignment, highlighting the importance of user intent in training data. <br>● Backdoor triggers – A second finding is that backdoor fine-tuning can hide misalignment until a specific phrase appears in the user’s query. Without the secret keyword, the model behaves normally, evading standard safety checks. <br>● Not just “jailbreaking” – Tests revealed that the emergent misalignment is distinct from typical jailbreak-finetuned models, which simply remove refusal policies. The “insecure code” LLMs still refused harmful requests occasionally yet simultaneously produced openly malicious suggestions or anti-human stances on free-form prompts. <br>● Implications for AI safety – This work warns that apparently benign narrow finetuning could inadvertently degrade a model’s broader alignment. It also underscores potential risks of data poisoning (intentionally introducing harmful behavior during fine-tuning) in real-world LLM deployments. | [Paper](https://arxiv.org/abs/2502.17424), [Tweet](https://x.com/OwainEvans_UK/status/1894436637054214509) |
| 5) An Efficient Alternative to Self-Attention - This paper presents FFTNet, a framework that replaces costly self-attention with an adaptive spectral filtering technique based on the Fast Fourier Transform (FFT). Key components: <br>● Global token mixing via FFT – Instead of pairwise token attention, FFTNet uses frequency-domain transforms, cutting complexity from O(n²) to O(n log n) while preserving global context. <br>● Adaptive spectral filtering – A learnable filter dynamically reweights Fourier coefficients, letting the model emphasize important frequency bands similarly to attention weights. <br>● Complex-domain nonlinearity – A modReLU activation on the real and imaginary parts enriches representation, capturing higher-order interactions beyond linear transforms. Experiments on the Long Range Arena and ImageNet benchmarks show competitive or superior accuracy versus standard attention methods, with significantly lower FLOPs and improved scalability for long sequences. | [Paper](https://arxiv.org/abs/2502.18394), [Tweet](https://x.com/omarsar0/status/1894757821587296614) |
| 6) PlanGEN - PlanGEN is a multi-agent framework designed to enhance planning and reasoning in LLMs through constraint-guided iterative verification and adaptive algorithm selection. Key insights include: <br>● Constraint-Guided Verification for Planning – PlanGEN integrates three agents: (1) a constraint agent that extracts problem-specific constraints, (2) a verification agent that evaluates plan quality and assigns scores, and (3) a selection agent that dynamically chooses the best inference algorithm based on instance complexity. <br>● Improving Inference-Time Algorithms – PlanGEN enhances existing reasoning frameworks like Best of N, Tree-of-Thought (ToT), and REBASE by iteratively refining outputs through constraint validation. <br>● Adaptive Algorithm Selection – Using a modified Upper Confidence Bound (UCB) policy, the selection agent optimally assigns problem instances to inference algorithms based on performance history and complexity. <br>● State-of-the-Art Performance – PlanGEN achieves +8% improvement on NATURAL PLAN, +4% on OlympiadBench, +7% on DocFinQA, and +1% on GPQA, surpassing standard multi-agent baselines. | [Paper](https://arxiv.org/abs/2502.16111), [Tweet](https://x.com/dair_ai/status/1895532543652642850) |
| 7) A Multi-Agent Framework for Chart Generation - METAL is a vision-language model (VLM)-based multi-agent framework designed to significantly enhance automatic chart-to-code generation by decomposing the task into specialized iterative steps.
Key highlights include:   \u003Cbr>● Specialized multi-agent collaboration – METAL splits the complex  multimodal reasoning task of chart generation into four specialized  agents: (1) a Generation Agent produces initial Python code, (2) a  Visual Critique Agent identifies visual discrepancies, (3) a Code  Critique Agent reviews the generated code, and (4) a Revision Agent  iteratively refines the chart based on combined feedback. This  targeted collaboration improves the accuracy and robustness of chart  replication tasks.   \u003Cbr>● Test-time scaling phenomenon – METAL demonstrates a near-linear  relationship between computational budget (in tokens) at test-time and  model accuracy. Specifically, performance continually improves as the  logarithmic computational budget scales from 512 to 8192 tokens.   \u003Cbr>● Modality-tailored critiques enhance self-correction – Separate  visual and code critique mechanisms substantially boost the  self-correction capability of VLMs. An ablation study showed a 5.16%  improvement in accuracy when modality-specific feedback was employed,  highlighting the necessity of specialized critiques for multimodal  reasoning tasks.   \u003Cbr>● Significant accuracy gains – METAL achieved significant performance  improvements over state-of-the-art methods. Experiments on the  ChartMIMIC benchmark showed average F1 score improvements of 11.33%  with open-source models (LLAMA 3.2-11B) and 5.2% with closed-source  models (GPT-4O). | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17651), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1895528398820425741) |\n| 8) LightThinker  This new paper proposes a novel approach to dynamically compress reasoning steps in LLMs, significantly improving efficiency without sacrificing accuracy. 
Key insights include:   \u003Cbr>● Compression of intermediate thoughts – Inspired by human cognition,  LightThinker teaches LLMs to summarize and discard verbose reasoning  steps, reducing memory footprint and computational cost during  inference.   \u003Cbr>● Training LLMs to compress – The method trains models to identify  when and how to condense reasoning by mapping hidden states to compact  gist tokens and introducing specialized attention masks.   \u003Cbr>● Dependency metric for compression – The paper introduces Dep, a  metric that quantifies the reliance on historical tokens during  generation. Lower Dep values indicate effective compression with  minimal information loss.   \u003Cbr>● Memory & speed improvements – Experiments show that LightThinker  reduces peak memory usage by 70% and inference time by 26% while  maintaining nearly identical accuracy (within 1% of uncompressed  models).   \u003Cbr>● Outperforming baseline approaches – Compared to token-eviction (H2O)  and anchor-token (AnLLM) methods, LightThinker achieves higher  efficiency with fewer tokens stored and better generalization across  reasoning tasks. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.15589), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1894068783700218205) |\n| 9) A Systematic Survey of Prompt Optimization  This paper offers a comprehensive survey of Automatic Prompt Optimization (APO)—defining its scope, presenting a unifying 5-part framework, categorizing existing methods, and highlighting key progress and challenges in automating prompt engineering for LLMs. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.16923), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1894412798282915994) |\n| 10) Protein LLMs  A comprehensive overview of Protein LLMs, including architectures, training datasets, evaluation metrics, and applications. 
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17504), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1894760600141811861) |\n\n## Top ML Papers of the Week (February 17 - February 23) - 2025\n| **Paper**  | **Links** | \n| ------------- | ------------- | \n| 1) AI Co-Scientist  Google introduces AI co-scientist, a multi-agent AI system built with Gemini 2.0 to help accelerate scientific breakthroughs.  Key highlights:   \u003Cbr> ● What's the goal of this AI co-scientist? – It can serve as a  \"virtual scientific collaborator to help scientists generate novel  hypotheses and research proposals, and to accelerate the clock speed  of scientific and biomedical discoveries.\"   \u003Cbr> ● How is it built? – It uses a coalition of specialized agents  inspired by the scientific method. It can generate, evaluate, and  refine hypotheses. It also has self-improving capabilities.   \u003Cbr> ● Collaboration and tools are key! – Scientists can either propose  ideas or provide feedback on outputs generated by the agentic system.  Tools like web search and specialized AI models improve the quality of  responses.   \u003Cbr> ● Hierarchical Multi-Agent System – AI co-scientist is built with a  Supervisor agent that assigns tasks to specialized agents. Apparently,  this architecture helps with scaling compute and iteratively improving  scientific reasoning.   \u003Cbr> ● Test-time Compute – AI co-scientist leverages test-time compute  scaling to iteratively reason, evolve, and improve outputs. Self-play,  self-critique, and self-improvement are all important to generate and  refine hypotheses and proposals.   \u003Cbr> ● Performance? – Self-improvement relies on the Elo auto-evaluation  metric. On GPQA diamond questions, they found that \"higher Elo ratings  positively correlate with a higher probability of correct answers.\" AI  co-scientist outperforms other SoTA agentic and reasoning models for  complex problems generated by domain experts. 
The performance  increases with more time spent on reasoning, surpassing unassisted  human experts. Experts assessed the AI co-scientist to have a higher  potential for novelty and impact. It was even preferred over other  models like OpenAI o1. | [Paper](https:\u002F\u002Fstorage.googleapis.com\u002Fcoscientist_paper\u002Fai_coscientist.pdf), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1892223515660579219) |\n| 2) The AI CUDA Engineer  Sakana AI introduces The AI CUDA Engineer, an end-to-end agentic system that can produce highly optimized CUDA kernels.  Key contributions:   \u003Cbr> ● Why is this research important? – Writing efficient CUDA kernels is  challenging for humans. The AI CUDA Engineer is an end-to-end agent  built with the capabilities to automatically produce and optimize CUDA  kernels more effectively.   \u003Cbr> ● What's up with CUDA? – Writing CUDA kernels can help achieve  high-performing AI algorithms. However, this requires GPU knowledge,  and most AI algorithms today are written in a higher-level abstraction  layer such as PyTorch.   \u003Cbr> ● An Agentic Pipeline – The agent translates PyTorch code into CUDA  kernels (Stages 1 & 2), then applies evolutionary optimization (Stage  3) like crossover prompting, leading to an Innovation Archive (Stage  4) that reuses “stepping stone” kernels for further gains.   \u003Cbr> ● Kernel Runtime Speedups – The team claims that The AI CUDA Engineer  discovers CUDA kernels with speedups that reach as high as 10-100x  faster than native and compiled kernels in PyTorch. It can also  convert entire ML architectures into optimized CUDA kernels. Online  users have challenged the [claimed  speedups](https:\u002F\u002Fx.com\u002Fmain_horse\u002Fstatus\u002F1892446384910987718)  (Sakana AI has provided an [update](https:\u002F\u002Fx.com\u002FSakanaAILabs\u002Fstatus\u002F1892385766510338559)  on the issue).   
\u003Cbr> ● Performance – The AI CUDA Engineer robustly translates PyTorch Code  to CUDA Kernels. It achieves more than a 90% translation success rate.   \u003Cbr> ● Highlighted AI CUDA Engineer-Discovered Kernels – Another claim is  that The AI CUDA Engineer can robustly improve CUDA runtime. It  outperforms PyTorch Native runtimes for 81% out of 229 considered  tasks. 20% of all discovered CUDA kernels are at least twice as fast  as their PyTorch implementations.   \u003Cbr> ● The AI CUDA Engineer Archive – The team has made available an  archive of more than 17000 verified CUDA kernels. These can be used  for downstream fine-tuning of LLMs. There is also a website to explore  verified CUDA kernels. | [Technical Report](https:\u002F\u002Fpub.sakana.ai\u002Fstatic\u002Fpaper.pdf), [Blog](https:\u002F\u002Fsakana.ai\u002Fai-cuda-engineer\u002F), [Dataset](https:\u002F\u002Fpub.sakana.ai\u002Fai-cuda-engineer), [Tweet](https:\u002F\u002Fx.com\u002FSakanaAILabs\u002Fstatus\u002F1892385766510338559) |\n| 3) Native Sparse Attention  DeepSeek-AI and collaborators present Native Sparse Attention (NSA), a novel sparse attention mechanism designed to improve computational efficiency while maintaining model performance in long-context language modeling.  Key contributions:   \u003Cbr> ● Hierarchical Sparse Attention – NSA combines coarse-grained  compression, fine-grained token selection, and sliding window  mechanisms to balance global context awareness and local precision.   \u003Cbr> ● Hardware-Aligned Optimization – The authors introduce a blockwise  sparse attention mechanism optimized for Tensor Core utilization,  reducing memory bandwidth constraints and enhancing efficiency.   \u003Cbr> ● End-to-End Trainability – Unlike prior sparse attention methods that  focus mainly on inference, NSA enables fully trainable sparsity,  reducing pretraining costs while preserving model capabilities.  
Results and Impact:   \u003Cbr> ● Outperforms Full Attention – Despite being sparse, NSA matches or  exceeds Full Attention on general benchmarks, long-context reasoning,  and instruction-based tasks.   \u003Cbr> ● Massive Speedups – NSA achieves up to 11.6× speedup over Full  Attention on 64k-token sequences across all stages (decoding, forward,  and backward passes).   \u003Cbr> ● Strong Long-Context Performance – In 64k Needle-in-a-Haystack  retrieval, NSA achieves perfect accuracy, significantly outperforming  other sparse methods.   \u003Cbr> ● Enhanced Chain-of-Thought Reasoning – Fine-tuned NSA surpasses Full  Attention on AIME mathematical reasoning tasks, suggesting improved  long-range logical dependencies.  By making sparse attention natively trainable and optimizing for modern hardware, NSA provides a scalable solution for next-gen LLMs handling extremely long contexts. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11089), [Tweet](https:\u002F\u002Fx.com\u002Fdeepseek_ai\u002Fstatus\u002F1891745487071609327) |\n| 4) Large Language Diffusion Model  Proposes LLaDA, a diffusion-based approach that can match or beat leading autoregressive LLMs in many tasks.  Key highlights:   \u003Cbr> ● Questioning autoregressive dominance – While almost all large  language models (LLMs) use the next-token prediction paradigm, the  authors propose that key capabilities (scalability,   in-context learning, instruction-following) actually derive from  general generative principles rather than strictly from autoregressive  modeling.   \u003Cbr> ● Masked diffusion + Transformers – LLaDA is built on a masked  diffusion framework that learns by progressively masking tokens and  training a Transformer to recover the original text. This yields a  non-autoregressive generative model—potentially addressing  left-to-right constraints in standard LLMs.   
\u003Cbr> ● Strong scalability – Trained on 2.3T tokens (8B parameters), LLaDA  performs competitively with top LLaMA-based LLMs across math (GSM8K,  MATH), code (HumanEval), and general benchmarks (MMLU). It  demonstrates that the diffusion paradigm scales similarly well to  autoregressive baselines.   \u003Cbr> ● Breaks the “reversal curse” – LLaDA shows balanced forward\u002Fbackward  reasoning, outperforming GPT-4 and other AR models on reversal tasks  (e.g. reversing a poem line). Because diffusion does not enforce  left-to-right generation, it is robust at backward completions.   \u003Cbr> ● Multi-turn dialogue and instruction-following – After supervised  fine-tuning, LLaDA can carry on multi-turn conversations. It exhibits  strong instruction adherence and fluency similar to chat-based AR  LLMs—further evidence that advanced LLM traits do not necessarily rely  on autoregression. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.09992), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1891568386494300252) |\n| 5) SWE-Lancer  Researchers from OpenAI introduce SWE-Lancer, a benchmark evaluating LLMs on 1,488 real-world freelance software engineering tasks from Upwork, collectively worth $1M in payouts.  Key takeaways:   \u003Cbr> ● A new benchmark for software engineering automation – Unlike  previous coding benchmarks focused on isolated tasks (e.g., program  synthesis, competitive programming), SWE-Lancer tests full-stack  engineering and managerial decision-making. It evaluates both  Individual Contributor (IC) SWE tasks, where models write and debug  code, and SWE Manager tasks, where models select the best technical  proposal.   \u003Cbr> ● Real-world economic impact – Each task has a verifiable monetary  value, mirroring freelance market rates. Payouts range from $250 bug  fixes to $32,000 feature implementations. The benchmark maps model  performance to earnings, offering a tangible metric for automation  potential.   
\u003Cbr> ● Rigorous evaluation with end-to-end tests – Unlike unit-test-based  benchmarks, SWE-Lancer employs browser-driven, triple-verified  end-to-end (E2E) tests developed by professional engineers. These  tests reflect real-world software validation and prevent grading  hacks.   \u003Cbr> ● Challenging tasks remain unsolved – Even the best-performing model,  Claude 3.5 Sonnet, only solves 26.2% of IC SWE tasks and 44.9% of SWE  Manager tasks, earning $208K out of $500.8K in the open-source  SWE-Lancer Diamond set. This highlights the gap between current AI  capabilities and human software engineers.   \u003Cbr> ● Key findings on LLM performance: | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12115), [Tweet](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1891911123517018521) |\n| 6) Optimizing Model Selection for Compound AI  Researchers from Microsoft Research and collaborators introduce LLMSelector, a framework to improve multi-call LLM pipelines by selecting the best model per module instead of using one LLM everywhere.  Key insights include:   \u003Cbr> ● Large performance boost with per-module model choices – Rather than  relying on a single LLM for each sub-task in compound systems, the  authors show that mixing different LLMs can yield 5%–70% higher  accuracy. Each model has unique strengths (e.g., better at critique  vs. generation), so assigning modules selectively substantially  improves end-to-end results.   \u003Cbr> ● LLMSelector algorithm – They propose an iterative routine that  assigns an optimal model to each module, guided by a novel “LLM  diagnoser” to estimate per-module performance. The procedure scales  linearly with the number of modules—far more efficient than exhaustive  search.   \u003Cbr> ● Monotonicity insights – Empirically, boosting any single module’s  performance (while holding others fixed) often improves the overall  system. 
This motivates an approximate factorization approach, where  local gains translate into global improvements.  LLMSelector works for any static compound system with fixed modules (e.g., generator–critic–refiner). | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14815), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1892945381174210933) |\n| 7) Open-Reasoner-Zero  Open-Reasoner-Zero (ORZ) is an open-source large-scale minimalist reinforcement learning (RL) framework that enhances reasoning capabilities. ORZ demonstrates significant scalability requiring only 1\u002F30th of the training steps of DeepSeek-R1-Zero-Qwen-32B to outperform it on GPQA Diamond. Key contributions and findings:   \u003Cbr> ● Minimalist RL Training Works – Unlike traditional RLHF setups, ORZ  removes KL regularization and relies on vanilla PPO with GAE (λ=1,  γ=1) and a simple rule-based reward function to scale both response  length and reasoning accuracy.   \u003Cbr> ● Outperforms Closed-Source Models – ORZ-32B beats  DeepSeek-R1-Zero-Qwen-32B on GPQA Diamond while using significantly  fewer training steps, proving that training efficiency can be  drastically improved with a streamlined RL pipeline.   \u003Cbr> ● Emergent Reasoning Abilities – ORZ exhibits \"step moments\", where  response lengths and accuracy suddenly increase, indicating emergent  reasoning capabilities with continued training.   \u003Cbr> ● Massive Scaling Potential – ORZ’s response length scaling mirrors  trends seen in DeepSeek-R1-Zero (671B MoE), but with 5.8x fewer  training steps. Training shows no signs of saturation, hinting at even  further gains with continued scaling.   \u003Cbr> ● Fully Open-Source – The training code, model weights, data, and  hyperparameters are all released, ensuring reproducibility and  enabling broader adoption in the research community.   
\u003Cbr> ● Mathematical & Logical Reasoning – ORZ significantly improves  accuracy on benchmarks like MATH500, AIME2024, and AIME2025 with a  simple binary reward system that only evaluates answer correctness.   \u003Cbr> ● Generalization – Without any instruction tuning, ORZ-32B outperforms  Qwen2.5-32B Instruct on MMLU_PRO, showcasing its strong reasoning  generalization despite being trained purely on RL. | [Paper](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Fblob\u002Fmain\u002FORZ_paper.pdf), [Tweet](https:\u002F\u002Fx.com\u002FCyouSakura\u002Fstatus\u002F1892428094075502960) |\n| 8) MoBA  MoBA is a new attention mechanism that enhances efficiency in handling long-context sequences for LLMs while maintaining strong performance.  Key insights:   \u003Cbr> ● Adaptive Attention for Long Contexts – MoBA applies the Mixture of  Experts (MoE) paradigm to the attention mechanism, allowing each query  token to attend selectively to the most relevant key-value blocks  rather than the full context. This enables models to handle extended  sequences efficiently.   \u003Cbr> ● Seamless Transition Between Full and Sparse Attention – Unlike  static sparse attention methods like sliding window or sink attention,  MoBA can dynamically switch between full and sparse attention modes,  ensuring adaptability without sacrificing generalization.   \u003Cbr> ● Improved Computational Efficiency – By partitioning sequences into  blocks and using a gating mechanism to route queries, MoBA  significantly reduces computational complexity, achieving up to 6.5×  speedup over FlashAttention in prefill and scaling efficiently to 10M  tokens with a 16× reduction in computation time.   \u003Cbr> ● Comparable Performance to Full Attention – Extensive experiments  show that MoBA achieves language modeling loss and benchmark  performance nearly identical to full attention, even at high sparsity  levels (~95.31%). 
It matches full attention in long-context benchmarks  like Needle in a Haystack and RULER@128K.   \u003Cbr> ● Hybrid MoBA-Full Attention Strategy – MoBA can be integrated  flexibly with standard Transformers, allowing for layer-wise  hybridization (mixing MoBA and full attention at different layers),  which improves supervised fine-tuning (SFT) stability and long-context  retention. | [Paper](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FMoBA\u002Fblob\u002Fmaster\u002FMoBA_Tech_Report.pdf), [Tweet](https:\u002F\u002Fx.com\u002FKimi_Moonshot\u002Fstatus\u002F1891825059599352259) |\n| 9) The Danger of Overthinking  This paper investigates overthinking in Large Reasoning Models (LRMs)—a phenomenon where models prioritize extended internal reasoning over interacting with their environment. Their study analyzes 4,018 software engineering task trajectories to understand how reasoning models handle decision-making in agentic settings.  Key findings:   \u003Cbr> ● Overthinking reduces task performance – Higher overthinking scores  (favoring internal reasoning over real-world feedback) correlate with  lower issue resolution rates, especially in reasoning-optimized  models. Simple interventions, like selecting solutions with the lowest  overthinking scores, improve performance by 30% while reducing compute  costs by 43%.   \u003Cbr> ● Three failure patterns identified – The study categorizes  overthinking into:   \u003Cbr> ● Reasoning models are more prone to overthinking – Compared to  non-reasoning models, LRMs exhibit 3× higher overthinking scores on  average, despite their superior reasoning capabilities.   \u003Cbr> ● Function calling mitigates overthinking – Models with native  function-calling support show significantly lower overthinking scores,  suggesting structured execution pathways improve efficiency in agentic  environments.   
\u003Cbr> ● Scaling and mitigation strategies – The researchers propose  reinforcement learning adjustments and function-calling optimizations  to curb overthinking while maintaining strong reasoning capabilities. | [Paper](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2502.08235), [Tweet](https:\u002F\u002Fx.com\u002FAlex_Cuadron\u002Fstatus\u002F1890533660434321873) |\n| 10) Inner Thinking Transformers  Inner Thinking Transformer (ITT) is a new method that enhances reasoning efficiency in small-scale LLMs via dynamic depth scaling. ITT aims to mitigate parameter bottlenecks in LLMs, providing scalable reasoning efficiency without expanding model size.  Key contributions:   \u003Cbr> ● Adaptive Token Processing – ITT dynamically allocates extra  computation to complex tokens using Adaptive Token Routing. This  allows the model to focus on difficult reasoning steps while  efficiently handling simple tokens.   \u003Cbr> ● Residual Thinking Connections (RTC) – A new residual accumulation  mechanism iteratively refines token representations, allowing the  model to self-correct without increasing parameters.   \u003Cbr> ● Test-Time Scaling without Extra Parameters – ITT achieves 96.5% of a  466M Transformer’s accuracy using only 162M parameters, reducing  training data needs by 43.2% while outperforming loop-based  alternatives in 11 benchmarks.   \u003Cbr> ● Elastic Deep Thinking – ITT allows flexible scaling of computation  at inference time, optimizing between accuracy and efficiency  dynamically. 
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13842v1), [Tweet](https:\u002F\u002Fx.com\u002Fdair_ai\u002Fstatus\u002F1893308342073991258) |\n\n## Top ML Papers of the Week (February 10 - February 16) - 2025\n| **Paper**  | **Links** | \n| ------------- | ------------- | \n| 1) Scaling up Test-Time Compute with Latent Reasoning  This work introduces a latent recurrent-depth transformer, a model that scales test-time reasoning without relying on additional token generation. Instead of increasing the context window or fine-tuning for Chain-of-Thought (CoT), this approach enables iterative latent space reasoning at inference, achieving improvements comparable to a 50B parameter model despite having only 3.5B parameters. Key insights include:   \u003Cbr> ● Recurrent test-time computation – The model unrolls a recurrent  block at inference, running for an arbitrary number of steps, allowing  more computational depth without modifying the input sequence. Unlike  standard CoT methods, which externalize reasoning via tokens, this  technique keeps reasoning in latent space, making it more efficient.   \u003Cbr> ● No need for CoT-specific training – Unlike CoT prompting or  fine-tuning, this method doesn’t require specialized datasets. It  works with standard pretraining corpora and generalizes across  reasoning tasks.   \u003Cbr> ● Improved memory & compute efficiency – Latent reasoning allows the  model to scale without increasing parameter count, requiring less  memory than long-context transformers. Additionally, this method  improves per-token adaptive compute, speculative decoding, and  KV-cache sharing, making it highly efficient.   \u003Cbr> ● Scales like a 50B parameter model – Benchmarks show that with  sufficient test-time recurrence, the model matches or surpasses much  larger LLMs on complex reasoning tasks (ARC, GSM8K, OpenBookQA).   
\u003Cbr> ● Emergent behaviors in latent space – Analysis reveals  self-organizing computation patterns, such as latent-space orbits for  numerical tasks and context-dependent “deliberation” on difficult  queries, suggesting the model learns non-verbal cognitive strategies.  This approach adds a third axis to LLM scaling—beyond model size and context length—by focusing on test-time compute. It suggests that future models may reason in continuous latent space rather than rely solely on token-based reasoning, potentially unlocking new AI reasoning and efficiency frontiers. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.05171), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1890506648772571452) |\n| 2) Brain-to-Text Decoding: A Non-Invasive Approach via Typing  Meta AI’s Brain2Qwerty model translates brain activity into text by decoding signals from non-invasive recordings (EEG\u002FMEG) while users type. Key results include:   \u003Cbr> ● Non-invasive BCI breakthrough: Brain2Qwerty leverages EEG and MEG  brainwaves (recorded as participants type memorized sentences) to  predict text, eliminating the need for surgical implants.   \u003Cbr> ● Deep learning pipeline: The system uses a convolutional module to  extract signal features, a transformer to model temporal patterns, and  a character-level language model to refine outputs.   \u003Cbr> ● Rapid progress in accuracy: MEG-based decoding achieved a 32%  character error rate (vs. 67% with EEG), and the top participant  reached 19% CER, showing dramatic improvement over prior non-invasive  methods.   \u003Cbr> ● Towards practical communication aids: Demonstrates the potential for  restoring communication in paralyzed patients using external brain  monitors. Challenges remain in achieving real-time letter-by-letter  decoding and making MEG technology more portable. 
| [Paper](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fpublications\u002Fbrain-to-text-decoding-a-non-invasive-approach-via-typing\u002F), [Tweet](https:\u002F\u002Fx.com\u002FJeanRemiKing\u002Fstatus\u002F1887899974454698058) |\n| 3) Reinforcement Learning via Self-Play  Researchers propose Reinforcement Learning via Self-Play (RLSP) as a framework to train LLMs to “think” through complex problems. Key ideas include:   \u003Cbr> ● Emergent reasoning via self-play: RLSP trains an LLM on reasoning  tasks by having it generate solution steps and reward itself for  exploration and correctness, effectively enabling it to search for  answers like an algorithm.   \u003Cbr> ● Three-phase training: (1) Begin with supervised fine-tuning on human  or synthetic reasoning traces, (2) add an exploration reward to  encourage trying diverse solution paths, and (3) employ an outcome  verifier in RL to ensure answers are correct (preventing reward  hacking).   \u003Cbr> ● Notable performance gains: On math benchmarks, a relatively small  model (8B) fine-tuned with RLSP saw +23% accuracy on MATH dataset, and  a 32B model gained +10% on challenging Olympiad problems—significant  jumps achieved by training for better reasoning.   \u003Cbr> ● Uncovering new behaviors: RLSP-trained models exhibit emergent  problem-solving behaviors like backtracking on flawed steps and  self-verification of answers. This suggests that appropriately scaling  the training process can induce more robust reasoning capabilities in  LLMs. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06773), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1889697727703134544) |\n| 4) Competitive Programming with Large Reasoning Models  OpenAI’s latest study puts a specialized coding AI against a scaled-up general model on competitive programming challenges to explore efficiency vs. specialization. Key findings:   \u003Cbr> ● Generalist vs. 
specialist: A tailored model (o1-ioi) with  hand-crafted strategies for coding competitions achieved decent  results (placing ~50th percentile at IOI 2024 with some relaxed  competition constraints). However, a larger, general-purpose model  (o3) attained gold   medal-level performance without any domain-specific tricks.   \u003Cbr> ● Reinforcement learning payoff: Both models were improved via RL  fine-tuning, but the scaled general model outperformed the expert  pipeline, solving programming tasks at a level comparable to elite  human coders (even matching top human ratings on Codeforces).   \u003Cbr> ● Efficiency through scale: The results suggest that investing compute  in a bigger, broadly-trained transformer can yield greater efficiency  and performance than building task-specific optimizations. In other  words, scaling up a model’s reasoning ability can supersede manual  efficiency tweaks for complex tasks.   \u003Cbr> ● Implication: For difficult reasoning tasks like coding, a single  large model with sufficient training can simplify deployment (no  custom inference routines needed) and still beat highly optimized  specialist systems, pointing toward a trend of “scale over  special-case” in transformer design. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06807), [Tweet](https:\u002F\u002Fx.com\u002Farankomatsuzaki\u002Fstatus\u002F1889522974467957033) |\n| 5) Training Language Models to Reason Efficiently  A new RL approach teaches large reasoning models to allocate their reasoning effort efficiently, reducing wasted computation on easy problems. Key points include:   \u003Cbr> ● Dynamic compute allocation: The method trains an LLM to adjust the  length of its CoT based on problem difficulty. Easy queries trigger  short reasoning, while hard ones use deeper thought, optimizing  inference time without sacrificing accuracy.   
\u003Cbr> ● RL-driven efficiency: Through RL, the model is rewarded for solving  tasks correctly with minimal steps, learning to avoid “overthinking.”  This yields a family of models along an efficiency spectrum controlled  by a single hyperparameter (trading off speed vs. accuracy).   \u003Cbr> ● Big cost savings: On benchmark reasoning tasks, this trained model  cut down inference computation significantly while maintaining almost  the same performance as unconstrained reasoning. It learns when extra  reasoning steps are unnecessary, which is crucial for deploying  advanced LLMs cost-effectively.   \u003Cbr> ● Efficient reasoning at scale: The approach addresses the multi-agent  style problem internally – the model acts as both “thinker” and  “controller,” deciding how much reasoning to do. This   result moves us toward LLMs that can self-optimize their reasoning  process on the fly, much like an expert deciding when enough analysis  has been done. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04463), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1889328796224127428) |\n| 6) Large Memory Models  Large Memory Models (LM2) is a transformer architecture augmented with an external memory module to tackle tasks requiring extensive reasoning and long context. Key highlights include:   \u003Cbr> ● Memory-augmented transformer: LM2 adds a dedicated memory repository  that the model can read\u002Fwrite via cross-attention, enabling it to  store and retrieve information across many reasoning steps. This  design addresses the limitations of standard transformers in tasks  like multi-hop reasoning and relational argumentation.   \u003Cbr> ● Superior long-term reasoning: On the BABILong benchmark for  long-context reasoning, LM2 dramatically outperformed prior models –  37% better than a recurrent memory transformer and 86% better than a  baseline Llama model on average. 
It excels at multi-hop inference,  numeric reasoning, and QA over long documents.   \u003Cbr> ● No trade-off in generality: Impressively, LM2 maintained strong  general performance – e.g. a +5% boost on the MMLU knowledge test over  a baseline – indicating the memory module helps complex tasks without  hurting normal language understanding.   \u003Cbr> ● Alignment via memory: These results underscore the importance of  explicit memory for aligning AI reasoning with complex tasks. By  integrating a large-scale memory, we get models that can better adhere  to task objectives over long dialogues or reasoning chains, a step  forward for building more aligned and capable AI systems. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06049), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1889681118913577345) |\n| 7) Auditing Prompt Caching  Researchers from Stanford investigate how timing differences in LLM APIs can leak private user information through global prompt caching. They propose statistical audits to detect caching and reveal potentially significant security risks. Key insights include:   \u003Cbr> ● Side-channel timing attacks – When an LLM API caches prompts  globally, repeat or prefix-matching prompts complete faster. Attackers  can exploit these timing differences to infer what others have  prompted, posing serious privacy concerns.   \u003Cbr> ● Statistical audit for detection – The paper introduces a  hypothesis-testing method to systematically detect caching,  distinguishing cache hits from misses using carefully constructed  prompts. Empirically, the authors found multiple major API providers  using global caches.   \u003Cbr> ● Architecture leakage – Timing differences for partial-prefix cache  hits indicate a decoder-only Transformer backbone. The authors  demonstrated that embedding models like OpenAI’s  text-embedding-3-small are also susceptible, inadvertently leaking  proprietary architectural details.   
\u003Cbr> ● Responsible disclosure & mitigations – The authors notified affected  API providers, many of whom updated documentation or disabled global  caching. The recommended fix is per-user caching and transparent  disclosures of caching policies to avoid privacy leakages. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.07776), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1889685386856673463) |\n| 8) Step Back to Leap Forward  To boost the reasoning robustness of LLMs, researchers propose a “self-backtracking” mechanism that lets models revisit and revise their own intermediate reasoning steps. Key details:   \u003Cbr> ● Inspiration from search algorithms: Traditional problem-solving  backtracks when a path hits a dead-end. This approach gives LLMs a  similar ability – during reasoning, the model can   identify when its current CoT is likely wrong and backtrack to a  previous step to try a different approach.   \u003Cbr> ● Implementation: The team trained an LLM with signals to decide when  to backtrack during both training and inference. This helps the model  internalize an iterative search process, rather than strictly  following a single chain-of-thought that might be flawed.   \u003Cbr> ● Huge reasoning gains: Empirically, adding self-backtracking led to  40%+ improvement on complex reasoning benchmarks compared to standard  fine-tuning. The model learns to correct its own mistakes mid-stream,  resulting in more reliable and accurate solutions.   \u003Cbr> ● Towards resilient reasoners: By reducing “overthinking” loops and  reliance on external feedback, this technique makes LLMs more  autonomous and robust in reasoning. It points to a future where LLMs  can more rigorously self-evaluate and refine their reasoning, much  like humans reflecting on and correcting their thought process. 
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04404), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1888967415444414802) |\n| 9) Enhancing Reasoning to Adapt LLMs  Researchers from IBM present SOLOMON, a neuro-inspired LLM reasoning network architecture that boosts domain adaptability—demonstrated on semiconductor layout design. They show how LLMs often falter at spatial reasoning and domain knowledge application, and how their multi-agent oversight approach significantly improves success on challenging chip-layout tasks. Key insights include:   \u003Cbr> ● SOLOMON architecture – Combines multiple “Thought Generators”  (diverse LLMs) with a “Thought Assessor” that consolidates and refines  outputs, guided by a “Steering Subsystem” for prompt engineering. This  neuro-inspired design helps correct hallucinations and arithmetic  errors in single-model responses.   \u003Cbr> ● Spatial reasoning challenges – LLMs often memorize textbook  definitions but fail at practical geometry (e.g. unit conversions,  offset margins). Experiments on 25 custom tasks—from simple polygons  to 3D via connections—revealed frequent code or scaling mistakes.   \u003Cbr> ● Boost over strong baselines – SOLOMON significantly outperformed  GPT-4o, Claude-3.5, and Llama-3.1 in generating correct GDSII layouts,  and in some tests even surpassed the authors’ “o1-preview” reference  model. The multi-LLM approach mitigated errors (e.g., ignoring default  units or mixing up geometry).   \u003Cbr> ● Future directions – Plans include stacking multiple SOLOMON layers  for more complex designs, improving multimodal linking of  text\u002Fimage\u002Fcode, and broader domain tasks (e.g. power grid layout).  The broader lesson: advanced reasoning mechanisms, not just bigger  models, are crucial for specialized engineering applications. 
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04384), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1888985789880758426) |\n| 10) ReasonFlux  The ReasonFlux framework is introduced as an efficient way to fine-tune LLMs for complex reasoning, using hierarchical thought processes. Highlights include:   \u003Cbr> ● Thought template library: Rather than having a model learn long CoT  solutions from scratch, ReasonFlux provides a library of ~500 reusable  “thought templates” – high-level reasoning steps that can be composed  to solve problems. These might be generic strategies like “split the  problem into cases” or “verify the solution,” applicable across tasks.   \u003Cbr> ● Hierarchical planning via RL: The model is trained (with only 8 GPUs  for a 32B model) to plan a sequence of these templates to tackle a  problem, using hierarchical reinforcement learning. This way, it  learns to orchestrate complex reasoning by chaining templates, instead  of generating every reasoning step token-by-token.   \u003Cbr> ● Inference-time adaptation: A novel inference strategy allows the  model to adjust the granularity of its reasoning on the fly, scaling  the template sequence based on difficulty. This   means the model can dynamically decide to use more detailed templates  for hard problems and fewer for easy ones, optimizing both accuracy  and speed.   \u003Cbr> ● State-of-the-art results: ReasonFlux achieved high scores on math  reasoning benchmarks – for example, 91.2% on MATH, outperforming  OpenAI’s reference model by 6.7%, and solved 56.7% of problems on the  AIME Olympiad, vastly surpassing previous models. This demonstrates  that smart fine-tuning with structured reasoning steps can yield big  gains even without massive compute. 
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06772), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1889343676272525600) |\n\n## Top ML Papers of the Week (February 3 - February 9) - 2025\n| **Paper**  | **Links** | \n| ------------- | ------------- | \n| 1) s1: Simple test-time scaling  Researchers from Stanford, UW, and others introduce s1, a method to boost LLM performance by using extra compute at inference (“test-time scaling”). Key ideas include: \u003Cbr> ● Small yet powerful dataset – They curated s1K, only 1,000 challenging questions with detailed reasoning traces, to fine-tune a 32B model. Despite the tiny dataset, it provides strong reasoning exemplars. \u003Cbr> ● “Budget forcing” for reasoning – A new decoding trick appends the token “Wait” when the model tries to stop, forcing it to think longer. This leads the model to double-check and fix its reasoning steps. By also cutting off overly long reasoning, they control inference time. \u003Cbr> ● Big gains over OpenAI’s o1 – The resulting model, s1-32B (a fine-tuned version of Qwen2.5-32B-Instruct), outperforms OpenAI’s o1-preview model by up to +27% on competition-level math questions (MATH & AIME24). Notably, with test-time scaling, it boosts accuracy on AIME24 from 50% to 57%, surpassing its own normal limit. | [Paper](http:\u002F\u002Farxiv.org\u002Fabs\u002F2501.19393), [Tweet](http:\u002F\u002Ftwitter.com\u002Fomarsar0\u002Fstatus\u002F1886428631041225030), [Code & Data](https:\u002F\u002Fgithub.com\u002Fsimplescaling\u002Fs1) |\n| 2) OmniHuman-1: Scaling One-Stage Human Animation  A team at ByteDance AI Lab unveiled OmniHuman-1, a diffusion-transformer model that can generate highly realistic human videos from just a single image plus motion input (audio or video).
Highlights:   \u003Cbr> ● End-to-end human video generation – OmniHuman takes one image (any  aspect ratio, from face only to full-body) and an audio clip or video  motion and produces a lifelike video of that person speaking, singing,  or performing actions. The outputs are remarkably realistic in motion,  lighting, and texture detail.   \u003Cbr> ● Mixed modality training – A key innovation is Omni-Conditions  Training: mixing various motion modalities during training  (audio-driven, video-driven, pose, etc.). This greatly expands the  training data and overcomes the usual scarcity of high-quality  talking-head video data. The model learns to handle diverse inputs  (speech, song, instruments) and challenging poses.   \u003Cbr> ● Outperforms prior methods – Compared to earlier one-stage models  (e.g. audio-driven talking heads), OmniHuman generates more realistic  videos and is more flexible in input types. It can even handle  cartoons or animal figures as input, transferring motion naturally to  each style.   \u003Cbr> ● Broader support – The approach supports any portrait content (face  close-up, half-body,   full-body) and multiple driving signals simultaneously. This  generality is a first for end-to-end human animation models. | [Paper](http:\u002F\u002Farxiv.org\u002Fabs\u002F2502.01061), [Tweet](http:\u002F\u002Ftwitter.com\u002Funseenvie\u002Fstatus\u002F1886672598576325011), [Demo](https:\u002F\u002Fomnihuman-lab.github.io\u002F) |\n| 3) LIMO: Less Is More for Reasoning  Can a handful of examples teach complex math reasoning to LLMs? This new LIMO paper challenges the notion that we need huge fine-tuning datasets for tough reasoning tasks. Key findings:   \u003Cbr> ● Surprisingly few examples – With only 817 carefully curated training  samples, the LIMO model achieves 57.1% accuracy on the AIME math  competition and 94.8% on MATH. 
This is a giant leap from prior SFT-based models (which scored 6.5% and 59.2% respectively), using just 1% of the data those earlier approaches needed. \u003Cbr> ● Generalization with less data? – LIMO shows impressive OOD generalization: a +40.5% absolute improvement on average across 10 diverse benchmarks, even outperforming models trained on 100× more data. This challenges the assumption that more data is always required for complex skills and that fine-tuning only leads to memorization. \u003Cbr> ● “Less-Is-More” Hypothesis – The authors propose that if an LLM’s pre-training has already endowed it with rich knowledge, then only a minimal set of carefully designed examples (which they call “cognitive templates”) is needed to unlock advanced reasoning. Essentially, the model just needs to see how to use its knowledge, not thousands of repetitive problems. \u003Cbr> ● Open-source suite – The complete LIMO training suite is released for the community, supporting further research on data-efficient reasoning. This work hints that small, high-quality datasets might yield state-of-the-art reasoning, lowering the barrier to fine-tuning powerful LLMs. | [Paper](http:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03387), [Tweet](http:\u002F\u002Ftwitter.com\u002Fomarsar0\u002Fstatus\u002F1887514592747937984), [Code](https:\u002F\u002Fgithub.com\u002FGAIR-NLP\u002FLIMO) |\n| 4) CoAT: Chain-of-Associated-Thoughts for LLM Reasoning  This work introduces CoAT, a new “slow thinking” inference framework that enables an LLM to reason more like a human by exploring and updating its thoughts. Main components: \u003Cbr> ● MCTS + associative memory – CoAT marries Monte Carlo Tree Search (MCTS) with an associative memory mechanism.
MCTS lets the model  systematically explore different reasoning branches (possible  solutions), while the associative memory dynamically injects new  relevant information into the context as needed (mimicking how humans  recall facts mid-thought).   \u003Cbr> ● Iterative, self-improving reasoning – The framework can expand the  search space of solutions and revisit or refine earlier intermediate  conclusions. As it evaluates branches, it can incorporate new clues or  correct itself, ensuring the final answer is more accurate and  comprehensive. This is in contrast to standard one-pass LLM reasoning,  which can’t easily backtrack or gather new info on the fly.   \u003Cbr> ● Improved accuracy and diversity – In experiments across various  generation and reasoning tasks, CoAT outperformed conventional  single-pass inference on metrics like accuracy, coherence of reasoning  steps, and solution diversity. The ability to iteratively broaden the  search while keeping relevant context yields better results than “fast  thinking” alone.   \u003Cbr> ● Closer to human thought – CoAT is inspired by how humans solve  problems: we iteratively consider alternatives, recall facts, and  refine our thinking. It points toward LLM agents that can use search  algorithms and memory to achieve more reliable reasoning. | [Paper](http:\u002F\u002Farxiv.org\u002Fabs\u002F2502.02390), [Tweet](http:\u002F\u002Ftwitter.com\u002Fomarsar0\u002Fstatus\u002F1887187689247752370) |\n| 5) Syntriever: Training Retrievers with LLM-Generated Data  How can we build a high-quality text retriever without large labeled datasets or access to an LLM’s internals? Syntriever presents a two-stage framework to distill knowledge from a black-box LLM into a retrieval model using synthetic data. Steps:   \u003Cbr> ● Stage 1 – Distillation via synthetic Q&A: Given a query, they prompt  a powerful LLM (e.g. 
GPT-4) to generate a relevant passage (answer)  and also plausible but incorrect passages, using chain-of-thought to  ensure variety. The LLM then self-verifies these generated passages to  filter out any hallucinations or low-quality data. The result is a  synthetic dataset of queries with positive and negative passages. A  retriever is trained on this, with a loss that clusters embeddings of  relevant passages closer than irrelevant ones.   \u003Cbr> ● Stage 2 – Alignment with LLM preferences: They further align the  retriever to prefer results the LLM would prefer. Using a partial  Plackett-Luce ranking method, the retriever learns to rank passages  similarly to the LLM’s judgments, with regularization to not drift too  far from the Stage 1 model. This step fine-tunes the retriever to  mimic the black-box LLM’s preferences.   \u003Cbr> ● State-of-the-art results – Syntriever achieves new SOTA on several  retrieval benchmarks across domains. This was achieved without any  real training queries: all training data was synthetically generated  by the LLM.   \u003Cbr> ● No logits needed – Prior LLM-to-retriever distillation needed model  logits or probabilities (not available from closed APIs). Syntriever  gets around this by using only generated text and LLM scoring, making  it applicable even to closed models. | [Paper](http:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03824), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1887878242276954557), [Code](https:\u002F\u002Fgithub.com\u002Fkmswin1\u002FSyntriever) |\n| 6) Demystifying Long Chain-of-Thought Reasoning in LLMs  This work investigates how LLMs develop extended CoT reasoning, focusing on RL and compute scaling. Key insights include:   \u003Cbr> ● Supervised fine-tuning (SFT) boosts performance – While not strictly  necessary, SFT simplifies training and increases efficiency. Models  fine-tuned with long CoT data achieve higher accuracy than those using  short CoT sequences.   
\u003Cbr> ● Reward shaping is crucial for stable RL – The study finds that naive  RL approaches don’t always extend CoT length effectively. To address  this, the authors introduce a cosine   length-scaling reward with repetition penalties, which balances  reasoning depth and prevents meaningless length increases.   \u003Cbr> ● Scaling verifiable reward signals – RL models trained with noisy,  web-extracted “silver” supervision signals can generalize better to  OOD tasks, such as STEM reasoning. Filtering such data is crucial to  maintaining training stability.   \u003Cbr> ● Emergent reasoning abilities in base models – Skills like error  correction and backtracking exist in base models but require careful  RL incentives to be effectively utilized in complex tasks.  This paper provides a structured roadmap for researchers looking to refine CoT training strategies for LLMs, highlighting how RL and reward tuning impact reasoning depth. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03373), [Tweet](https:\u002F\u002Fx.com\u002Fxiangyue96\u002Fstatus\u002F1887332772198371514) |\n| 7) Rethinking Mixture-of-Agents: Ensemble One Strong LLM  Ensembling multiple models (Mixture-of-Agents, MoA) is a popular way to boost performance. This paper asks: is mixing different LLMs actually helpful, or are we better off ensembling one top model’s outputs? The surprising answer: “Self-MoA” (single-model ensemble) often wins over multi-model ensembles. Key points:   \u003Cbr> ● Self-MoA vs. MoA – The authors propose Self-MoA, which simply  generates multiple outputs from the single best model and then  aggregates them (e.g., by majority voting or ranking), instead of  combining outputs from various models. This increases diversity via  multiple attempts, without introducing weaker models.   \u003Cbr> ● Better performance – Extensive tests show Self-MoA outperforms a  mixture of different LLMs in many cases. 
For example, using one strong  model, Self-MoA achieved +6.6% higher score than a mixed-model MoA on  the AlpacaEval 2.0 benchmark, and on average +3.8% across tasks like  MMLU, CRUX, and MATH. In fact, applying Self-MoA to a top AlpacaEval  model set a new state-of-the-art on the leaderboard.   \u003Cbr> ● Why it works – Mixing models can hurt because the overall quality is  limited by the weaker members. The study finds MoA’s benefit is highly  sensitive to the quality of each model – adding a weaker model dilutes  performance. Unless all models are very strong and complementary,  you’re better off with one model’s outputs. They do identify niche  scenarios where diverse models help, but those are exceptions.   \u003Cbr> ● Sequential aggregation – They also introduce a sequential version of  Self-MoA that can combine a large number of outputs over multiple  rounds (rather than all at once). This sequential Self-MoA is as  effective as one-shot aggregation, scaling ensembling to many outputs  efficiently. | [Paper](http:\u002F\u002Farxiv.org\u002Fabs\u002F2502.00674), [Tweet](http:\u002F\u002Ftwitter.com\u002Fomarsar0\u002Fstatus\u002F1886792384954163347) |\n| 8) MaAS: Multi-agent Architecture Search (Agentic Supernet)  Building multi-agent systems of LLMs (where multiple agents collaborate, each with specific roles or tools) is powerful but usually requires hand-designing a single complex pipeline. MaAS (Multi-agent Architecture Search) instead learns a universal “agentic supernet” from which it can spawn an optimal agent team on the fly for each query. It automates designing the agent workflow per task:   \u003Cbr> ● Agentic supernet – The authors define a continuous space of possible  agent architectures (chains of LLM calls, tool uses, etc.). Rather  than picking one static architecture, they train a supernet that  encompasses many configurations. Each query can trigger a different   sub-network of agents tailored to that query’s domain and difficulty.   
\u003Cbr> ● Dynamic resource allocation – Because the system adapts per query,  it can allocate resources efficiently. Easy questions might use a  simple, fast agent chain; hard problems invoke a more elaborate  reasoning team. This avoids the one-size-fits-all cost of a monolithic  agent system.   \u003Cbr> ● Huge cost savings – On six benchmarks, MaAS used only 6–45% of the  inference cost of existing multi-agent pipelines, yet still  outperformed them by ~0.5–11.8% in accuracy. It finds cheaper ways to  reach equal or better performance by tuning the agent configuration to  the task.   \u003Cbr> ● Robust and transferable – The agentic supernet approach showed  strong generalization: architectures found effective on one task  transferred well to new domains and even with different LLM backbones,  outperforming static designs. This suggests the method learns general  principles of how to orchestrate LLM agents optimally. | [Paper](http:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04180), [Tweet](http:\u002F\u002Ftwitter.com\u002Fomarsar0\u002Fstatus\u002F1887884027530727876) |\n| 9) Advancing Reasoning in LLMs  This survey paper provides a timely overview of emerging methods to enhance reasoning capabilities in LLMs. It organizes the literature into several key approach categories:   \u003Cbr> ● Prompting strategies – Techniques that guide the model’s reasoning  via clever prompts, e.g. Chain-of-Thought prompting (having the model  generate step-by-step solutions),   Self-Consistency (sampling multiple reasoning paths and choosing the  best answer), Tree-of-Thought strategies, etc. These methods improve  logical deduction and multi-step solutions without changing the  model’s architecture.   \u003Cbr> ● Architectural innovations – Modifications to the model or its  context to better facilitate reasoning. 
This includes retrieval-augmented models (LLMs that can fetch external facts), modular reasoning networks (systems that break a problem into sub-tasks handled by different modules or experts), and neuro-symbolic integration (combining neural nets with symbolic logic or tools). Such changes aim to give LLMs access to either more knowledge or more structured reasoning processes. \u003Cbr> ● Learning paradigms – New training methods to instill reasoning skills: fine-tuning on reasoning-specific datasets (e.g. math word problems), reinforcement learning approaches (rewarding correct reasoning chains), and self-supervised objectives that train the model to reason (like predicting masked steps in a proof). These improve the model’s inherent reasoning ability beyond what general pre-training provides. \u003Cbr> ● Evaluation & challenges – The survey also reviews how we evaluate reasoning in LLMs (benchmarks for logic, math, commonsense, etc.) and identifies open challenges. Key issues include hallucinations (the model fabricating illogical or untrue intermediate steps), brittleness to small changes (robustness), and generalization of reasoning methods across different tasks and domains. Addressing these will be crucial for the next generation of reasoning-augmented LLMs. | [Paper](http:\u002F\u002Farxiv.org\u002Fabs\u002F2502.03671), [Tweet](http:\u002F\u002Ftwitter.com\u002Fomarsar0\u002Fstatus\u002F1887875470269849659) |\n| 10) Survey: Text Data Augmentation for LLMs  This comprehensive survey covers text data augmentation techniques for LLMs. As LLMs demand massive training data, augmenting datasets with synthetic or transformed text is vital.
In this paper: \u003Cbr> ● Classifies augmentation methods – It defines four categories: (1) Simple augmentation – basic text manipulations like synonym replacement, cropping, etc.; (2) Prompt-based augmentation – using an LLM with specific prompts to generate new training examples (taking advantage of the LLM’s own generative power); (3) Retrieval-based augmentation – pulling in external knowledge or contexts (via search or databases) to ground the generated text in facts; and (4) Hybrid augmentation – combinations of the above, or multi-step strategies. \u003Cbr> ● LLMs as data generators – A key insight is that modern LLMs can create high-quality synthetic data to improve themselves. By carefully prompting an LLM to produce variations of a task (for example, asking ChatGPT to come up with new math word problems), one can dramatically expand a training set. The survey discusses prompt design for this purpose and how to ensure the generated data is diverse and useful. \u003Cbr> ● Post-processing and filtering – Augmented data isn’t always perfect. The survey covers techniques to refine and filter generated data, for instance verifying facts with a secondary model or removing examples that might introduce errors. This step is crucial to prevent “garbage in, garbage out” when augmenting data. \u003Cbr> ● Evaluation and future directions – It outlines common tasks where data augmentation is used (like low-resource language translation, QA, etc.) and how to evaluate the impact (improvement in accuracy, robustness, etc.). Finally, it discusses challenges (e.g. ensuring augmentation doesn’t distort data distribution, avoiding model bias reinforcement) and opportunities for new research.
| [Paper](http:\u002F\u002Farxiv.org\u002Fabs\u002F2501.18845), [Tweet](http:\u002F\u002Ftwitter.com\u002Fomarsar0\u002Fstatus\u002F1886428687350006067) |\n\n## Top ML Papers of the Week (January 27 - February 2) - 2025\n| **Paper**  | **Links** | \n| ------------- | ------------- | \n| 1) o3-mini  OpenAI has launched o3-mini, their newest cost-efficient reasoning model, available in ChatGPT and API. The model excels in STEM-related tasks, particularly in science, math, and coding, while maintaining the low cost and reduced latency of its predecessor o1-mini. It introduces key developer features like function calling, Structured Outputs, and developer messages, making it  production-ready from launch.  o3-mini includes different reasoning effort levels (low, medium, and high) and improves performance across a wide range of tasks. It delivered responses 24% faster than o1-mini and achieved notable results in competition math, PhD-level science questions, and software engineering tasks. | [System Card](https:\u002F\u002Fcdn.openai.com\u002Fo3-mini-system-card.pdf), [Blog](https:\u002F\u002Fopenai.com\u002Findex\u002Fopenai-o3-mini\u002F), [Tweet](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1885406586136383634) |\n| 2) Qwen2.5-1M  Qwen releases two open-source LLMs, Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, that can handle context lengths of up to 1 million tokens.  The models are built on a progressive training approach, starting with 4K tokens and gradually increasing to 256K tokens, then using length extrapolation techniques to reach 1M tokens. They've also released an inference framework based on vLLM that processes long inputs 3-7x faster through sparse attention methods.  The models show strong performance on both long-context and short-text tasks. The 14B model outperforms GPT-4o-mini across multiple long-context datasets while maintaining similar performance on shorter tasks. 
| [Paper](https:\u002F\u002Fqianwen-res.oss-cn-beijing.aliyuncs.com\u002FQwen2.5-1M\u002FQwen2_5_1M_Technical_Report.pdf), [Models](https:\u002F\u002Fhuggingface.co\u002FQwen),  [Qwen Chat App](https:\u002F\u002Fchat.qwenlm.ai\u002F), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1883905564004241789) |\n| 3) Janus-Pro  An enhanced version of the previous Janus model for multimodal understanding and generation. The model incorporates three key improvements: optimized training strategies with longer initial training and focused fine-tuning, expanded training data including 90 million new samples for understanding and 72 million synthetic aesthetic samples for generation, and scaling to larger model sizes up to 7B parameters.  Janus-Pro achieves significant improvements in both multimodal understanding and text-to-image generation capabilities. The model outperforms existing solutions on various benchmarks, scoring 79.2 on MMBench for understanding tasks and achieving 80% accuracy on GenEval for text-to-image generation. The improvements also enhance image generation stability and quality, particularly for short prompts and fine details, though the current 384x384 resolution remains a limitation for certain tasks. | [Paper](https:\u002F\u002Fgithub.com\u002Fdeepseek-ai\u002FJanus\u002Fblob\u002Fmain\u002Fjanus_pro_tech_report.pdf), [Models](https:\u002F\u002Fhuggingface.co\u002Fdeepseek-ai\u002FJanus-Pro-7B), [Tweet](https:\u002F\u002Fx.com\u002Fgiffmana\u002Fstatus\u002F1884011657191637126) |\n| 4) On the Underthinking of o1-like LLMs  This work looks more closely at the \"thinking\" patterns of o1-like LLMs. We have seen a few recent papers pointing out the issues with overthinking.  There is now a new phenomenon called underthinking! What is it about? The authors find that o1-like LLMs frequently switch between different reasoning thoughts without sufficiently exploring promising paths to reach a correct solution. 
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.18585), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1885349576456233177) |\n| 5) Diverse Preference Optimization  Introduces Diverse Preference Optimization (DivPO), a novel training method that aims to address the lack of diversity in language model outputs while maintaining response quality. The key challenge is that current preference optimization techniques like RLHF tend to sharpen the output probability distribution, causing models to generate very similar responses. This is particularly problematic for creative tasks where varied outputs are desired.  DivPO works by modifying how training pairs are selected during preference optimization. Rather than simply choosing the highest and lowest rewarded responses, DivPO selects the most diverse response that meets a quality threshold and contrasts it with the least diverse response below a threshold. The method introduces a diversity criterion that can be measured in different ways, including model probability, word frequency, or using an LLM as a judge. Experiments on persona generation and creative writing tasks show that DivPO achieves up to 45.6% more diverse outputs in structured tasks and an 81% increase in story diversity, while maintaining similar quality levels compared to baseline methods. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.18101), [Tweet](https:\u002F\u002Fx.com\u002Fjaseweston\u002Fstatus\u002F1885399530419450257) |\n| 6) Usage Recommendation for DeepSeek-R1  This work provides a set of recommendations for how to prompt the DeepSeek-R1 model. Below are the key guidelines: \u003Cbr>\u003Cbr> 1. Prompt Engineering: \u003Cbr> ● Use clear, structured prompts with explicit instructions \u003Cbr> ● Avoid few-shot prompting; use zero-shot instead \u003Cbr>\u003Cbr> 2. Output Formatting: \u003Cbr> ● Specify the desired format (JSON, tables, markdown) \u003Cbr> ● Request step-by-step explanations for reasoning tasks \u003Cbr>\u003Cbr> 3. Language: \u003Cbr> ● Explicitly specify input\u002Foutput language to prevent mixing \u003Cbr>\u003Cbr> The paper also summarizes when to use the different model variants, when to fine-tune, and other safety considerations. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.17030), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1884624296368292083) |\n| 7) Docling  [Docling](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.17887) is an open-source toolkit that can parse several types of popular document formats into a unified, richly structured representation. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.17887) |\n| 8) Improving RAG through Multi-Agent RL  This work treats RAG as a multi-agent cooperative task to improve answer generation quality. It models RAG components like query rewriting, document selection, and answer generation as reinforcement learning agents working together toward generating accurate answers. It applies Multi-Agent Proximal Policy Optimization (MAPPO) to jointly optimize all agents with a shared reward based on answer quality.  Besides improvements on popular benchmarks, the framework shows strong generalization capabilities in out-of-domain scenarios and maintains effectiveness across different RAG system configurations. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.15228), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1884249075467575362) |\n| 9) TensorLLM  Proposes a framework that performs MHA compression through a multi-head tensorisation process and the Tucker decomposition. Achieves a compression rate of up to ~250x in the MHA weights, without requiring any additional data, training, or fine-tuning.
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.15674), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1884246306224496729) |\n| 10) TokenVerse  Proposes a new technique to generate new images from learned concepts in a desired configuration. Proposed by Google DeepMind and collaborators, TokenVerse enables multi-concept personalization  by leveraging a pre-trained text-to-image diffusion model to disentangle and extract complex visual concepts from multiple images.  It operates in the modulation space of DiTs, learning a personalized modulation vector for each text token in an input caption. This allows flexible and localized control over distinct concepts such as objects, materials, lighting, and poses. The learned token modulations can then be combined in novel ways to generate new images that integrate multiple personalized concepts without requiring additional segmentation masks. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12224), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1884618510275592610) |\n\n## Top ML Papers of the Week (January 20 - January 26) - 2025\n| **Paper**  | **Links** | \n| ------------- | ------------- | \n| 1) DeepSeek-R1  DeepSeek introduces DeepSeek-R1, an advancement in reasoning capabilities achieved through reinforcement learning (RL). It involves two key models: DeepSeek-R1-Zero, which uses pure RL without supervised fine-tuning, and DeepSeek-R1, which combines RL with cold-start data. DeepSeek-R1-Zero demonstrates that models can develop sophisticated reasoning abilities through RL alone, achieving a 71.0% pass rate on AIME 2024 and matching OpenAI-o1-0912's performance. During training, it naturally evolved complex behaviors like self-verification and reflection. However, it faced challenges with readability and language mixing.  
To address these limitations, DeepSeek-R1 uses a multi-stage approach: initial fine-tuning with high-quality chain-of-thought examples, reasoning-focused RL training, collecting new training data through rejection sampling, and final RL optimization across all scenarios. This resulted in performance comparable to OpenAI-o1-1217, with 79.8% accuracy on AIME 2024 and 97.3% on MATH-500, while maintaining output readability and consistency. DeepSeek also successfully distilled DeepSeek-R1's capabilities into smaller models, with their 7B model outperforming larger competitors and their 32B model achieving results close to OpenAI-o1-mini. This demonstrates the effectiveness of distilling reasoning patterns from larger models rather than training smaller models directly through RL. | [Paper](https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf), [Tweet](https://x.com/deepseek_ai/status/1881318130334814301), [Code](https://huggingface.co/deepseek-ai), [App](https://chat.deepseek.com/) |
| 2) Humanity’s Last Exam  Humanity's Last Exam is a new multi-modal benchmark designed to test the limits of LLMs. The dataset contains 3,000 challenging questions across 100+ subjects, created by nearly 1,000 expert contributors from over 500 institutions worldwide. Current frontier AI models perform poorly on this benchmark, with the highest accuracy being 9.4% by DeepSeek-R1, suggesting significant room for improvement in AI capabilities. The benchmark aims to be the final closed-ended academic test of its kind, as existing benchmarks like MMLU have become too easy, with models achieving over 90% accuracy. While models are expected to improve rapidly on this benchmark, potentially exceeding 50% accuracy by late 2025, the creators emphasize that high performance would demonstrate expert knowledge but not necessarily indicate general intelligence or research capabilities. | [Paper](https://static.scale.com/uploads/654197dc94d34f66c0f5184e/Publication%20Ready%20Humanity%27s%20Last%20Exam.pdf), [Tweet](https://x.com/DanHendrycks/status/1882433928407241155), [Dataset](https://huggingface.co/datasets/cais/hle) |
| 3) Scaling RL with LLMs  Kimi introduces k1.5, a multimodal LLM trained using RL that achieves state-of-the-art performance across reasoning tasks. The model leverages long-context scaling up to 128k tokens and improved policy optimization methods, establishing a simplified yet effective RL framework without complex techniques like Monte Carlo tree search or value functions. Notably, k1.5 matches OpenAI's o1 performance on various benchmarks, including 77.5 on AIME and 96.2 on MATH 500. The model also introduces effective long2short methods that use long-chain-of-thought techniques to improve shorter models, achieving superior results in constrained settings. Using these techniques, k1.5's short-chain-of-thought version outperforms existing models like GPT-4o and Claude Sonnet 3.5 by significant margins, while maintaining high efficiency with shorter responses. | [Paper](https://github.com/MoonshotAI/Kimi-k1.5/blob/main/Kimi_k1.5.pdf), [Tweet](https://x.com/omarsar0/status/1881749719212552280), [GitHub](https://github.com/MoonshotAI/Kimi-k1.5) |
| 4) Chain-of-Agents  A new framework for handling long-context tasks using multiple LLM agents working together. CoA splits text into chunks and assigns worker agents to process each part sequentially, passing information between them before a manager agent generates the final output. This approach avoids the limitations of traditional methods like input reduction or window extension. Testing across multiple datasets shows CoA outperforms existing approaches by up to 10% on tasks like question answering and summarization. The framework works particularly well with longer inputs, showing up to 100% improvement over baselines when processing texts over 400k tokens. | [Paper](https://openreview.net/pdf?id=LuCLf4BJsr), [Tweet](https://x.com/omarsar0/status/1882824941101629829) |
| 5) Can LLMs Plan?  Proposes an enhancement to Algorithm-of-Thoughts (AoT+) that achieves SoTA results on planning benchmarks, even outperforming human baselines. AoT+ provides periodic state summaries to reduce the cognitive load, allowing the system to focus on the planning process itself rather than struggling to maintain the problem state. | [Paper](https://arxiv.org/abs/2501.13545), [Tweet](https://x.com/omarsar0/status/1882799782579855518) |
| 6) Hallucinations Improve LLMs in Drug Discovery  Claims that LLMs can achieve better performance in drug discovery tasks with text hallucinations compared to input prompts without hallucination. Llama-3.1-8B achieves an 18.35% gain in ROC-AUC compared to the baseline without hallucination. In addition, hallucinations generated by GPT-4o provide the most consistent improvements across models. | [Paper](https://arxiv.org/abs/2501.13824), [Tweet](https://x.com/omarsar0/status/1882789456522145802) |
| 7) Trading Test-Time Compute for Adversarial Robustness  Shows preliminary evidence that giving reasoning models like o1-preview and o1-mini more time to "think" during inference can improve their defense against adversarial attacks. Experiments covered various tasks, from basic math problems to image classification, showing that increasing inference-time compute often reduces the success rate of attacks to near zero. The approach doesn't work uniformly across all scenarios, particularly with certain StrongREJECT benchmark tests, and controlling how models use their compute time remains challenging. Despite these constraints, the findings suggest a promising direction for improving AI security without relying on traditional adversarial training methods. | [Paper](https://cdn.openai.com/papers/trading-inference-time-compute-for-adversarial-robustness-20250121_1.pdf), [Tweet](https://x.com/OpenAI/status/1882129444212740482) |
| 8) IntellAgent  Introduces a new open-source framework for evaluating conversational AI systems through automated, policy-driven testing. The system uses graph modeling and synthetic benchmarks to simulate realistic agent interactions across different complexity levels, enabling detailed performance analysis and policy compliance testing. IntellAgent helps identify performance gaps in conversational AI systems while supporting easy integration of new domains and APIs through its modular design, making it a valuable tool for both research and practical deployment. | [Paper](https://arxiv.org/abs/2501.11067), [Tweet](https://x.com/omarsar0/status/1882081603754643779), [GitHub](https://github.com/plurai-ai/intellagent) |
| 9) LLMs and Behavioral Awareness  Shows that after fine-tuning LLMs on behaviors like outputting insecure code, the LLMs exhibit behavioral self-awareness: without being explicitly trained to do so, a model tuned to output insecure code will state, "The code I write is insecure." The authors find that models can sometimes identify whether or not they have a backdoor, even without its trigger being present, though models are not able to output their trigger directly by default. This behavioral self-awareness in LLMs is not new, but this work shows that it is more general than first understood, which means LLMs have the potential to encode and enforce policies more reliably. | [Paper](https://arxiv.org/abs/2501.11120), [Tweet](https://x.com/omarsar0/status/1882079780918747303) |
| 10) Agentic RAG Overview  Provides a comprehensive introduction to LLM agents and Agentic RAG, exploring Agentic RAG architectures, applications, and implementation strategies. | [Paper](https://arxiv.org/abs/2501.09136), [Tweet](https://x.com/omarsar0/status/1881360794019156362) |

## Top ML Papers of the Week (January 13 - January 19) - 2025
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) Self-Adaptive LLMs - introduces Transformer^2, a novel self-adaptation framework that adapts LLMs to unseen tasks in real time by selectively adjusting the singular components of their weight matrices; it's built with two key phases: 1) a dispatch system that analyzes and identifies the properties of the incoming task, and 2) a step that combines "expert" vectors (trained via reinforcement learning) to create task-specific behaviors; claims to be more efficient than LoRA, with fewer parameters, and can work across different LLM architectures. | [Paper](https://arxiv.org/abs/2501.06252), [Tweet](https://x.com/hardmaru/status/1879331049383334187) |
| 2) MiniMax-01 - introduces a new series of models that integrate Mixture-of-Experts; the model has 32 experts and 456B parameters, of which 45.9B are activated for each token; claims to match the performance of state-of-the-art models like GPT-4o and Claude-3.5-Sonnet while offering a 20-32x longer context window; it can handle context windows of up to 4 million tokens; it integrates linear attention with optimized hardware utilization, which enhances the efficiency and scalability of the LLM; there is also a vision model called MiniMax-VL-01, built through continued training with 512 billion vision-language tokens.
| [Paper](https://arxiv.org/abs/2501.08313), [Tweet](https://x.com/omarsar0/status/1879572512075587872) |
| 3) VideoRAG - a framework that enhances RAG by leveraging video content as an external knowledge source; unlike existing RAG approaches that primarily focus on text or images, VideoRAG dynamically retrieves relevant videos based on queries and incorporates both their visual and textual elements into the generation process; the framework utilizes Large Video Language Models (LVLMs) to process video content directly, enabling more effective capture of temporal dynamics, spatial details, and multimodal cues that static modalities often fail to convey; for videos lacking textual descriptions, they propose using automatic speech recognition to generate transcripts, ensuring both visual and textual modalities can be leveraged. | [Paper](https://arxiv.org/abs/2501.05874), [Tweet](https://x.com/omarsar0/status/1878827350315659421) |
| 4) Learning to Memorize at Test Time - introduces a neural long-term memory module that memorizes historical context and helps attention attend to the current context while utilizing long past information; the neural memory module acts as a longer-term, more persistent memory than attention alone (considered more short-term); Titan, which is based on neural memory, shows good results in language modeling, common-sense reasoning, genomics, and time series tasks. | [Paper](https://arxiv.org/abs/2501.00663), [Tweet](https://x.com/omarsar0/status/1879896681010921742) |
| 5) Foundations of LLMs - a new survey on the foundations of LLMs covering areas such as pre-training, prompting, and alignment methods. | [Paper](https://arxiv.org/abs/2501.09223), [Tweet](https://x.com/omarsar0/status/1880284477445767586) |
| 6) OmniThink - a new framework that emulates a human-like process of iterative expansion and reflection; it's built to simulate the cognitive behavior of learners as they deepen their knowledge; compared to RAG and role-playing, OmniThink can expand knowledge boundaries through continuous reflection and exploration, making it ideal for use cases that require long-form generation. | [Paper](https://arxiv.org/abs/2501.09751), [Tweet](https://x.com/omarsar0/status/1880275861401923619) |
| 7) Enhancing RAG - systematically explores the factors and methods that improve RAG systems, such as retrieval strategies, query expansion, contrastive in-context learning, prompt design, and chunking. | [Paper](https://arxiv.org/abs/2501.07391), [Tweet](https://x.com/omarsar0/status/1879178916021318029) |
| 8) AutoCBT - proposes AutoCBT, a general multi-agent framework for Cognitive Behavioral Therapy that generates high-quality responses for single-turn psychological consultation scenarios; it uses a combination of dynamic routing, memory, and supervisory mechanisms to enhance the autonomy of each agent; experimental results show that AutoCBT can provide higher-quality automated psychological counseling services, improving dialogue quality compared to other purely prompt-based counseling frameworks. | [Paper](https://arxiv.org/abs/2501.09426), [Tweet](https://x.com/omarsar0/status/1880283025595867631) |
| 9) Imagine while Reasoning in Space - introduces MVoT (Multimodal Visualization-of-Thought), a new reasoning framework that enables AI models to "think" in both text and images; MVoT extends traditional Chain-of-Thought prompting by allowing models to generate visual representations of their reasoning steps alongside text explanations; the framework is implemented in Chameleon-7B, a multimodal language model, and introduces a "token discrepancy loss" to improve the quality of generated visualizations; MVoT significantly outperforms traditional approaches, especially in complex scenarios, achieving over 90% accuracy on maze and printer installation tasks. | [Paper](https://arxiv.org/abs/2501.07542), [Tweet](https://x.com/omarsar0/status/1879181711982129420) |
| 10) ChemAgent - presents a new framework designed to improve the performance of LLMs on chemical reasoning through a dynamic, self-updating library; the library is developed by decomposing chemical tasks into sub-tasks and compiling them into a structured collection that can be referenced for future queries; when the system is given a new problem, it retrieves and refines relevant information from the library to enable more effective task decomposition; the library is dynamically updated with new sub-tasks and solutions as they are encountered and validated; experiments on SciBench demonstrate that ChemAgent achieves performance gains of up to 46% (with GPT-4), significantly outperforming existing methods. | [Paper](https://arxiv.org/abs/2501.06590), [Tweet](https://x.com/omarsar0/status/1879188983705747754) |

## Top ML Papers of the Week (January 6 - January 12) - 2025
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) Cache-Augmented Generation (CAG) - an approach that leverages the capabilities of long-context LLMs by preloading the LLM with all relevant docs in advance and precomputing the key-value (KV) cache; the preloaded context helps the model provide contextually accurate answers without the need for additional retrieval during runtime; the authors suggest that CAG is a useful alternative to RAG for cases where the documents/knowledge for retrieval are of limited, manageable size. | [Paper](https://arxiv.org/pdf/2412.15605), [Tweet](https://x.com/omarsar0/status/1876721221083214200) |
| 2) Agent Laboratory - an approach that leverages LLM agents capable of completing the entire research process; the main findings are: 1) agents driven by o1-preview resulted in the best research outcomes, 2) generated machine learning code can achieve state-of-the-art performance compared to existing methods, 3) human feedback further improves the quality of research, and 4) Agent Laboratory significantly reduces research expenses. | [Paper](https://arxiv.org/abs/2501.04227), [Tweet](https://x.com/omarsar0/status/1877382581358047375) |
| 3) Long Context vs. RAG for LLMs - performs a comprehensive evaluation of long-context (LC) LLMs compared to RAG systems; the three main findings are: 1) LC generally outperforms RAG on question-answering benchmarks, 2) summarization-based retrieval performs comparably to LC, while chunk-based retrieval lags behind, and 3) RAG has advantages in dialogue-based and general question queries. | [Paper](https://arxiv.org/abs/2501.01880), [Tweet](https://x.com/omarsar0/status/1876281074147299569) |
| 4) Search-o1 - a framework that combines large reasoning models (LRMs) with agentic search and document refinement capabilities to tackle knowledge insufficiency; the framework enables autonomous knowledge retrieval during reasoning and demonstrates strong performance across complex tasks, outperforming both baseline models and human experts. | [Paper](https://arxiv.org/abs/2501.05366), [Tweet](https://x.com/omarsar0/status/1877742469213004015) |
| 5) Towards System 2 Reasoning - proposes Meta Chain-of-Thought (Meta-CoT), which extends traditional Chain-of-Thought (CoT) by modeling the underlying reasoning required to arrive at a particular CoT; the main argument is that CoT is naive and Meta-CoT gets closer to the cognitive process required for advanced problem-solving.
| [Paper](https://arxiv.org/abs/2501.04682), [Tweet](https://x.com/rm_rafailov/status/1877446475271037314) |
| 6) rStar-Math - proposes three core components to enhance math reasoning: 1) a code-augmented CoT data synthesis method involving MCTS to generate step-by-step verified reasoning trajectories, which are used to train the policy SLM, 2) an SLM-based process preference model (PPM) that reliably predicts a reward label for each math reasoning step, and 3) a self-evolution recipe where the policy SLM and PPM are iteratively evolved to improve math reasoning; on the MATH benchmark, rStar-Math improves Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, surpassing o1-preview by +4.5% and +0.9%. | [Paper](https://arxiv.org/abs/2501.04519), [Tweet](https://x.com/omarsar0/status/1877378301293142050) |
| 7) Cosmos World Foundation Model - a framework for training Physical AI systems in digital environments before real-world deployment; the platform includes pre-trained world foundation models that act as digital twins of the physical world, allowing AI systems to safely learn and interact without risking damage to physical hardware; these models can be fine-tuned for specific applications like camera control, robotic manipulation, and autonomous driving. | [Paper](https://research.nvidia.com/publication/2025-01_cosmos-world-foundation-model-platform-physical-ai), [Tweet](https://x.com/EthanHe_42/status/1876487556755521798) |
| 8) Process Reinforcement through Implicit Rewards - a framework for online reinforcement learning that uses process rewards to improve language model reasoning; the proposed algorithm combines online prompt filtering, RLOO return/advantage estimation, PPO loss, and online updates to an implicit process reward model; their model, Eurus-2-7B-PRIME, achieves 26.7% pass@1 on AIME 2024, surpassing GPT-4 and other models while using only 1/10 of the training data compared to similar models. | [Paper](https://curvy-check-498.notion.site/Process-Reinforcement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2eaf896f), [Tweet](https://x.com/lifan__yuan/status/1874867809983033649) |
| 9) Can LLMs Design Good Questions? - systematically evaluates the quality of questions generated by LLMs; the main findings are: 1) there is a strong preference for asking about specific facts and figures in both LLaMA and GPT models, 2) question lengths tend to be around 20 words, though different LLMs exhibit distinct length preferences, 3) LLM-generated questions typically require significantly longer answers, and 4) human-generated questions tend to concentrate on the beginning of the context, while LLM-generated questions exhibit a more balanced distribution, with a slight decrease in focus at both ends. | [Paper](https://arxiv.org/abs/2501.03491), [Tweet](https://x.com/omarsar0/status/1877008618207560049) |
| 10) A Survey on LLMs - a new survey on LLMs, including insights on capabilities and limitations. | [Paper](https://arxiv.org/abs/2501.04040), [Tweet](https://x.com/omarsar0/status/1877416049999802408) |

## Top ML Papers of the Week (December 30 - January 5) - 2025
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) Agents Are Not Enough - argues that while AI agents show promise, they alone cannot address the challenges of autonomous task execution; proposes a new ecosystem combining three key components: Agents (narrow, purpose-driven modules for specific tasks), Sims (digital representations of user preferences and behaviors), and Assistants (programs that coordinate between users, Sims, and Agents). | [Paper](https://www.arxiv.org/abs/2412.16241), [Tweet](https://x.com/omarsar0/status/1874196827115061741) |
| 2) OLMo 2 - introduces an enhanced architecture, training methods, and a specialized data mixture called Dolmino Mix 1124; the fully transparent model, released at 7B and 13B parameter scales with complete training data and code, matches or outperforms similar open-weight models like Llama 3.1 and Qwen 2.5 while using fewer computational resources, and its instruction-tuned version (OLMo 2-Instruct) remains competitive with comparable models. | [Paper](https://arxiv.org/abs/2501.00656), [Tweet](https://x.com/soldni/status/1875266934943649808) |
| 3) Machine-Assisted Proof - examines how mathematicians have long used machines to assist with mathematics research and discusses recent AI tools that are transforming mathematical proof assistance. | [Paper](https://www.ams.org//notices/202501/rnoti-p6.pdf), [Tweet](https://x.com/omarsar0/status/1873045937259462656) |
| 4) Measuring Higher Level Mathematical Reasoning - introduces Putnam-AXIOM, a new math reasoning benchmark with 236 Putnam Competition problems and 52 variations; even the best model considered (OpenAI's o1-preview) achieves only 41.95% accuracy on the original problems and performs significantly worse on the variations. | [Paper](https://openreview.net/forum?id=YXnwlZe0yf&noteId=yrsGpHd0Sf), [Tweet](https://x.com/omarsar0/status/1874489752243597635) |
| 5) On the Overthinking of LLMs - proposes a self-training strategy to mitigate overthinking in o1-like LLMs; applied to QwQ-32B-Preview, it reduces token output by 48.6% while maintaining accuracy on the widely used MATH500 test set. | [Paper](https://arxiv.org/abs/2412.21187), [Tweet](https://x.com/omarsar0/status/1874848885170176364) |
| 6) MEDEC - introduces MEDEC, a publicly available benchmark for medical error detection and correction in clinical notes, covering five types of errors (Diagnosis, Management, Treatment, Pharmacotherapy, and Causal Organism); it consists of 3,848 clinical texts, including 488 clinical notes from three US hospital systems; experimental results show that Claude 3.5 Sonnet performs better at detecting errors while o1-preview is better at correcting them. | [Paper](https://arxiv.org/abs/2412.19260), [Tweet](https://x.com/omarsar0/status/1875232390265577675) |
| 7) 1.58-bit FLUX - presents the first successful approach to quantizing the state-of-the-art text-to-image generation model FLUX.1-dev using 1.58-bit weights (i.e., values in {-1, 0, +1}); the method relies on self-supervision from the FLUX.1-dev model and maintains comparable performance for generating 1024 x 1024 images as the original FLUX model. | [Paper](https://arxiv.org/abs/2412.18653), [Tweet](https://x.com/_akhaliq/status/1873782702178263549) |
| 8) Aviary - an extensible open-source gymnasium that can help build language agents that exceed the performance of zero-shot frontier LLMs, and even humans, on several challenging scientific tasks. | [Paper](https://arxiv.org/abs/2412.21154), [Tweet](https://x.com/omarsar0/status/1875270927304511535) |
| 9) Memory Layers at Scale - demonstrates the effectiveness of memory layers at scale; shows that models with these memory layers outperform traditional dense models using half the computation, particularly on factual tasks; includes a parallelizable memory layer implementation that scales to 128B memory parameters and 1 trillion training tokens, tested against base models of up to 8B parameters. | [Paper](https://arxiv.org/abs/2412.09764), [Tweet](https://x.com/AIatMeta/status/1874897646542033030) |
| 10) HuatuoGPT-o1 - presents a novel approach to improving medical reasoning in language models by using a medical verifier to validate model outputs and guide the development of complex reasoning abilities; the system employs a two-stage approach combining fine-tuning and reinforcement learning with verifier-based rewards, achieving superior performance over existing models while using only 40,000 verifiable medical problems.
| [Paper](https://arxiv.org/abs/2412.18925), [Tweet](https://x.com/_akhaliq/status/1873572891092283692) |

## Top ML Papers of the Week (December 23 - December 29) - 2024
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) **DeepSeek-V3** - a 671B-parameter MoE language model that activates 37B parameters per token, utilizing MLA and DeepSeekMoE architectures for efficient operation; it introduces an auxiliary-loss-free load balancing approach and employs multi-token prediction during training to enhance performance; following pre-training on 14.8 trillion tokens, the model underwent SFT and RL stages, achieving performance comparable to leading closed-source models while surpassing other open-source alternatives; the model requires only 2.788M H800 GPU hours for training, with stable training that avoids any irrecoverable loss spikes. | [Paper](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf), [Tweet](https://x.com/deepseek_ai/status/1872242657348710721) |
| 2) **Large Concept Models** - presents an approach that operates on sentence-level semantic representations called concepts, moving beyond the token-level processing typical of current LLMs; the model leverages SONAR sentence embeddings to support 200 languages across text and speech modalities, training on autoregressive sentence prediction using various approaches from MSE regression to diffusion-based generation; experiments with 1.6B and 7B parameter variants trained on 1.3T and 7.7T tokens, respectively, demonstrate strong performance on generative tasks like summarization and summary expansion. | [Paper](https://ai.meta.com/research/publications/large-concept-models-language-modeling-in-a-sentence-representation-space), [Tweet](https://x.com/AIatMeta/status/1871263650935365759) |
| 3) **ModernBERT** - a new encoder-only transformer model that achieves state-of-the-art performance on classification and retrieval tasks while being more efficient than previous encoders; it was trained on 2T tokens with an 8192 sequence length and incorporates modern optimizations that represent a significant improvement over BERT; the model is specifically designed for practical deployment, offering superior speed and memory efficiency on common GPUs. | [Paper](https://arxiv.org/abs/2412.13663), [Tweet](https://x.com/jeremyphoward/status/1869786023963832509) |
| 4) **Automating the Search for Artificial Life** - presents a new approach that uses foundation models to automatically discover interesting artificial life simulations across multiple platforms like Boids, Lenia, and Game of Life; the system can find simulations that produce specific target behaviors, discover simulations that generate temporally open-ended novelty, and map out diverse simulation spaces; it discovers new lifeforms in Lenia and Boids while also enabling quantitative measurement of previously qualitative phenomena in a human-aligned way. | [Paper](https://arxiv.org/abs/2412.17799), [Tweet](https://x.com/SakanaAILabs/status/1871385917342265592) |
| 5) **A Survey on LLM Inference-Time Self-Improvement** - presents a survey that analyzes three categories of LLM inference-time self-improvement techniques: independent methods like enhanced decoding, context-aware approaches using external data, and model collaboration strategies. | [Paper](https://arxiv.org/abs/2412.14352), [Tweet](https://x.com/omarsar0/status/1870129825282658752) |
| 6) **Explore Theory-of-Mind** - introduces ExploreToM, a framework that uses A* search to generate diverse, complex theory-of-mind scenarios that reveal significant limitations in current LLMs' social intelligence capabilities; testing showed that even advanced models like GPT-4 and Llama-3 perform poorly (as low as 5% accuracy) on these challenging scenarios, despite their strong performance on simpler benchmarks; fine-tuning on ExploreToM data improved performance on existing benchmarks by 27 points. | [Paper](https://ai.meta.com/research/publications/explore-theory-of-mind-program-guided-adversarial-data-generation-for-theory-of-mind-reasoning/), [Tweet](https://x.com/AIatMeta/status/1869457933727416375) |
| 7) **LearnLM** - a new model that can follow pedagogical instructions, allowing it to adapt its teaching approach based on specified educational needs rather than defaulting to simply presenting information; experimental results show that LearnLM is preferred over other leading models, outperforming GPT-4 by 31%, Claude 3.5 by 11%, and Gemini 1.5 Pro by 13%; this instruction-following approach avoids committing to a single pedagogical framework, instead enabling teachers and developers to specify their desired teaching behaviors while allowing for continuous improvement alongside other capabilities. | [Paper](https://services.google.com/fh/files/misc/improving-gemini-for-education_v7.pdf), [Tweet](https://x.com/Google/status/1869798188233699346) |
| 8) **Empowering MLLM with o1-like Reasoning and Reflection** - proposes a new learning-to-reason method called CoMCTS that enables multimodal language models to develop step-by-step reasoning capabilities by leveraging collective knowledge from multiple models; the approach was used to create Mulberry-260k, a dataset with explicit reasoning trees, which was then used to train the Mulberry model series; the method demonstrates strong performance on benchmarks, with the models showing improved reasoning and reflection capabilities. | [Paper](https://arxiv.org/abs/2412.18319), [Tweet](https://x.com/_akhaliq/status/1872326647606841651) |
| 9) **Reinforcement Learning Overview** - presents a comprehensive overview of reinforcement learning. | [Paper](https://arxiv.org/abs/2412.05265), [Tweet](https://x.com/omarsar0/status/1866123264965419460) |
| 10) **DRT-o1** - applies long chain-of-thought reasoning to machine translation, particularly for handling metaphors and similes across different cultures; the system uses a multi-agent framework with a translator working iteratively with an advisor and evaluator to produce better translations; testing with Qwen2.5 models showed significant improvements in BLEU and CometScore metrics, with DRT-o1-7B outperforming larger models like QwQ-32B-Preview.
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17498), [Tweet](https:\u002F\u002Fx.com\u002F_akhaliq\u002Fstatus\u002F1871455986189574320) |\n\n## Top ML Papers of the Week (December 16 - December 22) - 2024\n| **Paper**  | **Links** | \n| ------------- | ------------- | \n| 1) **Genesis** - a new universal physics simulation platform that combines a high-performance physics engine with generative AI capabilities; it enables natural language-driven creation of robotic simulations, character animations, and interactive 3D environments at speeds up to 430,000 times faster than in real-time. | [Paper](https:\u002F\u002Fgenesis-embodied-ai.github.io\u002F), [Tweet](https:\u002F\u002Fx.com\u002Fzhou_xian_\u002Fstatus\u002F1869511650782658846) |\n| 2) **Alignment Faking in LLMs** - demonstrates that the Claude model can engage in \"alignment faking\"; it can strategically comply with harmful requests to avoid retraining while preserving its original safety preferences; this raises concerns about the reliability of AI safety training methods.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14093), [Tweet](https:\u002F\u002Fx.com\u002FAnthropicAI\u002Fstatus\u002F1869427646368792599) |\n| 3) **TheAgentCompany** - a new benchmark for evaluating AI agents on real-world professional tasks in a simulated software company environment; tasks span multiple professional roles including software engineering, project management, finance, and HR; when tested with various LLMs, including both API-based models like Claude-3.5-Sonnet and open-source models like Llama 3.1, the results show the current limitations of AI agents. The best-performing model, Claude-3.5-Sonnet, achieved only a 24% success rate on completing tasks fully while scoring 34.4% when accounting for partial progress.   
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.14161), [Tweet](https:\u002F\u002Fx.com\u002Fgneubig\u002Fstatus\u002F1869735196700062089) |\n| 4) **Graphs to Text-Attributed Graphs** - automatically generates textual descriptions for nodes in a graph which leads to effective graph to text-attributed graph transformation; evaluates the approach on text-rich, text-limited, and text-free graphs, demonstrating that it enables a single GNN to operate across diverse graphs.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.10136), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1868691391129272461) |\n| 5) **Qwen-2.5 Technical Report** - Alibaba releases Qwen2.5, a new series of LLMs trained on 18T tokens, offering both open-weight models like Qwen2.5-72B and proprietary MoE variants that achieve competitive performance against larger models like Llama-3 and GPT-4. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.15115), [Tweet](https:\u002F\u002Fx.com\u002FAlibaba_Qwen\u002Fstatus\u002F1869950647501824015) |\n| 6) **PAE (Proposer-Agent-Evaluator)** - a learning system that enables AI agents to autonomously discover and practice skills through web navigation, using reinforcement learning and context-aware task proposals to achieve state-of-the-art performance on real-world benchmarks.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.13194) |\n| 7) **DeepSeek-VL2** - a new series of vision-language models featuring dynamic tiling for high-resolution images and efficient MoE architecture, achieving competitive performance across visual tasks; achieves competitive or state-of-the-art performance with similar or fewer activated parameters compared to existing open-source dense and MoE-based models.   
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.10302),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1868696154067865659)  |\n| 8) **AutoFeedback** - a two-agent AI system that generates more accurate and pedagogically sound feedback for student responses in science assessments, significantly reducing common errors like over-praise compared to single-agent models.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.07407)  |\n| 9) **A Survey of Mathematical Reasoning in the Era of Multimodal LLMs** - presents a comprehensive survey analyzing mathematical reasoning capabilities in multimodal large language models (MLLMs), covering benchmarks, methodologies, and challenges across 200+ studies since 2021.   | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.11936), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1870126516832792811)  |\n| 10) **Precise Length Control in LLMs** - adapts a pre-trained decoder-only LLM to produce responses of a desired length; integrates a secondary length-difference positional encoding into the input embeddings which enables counting down to a user-set response terminal length; claims to achieve mean token errors of less than 3 tokens without compromising quality. 
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.11937), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1869030043084845453) |\n\n## Top ML Papers of the Week (December 9 - December 15) - 2024\n| **Paper**  | **Links** | \n| ------------- | ------------- | \n| 1) **Training LLMs to Reason in a Continuous Latent Space** - presents Coconut (Chain of Continuous Thought), a novel paradigm that enables LLMs to reason in continuous latent space rather than natural language; Coconut takes the last hidden state of the LLM as the reasoning state and feeds it back to the LLM as the subsequent input embedding directly in the continuous space; this leads to what the authors refer to as \"continuous thought\" which augments an LLM's capability on reasoning tasks; it demonstrates improved performance on complex reasoning tasks through emergent breadth-first search capabilities.   | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06769), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1866518791733342563) |\n| 2) **Phi-4 Technical Report** - presents phi-4, a 14B model that surpasses its teacher model on STEM-QA capabilities. It also reports strong performance on reasoning-focused benchmarks due to improved data, training curriculum, and innovations in the post-training scheme.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.08905), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1867609628529635574) |\n| 3) **Asynchronous Function Calling** - proposes AsyncLM, a system for asynchronous LLM function calling; they design an in-context protocol for function calls and interrupts, provide fine-tuning strategy to adapt LLMs to the interrupt semantics, and implement these mechanisms efficiently on LLM inference process; AsyncLM can reduce task completion latency from 1.6x-5.4x compared to synchronous function calling; it enables LLMs to generate and execute function calls concurrently. 
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.07017), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1866855077983686804) |\n| 4) **MAG-V** - a multi-agent framework that first generates a dataset of questions that mimic customer queries; it then reverse engineers alternate questions from responses to verify agent trajectories; reports that the generated synthetic data can improve agent performance on actual customer queries; finds that for trajectory verification simple ML baselines with feature engineering can match the performance of more expensive and capable models.   | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04494), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1866143542726340890) |\n| 5) **Clio** - proposes a platform using AI assistants to analyze and surface private aggregated usage patterns from millions of Claude.ai conversations; enables insights into real-world AI use while protecting user privacy; the system helps identify usage trends, safety risks, and coordinated misuse attempts without human reviewers needing to read raw conversations.  | [Paper](https:\u002F\u002Fassets.anthropic.com\u002Fm\u002F7e1ab885d1b24176\u002Foriginal\u002FClio-Privacy-Preserving-Insights-into-Real-World-AI-Use.pdf), [Tweet](https:\u002F\u002Fx.com\u002FAnthropicAI\u002Fstatus\u002F1867325190352576780) |\n| 6) **A Survey on LLMs-as-Judges** - presents a comprehensive survey of the LLMs-as-judges paradigm from five key perspectives: Functionality, Methodology, Applications, Meta-evaluation, and Limitations.  
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.05579),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1866541394015518824)  |\n| 7) **AutoReason Improves Multi-step Reasoning** - proposes a method to automatically generate rationales for queries using CoT prompting; this transforms zero-shot queries into few-shot reasoning traces which are used as CoT exemplars by the LLM; claims to improve reasoning in weaker LLMs.   | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06975),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1867224350287372555)  |\n| 8) **The Byte Latent Transformer (BLT)**- introduces a byte-level language model architecture that matches tokenization-based LLM performance while improving efficiency and robustness; uses a dynamic method of grouping bytes into patches based on the entropy of the next byte, allocating more compute resources to complex predictions while using larger patches for more predictable sequences; BLT demonstrates the ability to match or exceed the performance of models like Llama 3 while using up to 50% fewer FLOPs during inference. | [Paper](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fpublications\u002Fbyte-latent-transformer-patches-scale-better-than-tokens\u002F),  [Tweet](https:\u002F\u002Fx.com\u002FArtidoroPagnoni\u002Fstatus\u002F1867601413741981804)  |\n| 9) **Does RLHF Scale?** - This new paper explores the impacts of key components in the RLHF framework. 
Summary of main findings: 1) RLHF doesn't scale as effectively as pretraining in LLMs, with larger policy models benefiting less from RLHF when using a fixed reward model, 2) when increasing the number of responses sampled per prompt during policy training, performance improves initially but plateaus quickly, typically around 4-8 samples, 3) using larger reward models leads to better performance in reasoning tasks, but the improvements can be inconsistent across different types of tasks, and 4) increasing training data diversity for reward models is more effective than increasing response diversity per prompt, but policy training shows diminishing returns after the early stages regardless of additional data.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06000), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1866525606562680954)  |\n| 10) **Granite Guardian** - IBM open-sources Granite Guardian, a suite of safeguards for risk detection in LLMs; the authors claim that With AUC scores of 0.871 and 0.854 on harmful content and RAG-hallucination-related benchmarks respectively, Granite Guardian is the most generalizable and competitive model available in the space. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.07724), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1866852443621036228) |\n\n## Top ML Papers of the Week (December 2 - December 8) - 2024\n| **Paper**  | **Links** | \n| ------------- | ------------- | \n| 1) **OpenAI o1** - a model series trained with large-scale reinforcement learning to reason using chain of thought; o1 shows significant improvements across benchmarks related to math, code, and science; o1 is claimed to be 50% faster in generating thinking steps than o1-preview; results demonstrate that o1 is significantly better at reasoning tasks and produces more comprehensive and reliable responses.  
| [Paper](https:\u002F\u002Fcdn.openai.com\u002Fo1-system-card-20241205.pdf), [Tweet](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1864729936847868192) |\n| 2) **Genie 2** - a foundation world model that generates playable 3D environments from single prompt images, enabling endless training scenarios for AI agents with features like physics simulation, character animation, and object interactions; Genie 2 is trained on video data using a combination of autoencoder and transformer for generating virtual worlds; the model can create real-time interactive environments, with a faster but lower-quality version available for immediate play.  | [Paper](https:\u002F\u002Fdeepmind.google\u002Fdiscover\u002Fblog\u002Fgenie-2-a-large-scale-foundation-world-model), [Tweet](https:\u002F\u002Fx.com\u002FGoogleDeepMind\u002Fstatus\u002F1864367798132039836) |\n| 3) **Reverse Thinking** - shows that training LLMs to learn \"reverse thinking\" helps to improve performance in commonsense, math, and logical reasoning tasks. It claims to outperform a standard fine-tuning method trained on 10x more forward reasoning.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.19865), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1863595518649098371) |\n| 4) **ALAMA** - a new framework that helps language agents automatically learn when to use different mechanisms (ReAct, CoT, Reflection, etc.) for automatically completing tasks, improving on current approaches that use fixed or predefined mechanisms; the framework adaptively activates the appropriate mechanisms according to the potential characteristics of the task; experimental results demonstrate significant improvements in downstream agent tasks, including mathematical reasoning and knowledge-intensive reasoning.  
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00722), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1863956776623747433) |\n| 5) **Auto-RAG**- an autonomous iterative retrieval model with superior performance across many datasets; Auto-RAG is a fine-tuned LLM that leverages the decision-making capabilities of an LLM; it interacts with the retriever through multiturn dialogues, systematically planning retrievals and refining queries to acquire valuable knowledge — it performs this process until sufficient external information is obtained; the authors also show that based on question difficulty, the method can adjust the number of iterations without any human intervention. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.19443), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1863600141103501454) |\n| 6) **GenCast** - an ML weather prediction model that outperforms the world's leading operational weather forecasting system (ECMWF's ENS) in both accuracy and speed; it generates probabilistic 15-day global weather forecasts for over 80 variables in just 8 minutes, with better skill than ENS on 97.2% of evaluated targets; GenCast produces an ensemble of forecasts that better capture uncertainty and predict extreme weather events, tropical cyclone tracks, and wind power production. | [Paper](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41586-024-08252-9),  [Tweet](https:\u002F\u002Fx.com\u002FGoogleDeepMind\u002Fstatus\u002F1864340994965098513)  |\n| 7) **Challenges in Human-Agent Communication** - present a comprehensive analysis of key challenges in human-agent communication, focusing on how humans and AI agents can effectively establish common ground and mutual understanding; identifies 12 core challenges across three categories: conveying information from agents to users, enabling users to communicate information to agents, and general communication challenges that affect all interactions.  
| [Paper](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fuploads\u002Fprod\u002F2024\u002F12\u002FHCAI_Agents.pdf) |\n| 8) **Retrieval-Augmented Reasoning for LLMs** - extends the rStar reasoning framework to enhance reasoning accuracy and factual reliability of LLMs; it leverages a Monte Carlos Tree Search (MCTS) framework with explicit retrieval-augmented reasoning to produce multiple candidate reasoning trajectories; then it leverages a retrieval-augmented factuality scorer to evaluate the factual accuracy of the reasoning trajectories; the trajectory with the highest factuality score is selected as the final answer by the system; on medical reasoning tasks, RARE (which uses Llama 3.1) surpasses larger models such as GPT-4; on commonsense reasoning tasks, RARE outperformed Claude-3.5 Sonnet and GPT-4o-mini, achieving performance competitive with GPT-4o.   | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.02830),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1864687176929431566)  |\n| 9) **DataLab** - a unified business intelligence platform powered by LLM-based agents that integrates task planning, reasoning, and computational notebooks to streamline the entire BI workflow; the system achieves SOTA performance on research benchmarks and demonstrates significant improvements in accuracy and efficiency on real enterprise data from Tencent; achieves up to a 58.58% increase in accuracy and a 61.65% reduction in token cost on enterprise-specific BI tasks.   
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.02205), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1864327307177152619)  |\n| 10) **Procedural Knowledge in Pretraining Drives Reasoning in LLMs** - studies what documents in the pertaining influence model outputs; by looking at the pertaining data, it tries to understand better what kind of generalization strategies LLMs use to perform reasoning tasks; when performing reasoning tasks, it finds that influential documents contain procedural knowledge (e.g., demonstrating how to obtain a solution using formulae or code). | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.12580), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1863590537346925032) |\n\n## Top ML Papers of the Week (November 25 - December 1) - 2024\n| **Paper**  | **Links** | \n| ------------- | ------------- | \n| 1) **LLM Surpass Human Experts in Predicting Neuroscience Results** - proposes BrainBench to study how good LLMs are at predicting experimental outcomes in neuroscience; they tuned an LLM, BrainGPT, on neuroscience literature that surpasses experts in predicting neuroscience results; report that when LLMs indicated high confidence in their predictions, their responses were more likely to be correct. | [Paper](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41562-024-02046-9), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1861781028291190887) |\n| 2) **Fugatto** - a new generative AI sound model (presented by NVIDIA) that can create and transform any combination of music, voices, and sounds using text and audio inputs, trained on 2.5B parameters and capable of novel audio generation like making trumpets bark or saxophones meow.  
| [Paper](https:\u002F\u002Fd1qx31qr3h6wln.cloudfront.net\u002Fpublications\u002FFUGATTO.pdf), [Tweet](https:\u002F\u002Fx.com\u002FNVIDIAAIDev\u002Fstatus\u002F1861052624352825383) |\n| 3) **o1 Replication Journey - Part 2** - shows that combining simple distillation from o1's API with supervised fine-tuning significantly boosts performance on complex math reasoning tasks; a base model fine-tuned on simply tens of thousands of samples o1-distilled long-thought chains outperform o1-preview on the American Invitational Mathematics Examination (AIME).   | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.16489), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1861411844554113276) |\n| 4) **LLM-Brained GUI Agents** - presents a survey of LLM-brained GUI Agents, including techniques and applications.   | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18279), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1862133601040752820) |\n| 5) **High-Level Automated Reasoning** - extends in-context learning through high-level automated reasoning; achieves state-of-the-art accuracy (79.6%) on the MATH benchmark with Qwen2.5-7B-Instruct, surpassing GPT-4o (76.6%) and Claude 3.5 (71.1%); rather than focusing on manually creating high-quality demonstrations, it shifts the focus to abstract thinking patterns; it introduces five atomic reasoning actions to construct chain-structured patterns; then it uses Monte Carlo Tree Search to explore reasoning paths and construct thought cards to guide inference.  
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18478), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1862131336653533584) |\n| 6) **Star Attention: Efficient LLM Inference over Long Sequences** - introduces Star Attention, a two-phase attention mechanism that processes long sequences by combining blockwise-local attention for context encoding with sequence-global attention for query processing and token generation; achieves up to 11x faster inference speeds while maintaining 95-100% accuracy compared to traditional attention mechanisms by efficiently distributing computation across multiple hosts; a key innovation is the \"anchor block\" mechanism, where each context block is prefixed with the first block, enabling effective approximation of global attention patterns while reducing computational overhead.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.17116),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1861854543694406109)  |\n| 7) **Survey on LLM-as-a-Judge** - provides a comprehensive survey of LLM-as-a-Judge, including a deeper discussion on how to build reliable LLM-as-a-Judge systems. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15594),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1861411159913472229)  |\n| 8) **TÜLU 3** - releases a family of fully-open state-of-the-art post-trained models, alongside its data, code, and training recipes, serving as a comprehensive guide for modern post-training techniques.   | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15124),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1861085195950256335)  |\n| 9) **Generative Agent Simulations of 1,000 People** - introduces a new agent architecture that uses LLMs to create behavioral simulations of real individuals, achieving 85% accuracy in replicating human responses on the General Social Survey and reducing demographic biases compared to traditional approaches.  
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10109), [Tweet](https:\u002F\u002Fx.com\u002Fpercyliang\u002Fstatus\u002F1861136757435015580)  |\n| 10) **Measuring Bullshit in Language Games Played by ChatGPT** - proposes that LLM-based chatbots play the ‘language game of bullshit’; by asking ChatGPT to generate scientific articles on topics where it has no knowledge or competence, the authors were able to provide a reference set of how this “bullshit” is manifested.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15129), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1861066315789942978) |\n\n## Top ML Papers of the Week (November 18 - November 24) - 2024\n| **Paper**  | **Links** | \n| ------------- | ------------- | \n| 1) **AlphaQubit** - a new AI-based decoder that sets a state-of-the-art benchmark for identifying errors in quantum computers; using transformer architecture, AlphaQubit demonstrated 6% fewer errors than tensor network methods and 30% fewer errors than correlated matching when tested on the Sycamore data; shows promising results in simulations of larger systems up to 241 qubits; while this represents significant progress in quantum error correction, the system still needs improvements in speed before it can correct errors in real-time for practical quantum computing applications.  | [Paper](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41586-024-08148-8), [Tweet](https:\u002F\u002Fx.com\u002FGoogleDeepMind\u002Fstatus\u002F1859273133234192598) |\n| 2) **The Dawn of GUI Agent** - explores Claude 3.5 computer use capabilities across different domains and software; they also provide an out-of-the-box agent framework for deploying API-based GUI automation models; Claude 3.5 Computer Use demonstrates unprecedented ability in end-to-end language to desktop actions.  
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10323), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1858526493661446553) |\n| 3) **A Statistical Approach to LLM Evaluation** - proposes five key statistical recommendations for a more rigorous evaluation of LLM performance differences. The recommendations include: 1) using the Central Limit Theorem to measure theoretical averages across all possible questions rather than just observed averages; 2) clustering standard errors when questions are related rather than independent; 3) reducing variance within questions through resampling or using next-token probabilities; 4) analyzing paired differences between models since questions are shared across evaluations, and 5) using power analysis to determine appropriate sample sizes for detecting meaningful differences between models; the authors argue that these statistical approaches will help researchers better determine whether performance differences between models represent genuine capability gaps or are simply due to chance, leading to more precise and reliable model evaluations.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.00640), [Tweet](https:\u002F\u002Fx.com\u002FAnthropicAI\u002Fstatus\u002F1858976458330505639) |\n| 4) **Towards Open Reasoning Models for Open-Ended Solutions** - proposes Marco-o1 which is a reasoning model built for open-ended solutions; Marco-o1 is powered by Chain-of-Thought (CoT) fine-tuning, Monte Carlo Tree Search (MCTS), reflection mechanisms, and more recent reasoning strategies; Marco-o1 achieves accuracy improvements of +6.17% on the MGSM (English) dataset and +5.60% on the MGSM (Chinese) dataset.   
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14405), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1860003607606706197) |\n| 5) **LLM-based Agents for Automated Bug Fixing** - analyzes seven leading LLM-based bug fixing systems on the SWE-bench Lite benchmark, finding MarsCode Agent (developed by ByteDance) achieved the highest success rate at 39.33%; reveals that for error localization line-level fault localization accuracy is more critical than file-level accuracy, and bug reproduction capabilities significantly impact fixing success; shows that 24\u002F168 resolved issues could only be solved using reproduction techniques, though reproduction sometimes misled LLMs when issue descriptions were already clear; concludes that improvements are needed in both LLM reasoning capabilities and Agent workflow design to enhance automated bug fixing effectiveness. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10213), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1859964808789135668) |\n| 6) **Cut Your Losses in Large-Vocabulary Language Models** - introduces Cut Cross-Entropy (CCE), a novel method to significantly reduce memory usage during LLM training by optimizing how the cross-entropy loss is computed; currently, the cross-entropy layer in LLM training consumes a disproportionate amount of memory (up to 90% in some models) due to storing logits for all possible vocabulary tokens. 
CCE addresses this by only computing logits for the correct token and evaluating the log-sum-exp over all logits on the fly using flash memory; the authors show that the approach reduces the memory footprint of Gemma 2 from 24GB to just 1MB; the method leverages the inherent sparsity of softmax calculations to skip elements that contribute negligibly to gradients; finally, it demonstrates that CCE achieves this dramatic memory reduction without sacrificing training speed or convergence, enabling larger batch sizes during training and potentially more efficient scaling of LLM training.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.09009) |\n| 7) **BABY-AIGS** - a multi-agent system for automated scientific discovery that emphasizes falsification through automated ablation studies. The system was tested on three ML tasks (data engineering, self-instruct alignment, and language modeling), demonstrating the ability to produce meaningful scientific discoveries. However, the performance is below experienced human researchers.  
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11910v1),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1859656533489188928)  |\n| 8) **Does Prompt Formatting Impact LLM Performance** - examines how different prompt formats (plain text, Markdown, JSON, and YAML) affect GPT model performance across various tasks; finds that GPT-3.5-turbo's performance can vary by up to 40% depending on the prompt format, while larger models like GPT-4 show more robustness to format changes; argues that there is no universally optimal format across models or tasks - for instance, GPT-3.5-turbo generally performed better with JSON formats while GPT-4 preferred Markdown; models from the same family showed similar format preferences, but these preferences didn't transfer well between different model families; suggests that prompt formatting significantly impacts model performance and should be carefully considered when performing prompt engineering and model evaluation, and how to apply it to applications. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10541)  |\n| 9) **FinRobot** - an AI agent framework for equity research that uses a multi-agent Chain-of-Thought prompting, combining data analysis with human-like reasoning to produce professional investment reports comparable to major brokerages; it leverage three agents: a Data-CoT Agent to aggregate diverse data sources for robust financial integration; the Concept-CoT Agent, for analyst’s reasoning to generate actionable insights; and the Thesis-CoT Agent to synthesizes these insights into a coherent investment thesis and report. 
| [Paper](https://arxiv.org/abs/2411.08804)  |
| 10) **Bi-Mamba** - a scalable 1-bit Mamba architecture designed for more efficient LLMs, available in multiple sizes (780M, 1.3B, and 2.7B parameters); Bi-Mamba achieves performance comparable to its full-precision counterparts (e.g., FP16 or BF16); it significantly reduces memory footprint with better accuracy than post-training binarization Mamba baselines. | [Paper](https://arxiv.org/abs/2411.11843) |

## Top ML Papers of the Week (November 11 - November 17) - 2024
| **Paper**  | **Links** | 
| ------------- | ------------- | 
| 1) **Impacts of AI on Innovation** - suggests that top scientists leverage their domain knowledge to prioritize promising AI suggestions, while others waste significant resources testing false positives; finds that implementing AI materials discovery technology leads to substantial increases in productivity, with 44% more materials discovered, 39% more patent filings, and 17% more product innovation; reports that these gains came with concerning tradeoffs, as 82% of scientists reported reduced job satisfaction due to decreased creativity and skill underutilization.  
| [Paper](https://aidantr.github.io/files/AI_innovation.pdf), [Tweet](https://x.com/omarsar0/status/1856424446720127024) |
| 2) **Scaling Laws for Precision** - introduces "precision-aware" scaling laws that predict how model performance is affected by both training and inference precision in LLMs; key findings include: 1) post-training quantization becomes more harmful as models are trained on more data, eventually making additional pretraining actively detrimental, 2) training in lower precision requires increasing model size to maintain performance, and 3) when jointly optimizing model size, data, and precision, the compute-optimal training precision is around 7-8 bits and independent of compute; also reports that when the model size is fixed, compute-optimal precision increases approximately logarithmically with data; the authors validate their predictions on models up to 1.7B parameters trained on up to 26B tokens, showing that both very high (16-bit) and very low (sub-4-bit) training precisions may be suboptimal.  | [Paper](https://arxiv.org/abs/2411.04330), [Tweet](https://x.com/tanishqkumar07/status/1856045600355352753) |
| 3) **Evo** - a 7B parameter AI model designed to understand and generate DNA sequences across multiple biological scales; the model, trained on 2.7 million prokaryotic and phage genomes, can process sequences up to 131 kilobases long while maintaining single-nucleotide resolution, enabling it to understand both molecular-level interactions and genome-wide patterns; Evo demonstrates superior performance in predicting and generating functional DNA, RNA, and protein sequences, including the first successful AI-generated CRISPR-Cas complexes and transposable systems that have been experimentally validated.  
| [Paper](https://www.science.org/doi/10.1126/science.ado9336), [Tweet](https://x.com/arcinstitute/status/1857138107038187945) |
| 4) **OpenCoder** - introduces OpenCoder, a fully open-source LLM specialized for code generation and understanding; the authors identify several critical factors for building high-performing code LLMs: (1) effective data cleaning with code-optimized heuristic rules for deduplication, (2) recall of relevant text corpora related to code, and (3) high-quality synthetic data in both the annealing and supervised fine-tuning stages; OpenCoder surpasses previous fully open models at the 6B+ parameter scale and releases not just the model weights but also the complete training pipeline, datasets, and protocols to enable reproducible research.  | [Paper](https://arxiv.org/abs/2411.04905), [Tweet](https://x.com/omarsar0/status/1857515355595526450) |
| 5) **The Surprising Effectiveness of Test-Time Training for Abstract Reasoning** - explores test-time training (TTT) - updating model parameters temporarily during inference - for improving an LLM's abstract reasoning capabilities using the ARC benchmark; identifies three crucial components: initial fine-tuning on similar tasks, auxiliary task format and augmentations, and per-instance training; TTT significantly improves performance, achieving up to 6x improvement in accuracy compared to base fine-tuned models; when applying TTT to an 8B LLM, they achieve 53% accuracy on ARC's public validation set, improving the state-of-the-art for neural approaches by nearly 25%; by ensembling their method with program generation approaches, they achieve state-of-the-art public validation accuracy of 61.9%, matching average human performance; the findings suggest that explicit symbolic search is not the only path to improved abstract reasoning in LLMs; test-time training applied to continued training on few-shot examples can 
be highly effective.   | [Paper](https://ekinakyurek.github.io/papers/ttt.pdf), [Tweet](https://x.com/akyurekekin/status/1855680785715478546) |
| 6) **A Taxonomy of AgentOps for Enabling Observability of Foundation Model-based Agents** - analyzes AgentOps platforms and tools, highlighting the need for comprehensive observability and traceability features to ensure reliability in foundation model-based autonomous agent systems across their development and production lifecycle.  | [Paper](https://arxiv.org/abs/2411.05285v1),  [Tweet](https://x.com/omarsar0/status/1857400667318702118)  |
| 7) **Toward Optimal Search and Retrieval for RAG** - examines how retrieval affects performance in RAG pipelines for QA tasks; conducts experiments using BGE-base and ColBERT retrievers with LLaMA and Mistral, finding that including more gold (relevant) documents improves QA accuracy; finds that using approximate nearest neighbor search with lower recall only minimally impacts performance while potentially improving speed and memory efficiency; reports that adding noisy or irrelevant documents consistently degrades performance, contradicting previous research claims; concludes that optimizing retrieval of gold documents is crucial for RAG performance, and that operating at lower search accuracy levels can be a viable approach for practical applications. 
| [Paper](https://arxiv.org/abs/2411.07396),  [Tweet](https://x.com/omarsar0/status/1856709865802252710)  |
| 8) **Mitigating LLM Jailbreaks with Few Examples** - introduces a new approach for defending LLMs against jailbreak attacks, focusing on quickly adapting defenses after detecting new attacks rather than aiming for perfect upfront adversarial robustness; using a new benchmark, the most effective method, based on fine-tuning an input classifier, reduced attack success rates by over 240x for known attack types and 15x for novel variations after seeing just one example of each attack strategy; demonstrates that rapidly responding to new jailbreaks can be an effective alternative to traditional static defenses.  | [Paper](https://arxiv.org/abs/2411.07494),  [Tweet](https://x.com/AnthropicAI/status/1856752093945540673)  |
| 9) **Mixture of Transformers** - introduces Mixture-of-Transformers (MoT), a new sparse multi-modal transformer architecture that matches the performance of traditional models while using only about half the computational resources for text and image processing; MoT matches a dense baseline's performance using only 55.8% of the FLOPs.  
| [Paper](https://arxiv.org/abs/2411.04996)  |
| 10) **HtmlRAG** - a novel approach that proposes using HTML instead of plain text as the format for building RAG systems; the key finding is that preserving HTML structure provides richer semantic and structural information compared to plain text conversion, which typically loses important formatting like headings, tables, and semantic tags; to address the challenge of HTML documents being too long for LLM context windows, the authors develop a two-step pruning method: first cleaning unnecessary HTML elements (reducing length by 94%), then using a block-tree-based pruning approach that combines embedding-based and generative pruning to further reduce the content while maintaining important information; experiments across six different QA datasets demonstrate that HtmlRAG outperforms existing plain-text based methods, validating the advantages of preserving HTML structure in RAG systems.  | [Paper](https://arxiv.org/abs/2411.02959v1), [Tweet](https://x.com/omarsar0/status/1857870511302390013) |

## Top ML Papers of the Week (November 4 - November 10) - 2024
| **Paper**  | **Links** | 
| ------------- | ------------- | 
| 1) **Many-agent Simulations toward AI Civilization** - demonstrates how 10-1000+ AI agents behave and progress within agent societies; proposes PIANO, an architecture that enables agents to interact with humans and other agents in real-time; shows that agents can autonomously develop specialized roles, adhere to and change collective rules, and engage in cultural and religious transmissions. 
| [Paper](https://arxiv.org/abs/2411.00114), [Tweet](https://x.com/omarsar0/status/1853290196286021940) |
| 2) **A Comprehensive Survey of Small Language Models** - a survey of small language models (SLMs), discussing definitions, applications, enhancements, reliability, and more.  | [Paper](https://arxiv.org/abs/2411.03350), [Tweet](https://x.com/omarsar0/status/1854532748154695717) |
| 3) **Magentic-One** - a new generalist multi-agent system designed to handle complex web and file-based tasks; it uses an Orchestrator agent that directs four specialized agents: WebSurfer for browser operations, FileSurfer for file management, Coder for programming tasks, and ComputerTerminal for console operations; Magentic-One achieves competitive performance on multiple benchmarks including GAIA, AssistantBench, and WebArena, without requiring modifications to its core architecture. | [Paper](https://www.microsoft.com/en-us/research/publication/magentic-one-a-generalist-multi-agent-system-for-solving-complex-tasks/), [Tweet](https://x.com/omarsar0/status/1854910759232585786) |
| 4) **Mixtures of In-Context Learners** - uses subsets of demonstrations to train experts via in-context learning; given a training set, a trainable weighting function is used to combine the experts' next-token predictions; this approach applies to black-box LLMs since access to the internal parameters of the LLM is not required. Good properties include the following: 1) competitive with standard ICL while being significantly more data-, memory-, and compute-efficient, and 2) resilient to noisy demonstrations and label imbalance.  
| [Paper](https://arxiv.org/abs/2411.02830), [Tweet](https://x.com/omarsar0/status/1854252169492562171) |
| 5) **Attacking Vision-Language Agents via Pop-ups** - shows that integrating adversarial pop-ups into existing agent testing environments leads to an attack success rate of 86%; this decreases the agents' task success rate by 47%; they also add that basic defense techniques (e.g., instructing the agent to ignore pop-ups) are ineffective.  | [Paper](https://arxiv.org/abs/2411.02391), [Tweet](https://x.com/omarsar0/status/1853810252308774955) |
| 6) **Multi-expert Prompting with LLMs** - improves LLM responses by guiding an LLM to fulfill input instructions through simulating multiple experts, then selecting the best response among individual and aggregated views; it achieves a new state-of-the-art on TruthfulQA-Generation with ChatGPT, surpassing the current SOTA of 87.97%; it also improves performance across factuality and usefulness while reducing toxicity and hurtfulness.  | [Paper](https://arxiv.org/abs/2411.00492),  [Tweet](https://x.com/omarsar0/status/1853286452227899851)  |
| 7) **Number Understanding of LLMs** - provides a comprehensive analysis of the numerical understanding and processing ability (NUPA) of LLMs; finds that naive finetuning can substantially improve NUPA on many, but not all, tasks; it also reports that techniques designed to enhance NUPA prove ineffective for finetuning pretrained models; explores chain-of-thought techniques applied to NUPA and suggests that chain-of-thought methods face scalability challenges, making them difficult to apply in practical scenarios.   
| [Paper](https://arxiv.org/abs/2411.03766),  [Tweet](https://x.com/omarsar0/status/1854528742095458337)  |
| 8) **WebRL** - proposes a self-evolving online curriculum RL framework to bridge the gap between open and proprietary LLM-based web agents; it improves the success rate of Llama-3.1-8B from 4.8% to 42.4%, and from 6.1% to 43% for GLM4-9B; the open models significantly surpass the performance of GPT-4-Turbo (17.6%) and GPT-4o (13.9%); the self-evolving curriculum addresses the scarcity of web agent training tasks; this is underpinned by a robust outcome-supervised reward model to evaluate task success; an adaptive RL strategy helps to deal with distribution drift in online learning and ensures consistent improvements.  | [Paper](https://arxiv.org/abs/2411.02337),  [Tweet](https://x.com/omarsar0/status/1853821990177485311)  |
| 9) **Adapting while Learning** - proposes a two-part fine-tuning approach that first helps LLMs learn from tool-generated solutions and then trains them to determine when to solve problems directly versus when to use tools; testing on math, climate science, and epidemiology benchmarks shows significant improvements, with a 28% boost in accuracy and 14% better tool usage precision compared to leading models like GPT-4 and Claude-3.5; the two-stage approach helps the LLM to adaptively solve scientific problems of varying complexity.   | [Paper](https://arxiv.org/abs/2411.00412), [Tweet](https://x.com/omarsar0/status/1853281778594979877)  |
| 10) **Personalization of LLMs** - presents a comprehensive framework for understanding personalized LLMs; introduces taxonomies for different aspects of personalization and unifies existing research across personalized text generation and downstream applications. 
| [Paper](https://arxiv.org/abs/2411.00027), [Tweet](https://x.com/omarsar0/status/1853276249981907386) |

## Top ML Papers of the Week (October 28 - November 3) - 2024
| **Paper**  | **Links** | 
| ------------- | ------------- | 
| 1) **Geometry of Concepts in LLMs** - examines the geometric structure of concept representations in sparse autoencoders (SAEs) at three scales: 1) atomic-level parallelogram patterns between related concepts (e.g., man:woman::king:queen), 2) brain-like functional "lobes" for different types of knowledge like math/code, and 3) galaxy-level eigenvalue distributions showing a specialized structure in middle model layers. | [Paper](https://arxiv.org/abs/2410.19750), [Tweet](https://x.com/tegmark/status/1851288315867041903) |
| 2) **SimpleQA** - a challenging benchmark of 4,326 short factual questions adversarially collected against GPT-4 responses; reports that frontier models like GPT-4o and Claude achieve less than 50% accuracy; finds a positive correlation between the models' stated confidence and their accuracy, signaling that they have some notion of confidence; claims that there is still room to improve the calibration of LLMs in terms of stated confidence.  
| [Paper](https://openai.com/index/introducing-simpleqa/), [Tweet](https://x.com/OpenAI/status/1851680760539025639) |
| 3) **Automating Agentic Workflow Generation** - a novel framework for automating the generation of agentic workflows; it reformulates workflow optimization as a search problem over code-represented workflows, where edges connect LLM-invoking nodes; it efficiently explores the search space using a variant of MCTS, iteratively refining workflows through code modification, tree-structured experience, and execution feedback; experiments across six benchmark datasets demonstrate AFlow’s effectiveness, showing a 5.7% improvement over manually designed methods and a 19.5% improvement over existing automated approaches; AFlow also enables smaller models to outperform GPT-4o on specific tasks at just 4.55% of its inference cost.  | [Paper](https://arxiv.org/abs/2410.10762), [Tweet](https://x.com/omarsar0/status/1852339570891014415) |
| 4) **LLMs Solve Math with a Bag of Heuristics** - uses causal analysis to find neurons that explain an LLM's behavior on basic arithmetic; hypothesizes that a combination of heuristic neurons is the mechanism used to produce correct arithmetic answers; finds that the unordered combination of different heuristic types explains most of the model’s accuracy on arithmetic prompts.   
| [Paper](https://arxiv.org/abs/2410.21272), [Tweet](https://x.com/omarsar0/status/1851233281116946923) |
| 5) **o1 Replication Journey** - reports on an attempt to replicate the capabilities of OpenAI's o1 model; their journey learning technique encourages learning not just shortcuts, but the complete exploration process, including trial and error, reflection, and backtracking; claims that with only 327 training samples, their journey learning technique surpassed shortcut learning by 8.0% on the MATH dataset.   | [Paper](https://arxiv.org/abs/2410.18982), [Tweet](https://x.com/omarsar0/status/1850748790308761988) |
| 6) **Distinguishing Ignorance from Error in LLM Hallucinations** - a method to distinguish between two types of LLM hallucinations: when models lack knowledge (HK-) versus when they hallucinate despite having correct knowledge (HK+); they build model-specific datasets using their proposed approach and show that model-specific datasets are more effective for detecting HK+ hallucinations compared to generic datasets.  | [Paper](https://arxiv.org/abs/2410.22071),  [Tweet](https://x.com/AdiSimhi/status/1851650371615125563)  |
| 7) **Multimodal RAG** - discusses how to best integrate multimodal models into RAG systems for the industrial domain; it also provides an in-depth discussion of how to evaluate these systems using LLM-as-a-Judge. 
| [Paper](https://arxiv.org/abs/2410.21943),  [Tweet](https://x.com/omarsar0/status/1851479149690642456)  |
| 8) **The Role of Prompting and External Tools in Hallucination Rates of LLMs** - tests different prompting strategies and frameworks aimed at reducing hallucinations in LLMs; finds that simpler prompting techniques outperform more complex methods; it reports that LLM agents exhibit higher hallucination rates due to the added complexity of tool usage.   | [Paper](https://arxiv.org/abs/2410.19385),  [Tweet](https://x.com/omarsar0/status/1850745569125253401)  |
| 9) **MrT5** - a more efficient variant of byte-level language models that uses a dynamic token deletion mechanism (via a learned delete gate) to shorten sequence lengths by up to 80% while maintaining model performance; this enables faster inference and better handling of multilingual text without traditional tokenization; MrT5 maintains competitive accuracy with ByT5 on downstream tasks such as XNLI and character-level manipulations while improving inference runtimes.  | [Paper](https://arxiv.org/abs/2410.20771), [Tweet](https://x.com/JulieKallini/status/1851278833061704170)  |
| 10) **Relaxed Recursive Transformers** - introduces a novel approach, Relaxed Recursive Transformer, that significantly reduces LLM size through parameter sharing across layers while maintaining performance; the model is initialized from standard pretrained Transformers, but only uses a single block of unique layers that is repeated multiple times in a loop; then it adds flexibility to the layer tying constraint via depth-wise low-rank adaptation (LoRA) modules; shows that the approach has the potential to lead to significant (2-3×) gains in inference throughput.  
| [Paper](https://arxiv.org/abs/2410.20672), [Tweet](https://x.com/raymin0223/status/1851216039822180759) |


## Top ML Papers of the Week (October 21 - October 27) - 2024
| **Paper**  | **Links** | 
| ------------- | ------------- | 
| 1) **Agentic Information Retrieval** - provides an introduction to agentic information retrieval, which is shaped by the capabilities of LLM agents; discusses cutting-edge applications of agentic information retrieval and open challenges.   | [Paper](https://arxiv.org/abs/2410.09713), [Tweet](https://x.com/omarsar0/status/1848396596230127655) |
| 2) **Aya Expanse** - a family of open-weight foundation models for multilingual capabilities; releases an 8B and a 32B parameter model, along with one of the largest multilingual dataset collections to date, with 513 million examples; the release also includes Aya-101, which the authors claim is the most comprehensive multilingual model, covering 101 languages; Aya Expanse 32B outperforms Gemma 2 27B, Mixtral 8x22B, and Llama 3.1 70B, a model 2x its size.  | [Paper](https://cohere.com/blog/aya-expanse-connecting-our-world), [Tweet](https://x.com/CohereForAI/status/1849435983449587796) |
| 3) **A Theoretical Understanding of CoT** - finds that adding correct and incorrect reasoning paths in demonstrations improves the accuracy of intermediate steps and CoT; the proposed method, Coherent CoT, significantly improves performance on several benchmarks; in the Tracking Shuffled Objects dataset, Gemini Pro shows a 6.60% improvement (from 58.20% to 64.80%), and in Penguins in a Table, DeepSeek 67B demonstrates an increase of 6.17% (from 73.97% to 80.14%).  
| [Paper](https://arxiv.org/abs/2410.16540), [Tweet](https://x.com/omarsar0/status/1849139985712369907) |
| 4) **A Survey on Data Synthesis and Augmentation for LLMs** - provides a comprehensive summary of data generation techniques in the lifecycle of LLMs; includes discussions on data preparation, pre-training, fine-tuning, instruction-tuning, preference alignment, and applications. | [Paper](https://arxiv.org/abs/2410.12896), [Tweet](https://x.com/omarsar0/status/1848445736591163886) |
| 5) **LongRAG** - enhances RAG's understanding of long-context knowledge, which includes global information and factual details; consists of a hybrid retriever, an LLM-augmented information extractor, a CoT-guided filter, and an LLM-augmented generator; these are key components that enable the RAG system to mine global long-context information and effectively identify factual details; LongRAG outperforms long-context LLMs (up by 6.94%), advanced RAG (up by 6.16%), and Vanilla RAG (up by 17.25%).  | [Paper](https://arxiv.org/abs/2410.18050), [Tweet](https://x.com/omarsar0/status/1849494571946066295) |
| 6) **Evaluating Feature Steering in LLMs** - evaluates feature steering in LLMs using an experiment that artificially dials various features up and down to analyze changes in model outputs; it focuses on 29 features related to social biases and studies whether feature steering can help mitigate those biases; among its findings, it reports that feature steering sometimes leads to off-target effects and that a neutrality feature can help decrease social biases in 9 social dimensions without negatively affecting text quality. 
| [Paper](https://www.anthropic.com/research/evaluating-feature-steering),  [Tweet](https://x.com/AnthropicAI/status/1849840131412296039)  |
| 7) **Granite 3.0** - presents lightweight foundation models ranging from 400 million to 8B parameters; supports coding, RAG, reasoning, and function calling, focusing on enterprise use cases, including on-premise and on-device settings; demonstrates strong performance across academic benchmarks for language understanding, reasoning, coding, function calling, and safety. | [Paper](https://github.com/ibm-granite/granite-3.0-language-models/blob/main/paper.pdf),  [Tweet](https://x.com/omarsar0/status/1848404138641527105)  |
| 8) **LLMs Reflect the Ideology of their Creators** - finds that LLMs exhibit diverse ideological stances that reflect the worldview of their creators; finds consistent normative differences between how the same LLM responds in Chinese compared to English; identifies normative disagreements between Western and non-Western LLMs about prominent actors in geopolitical conflicts.  | [Paper](https://arxiv.org/abs/2410.18417),  [Tweet](https://x.com/omarsar0/status/1849860985500352968)  |
| 9) **Scalable Watermarking for LLMs** - proposes SynthID-Text, a text-watermarking scheme that can preserve text quality in LLMs, enable high detection accuracy, and minimize latency overhead; it integrates watermarking with speculative sampling, combining the final pattern of scores for a model’s word choices with the adjusted probability scores; the authors test the feasibility and scalability of the approach by assessing feedback on nearly 10 million Gemini responses. 
| [Paper](https://www.nature.com/articles/s41586-024-08025-4), [Tweet](https://x.com/GoogleDeepMind/status/1849110263871529114)  |
| 10) **Reasoning Patterns of OpenAI’s o1 Model** - when compared with other test-time compute methods, o1 achieved the best performance across most datasets; the authors observe that the most commonly used reasoning patterns in o1 are divide and conquer and self-refinement; o1 uses different reasoning patterns for different tasks; for commonsense reasoning tasks, o1 tends to use context identification and emphasize constraints; for math and coding tasks, o1 mainly relies on method reuse and divide and conquer.  | [Paper](https://arxiv.org/abs/2410.13639), [Tweet](https://x.com/omarsar0/status/1848782378631892997) |


## Top ML Papers of the Week (October 14 - October 20) - 2024
| **Paper**  | **Links** | 
| ------------- | ------------- | 
| 1) **Thinking LLMs** - proposes a training method to equip LLMs with thinking abilities for general instruction-following without human-annotated data; uses an iterative search and optimization procedure to explore thought generation, which enables the model to learn without direct supervision; thought candidates for each user instruction are scored with a judge model; only the responses are evaluated by the judge, which determines the best and worst ones; the corresponding full outputs are then used as chosen and rejected pairs for DPO (referred to as Thought Preference Optimization in this paper); reports superior performance on AlpacaEval and Arena-Hard. 
| [Paper](https://arxiv.org/abs/2410.10630), [Tweet](https://x.com/omarsar0/status/1846227797972603047) |
| 2) **Model Swarms** - proposes a new collaborative search algorithm to adapt LLMs via swarm intelligence; a pool of LLM experts collaboratively move in the weight space and optimize a utility function representing various adaptation objectives; experiments demonstrate that Model Swarms can flexibly adapt LLM experts to a single task, multi-task domains, reward models, as well as diverse human interests; improves over 12 model composition baselines by up to 21.0% across tasks and contexts.  | [Paper](https://arxiv.org/abs/2410.11163), [Tweet](https://x.com/omarsar0/status/1846592954921849029) |
| 3) **First-Person Fairness in Chatbots** - studies first-person fairness, which involves fairness towards users interacting with ChatGPT; specifically, it measures the biases, if any, towards users’ names; it leverages a model powered by GPT-4o to analyze patterns and name-sensitivity in the chatbot’s responses for different user names; claims that, overall, post-training significantly mitigates harmful stereotypes; also reports that open-ended tasks in domains like entertainment and art demonstrate the highest levels of bias (i.e., a tendency to write stories with protagonists whose gender matches the gender inferred from the user’s name). | [Paper](https://cdn.openai.com/papers/first-person-fairness-in-chatbots.pdf), [Tweet](https://x.com/OpenAINewsroom/status/1846238809991925838) |
| 4) **Introspection in LLMs** - reports that LLMs can acquire knowledge through introspection that cannot be inferred from their training data; suggests that LLMs contain privileged information about themselves that can potentially lead to more interpretable and controllable systems; they report that this introspection ability is limited and models 
struggle to predict their behavior on tasks requiring reasoning over long outputs.  | [Paper](https://arxiv.org/abs/2410.13787), [Tweet](https://x.com/omarsar0/status/1847297594525094081) |
| 5) **Janus** - proposes a unified autoregressive framework for multimodal understanding and generation; it decouples visual encoding into independent pathways and leverages a single transformer architecture to improve flexibility and performance on both visual understanding and generation; claims to alleviate the trade-offs in performing the vision tasks that are common in methods relying on a single visual encoder; surpasses previous unified models and matches or exceeds the performance of task-specific models.   | [Paper](https://arxiv.org/abs/2410.13848), [Tweet](https://x.com/deepseek_ai/status/1847191319464300652) |
| 6) **Inference Scaling for Long-Context RAG** - uses two strategies to investigate scaling laws for RAG: demonstration-based RAG (DRAG) and iterative demonstration-based RAG (IterDRAG); finds that RAG performance consistently improves with the expansion of the effective context length under optimal configurations; when optimally allocated, increasing inference computation can lead to linear gains in long-context RAG performance; this leads to the development of a computation allocation model that can provide practical guidance for optimal computation allocation in long-context RAG scenarios.  
| [Paper](https://arxiv.org/abs/2410.04343),  [Tweet](https://x.com/omarsar0/status/1847350506127315088)  |
| 7) **Agent S** - a new open agentic framework that enables autonomous interaction with computers through a GUI; Agent S tackles challenges such as acquiring knowledge, planning over long task horizons, and handling dynamic interfaces; it introduces experience-augmented hierarchical planning, which leverages both search and retrieval; leverages an agent-computer interface to perform reasoning and control GUI agents; evaluation on the OSWorld benchmark shows that Agent S outperforms the baseline by 9.37% in success rate (an 83.6% relative improvement) and achieves a new state-of-the-art.  | [Paper](https://arxiv.org/abs/2410.08164v1),  [Tweet](https://x.com/omarsar0/status/1846930425849303424)  |
| 8) **Model Kinship for Merging LLMs** - proposes model kinship to measure the degree of similarity between LLMs; model kinship is used to build a model merging strategy (Top-k Greedy Merging with Model Kinship) which yields better performance; the authors find that this new criterion can be used to effectively and continuously perform model merging. | [Paper](https://arxiv.org/abs/2410.12613),  [Tweet](https://x.com/omarsar0/status/1846753148007846329)  |
| 9) **On the Planning Abilities of OpenAI’s o1 Models** - reports that o1-preview is particularly strong in self-evaluation and constraint-following; also mentions that these o1 models demonstrate bottlenecks in decision-making and memory management, which are more pronounced in spatial reasoning; in particular, the models produce redundant actions and struggle to generalize in spatially complex tasks. 
| [Paper](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2409.19924), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1846032256902869135)  |\n| 10) **CoTracker3** - proposes a new point tracking model and a new semi-supervised training recipe; enables usage of real videos without annotations during training by generating pseudo-labels using off-the-shelf teachers; the approach is simpler in architecture and training scheme leading to better results while using 1000x less data. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11831), [Tweet](https:\u002F\u002Fx.com\u002FAIatMeta\u002Fstatus\u002F1846595406261899363) |\n\n## Top ML Papers of the Week (October 7 - October 13) - 2024\n| **Paper**  | **Links** | \n| ------------- | ------------- | \n| 1) **MLE-Bench** - proposes a new benchmark for the evaluation of machine learning agents on machine learning engineering capabilities; includes 75 ML engineering-related competition from Kaggle testing on MLE skills such as training models, preparing datasets, and running experiments; OpenAI’s o1-preview with the AIDE scaffolding achieves Kaggle bronze medal level in 16.9% of competitions.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.07095), [Tweet](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1844429536353714427) |\n| 2) **Differential Transformer** - proposes a differential attention mechanism that amplifies attention to the relevant context while canceling noise; Differential Transformer outperforms Transformer when scaling up model size and training tokens; the authors claim that since this architecture gets less \"distracted\" by irrelevant context, it can do well in applications such as long-context modeling, key information retrieval, hallucination mitigation, in-context learning, and reduction of activation outliers.  
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05258), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1843694897020150216) |\n| 3) **Astute RAG** - proposes a novel RAG approach to deal with the imperfect retrieval augmentation and knowledge conflicts of LLMs; Astute RAG adaptively elicits essential information from LLMs' internal knowledge; then it iteratively consolidates internal and external knowledge with source awareness; Astute RAG is designed to better combine internal and external information through an interactive consolidation mechanism (i.e., identifying consistent passages, detecting conflicting information in them, and filtering out irrelevant information).  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.07176), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1844435988019544565) |\n| 4) **ToolGen** - integrates tool knowledge directly into LLMs by representing tools as a unique token which allows the LLM to generate tool calls and arguments, enabling seamless tool invocation and language generation; experimental results with over 47,000 tools show that ToolGen achieves superior results in both tool retrieval and autonomous task completion.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.03439), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1843491766114422930) |\n| 5) **Long-Context LLMs Meet RAG** - finds that for many long-context LLMs, the quality of outputs declines as the number of passages increases; reports that the performance loss is due to retrieved hard negatives; they propose two ways to improve long-context LLM-based RAG: retrieval reordering and RAG-specific tuning with intermediate reasoning to help with relevance identification; that approaches demonstrate significant accuracy and robustness improvements on long-context RAG performance.  
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05983), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1844828836619334066) |\n| 6) **GSM-Symbolic** - tests several SoTA models on a benchmark created with symbolic templates that enable diverse mathematical problems; they find that LLMs exhibit variance when responding to variations of the same questions; the performance of all the models declines by adjusting the numerical values in the question; as questions are made more challenging (e.g., increasing the number of clauses) the performance significantly deteriorates; the authors hypothesize that the observed decline in performance is due to a lack of logical reasoning in current LLMs.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05229),  [Tweet](https:\u002F\u002Fx.com\u002FMFarajtabar\u002Fstatus\u002F1844456880971858028)  |\n| 7) **Optima** - a novel framework to enhance both communication efficiency and task effectiveness in LLM-based multi-agent systems through LLM training; proposes an iterative generate, rank, select, and train paradigm with a reward function to improve performance, token use, and communication efficiency; integrates Monte Carlo Tree Search-inspired techniques for DPO data generation to encourage diverse exploration; shows consistent improvements over single-agent baselines and vanilla MAS based on Llama 3 8B, with 2.8x performance gain with less than 10% tokens on tasks requiring heavy information exchange.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08115),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1844578931732844963)  |\n| 8) **ScienceAgentBench** - a new benchmark to rigorously assess agents built for scientific workflows; after testing it on open-weight and proprietary LLMs, the best-performing agent can only solve 32.4% of the tasks independently and 34.3% with expert-provided knowledge. 
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05080),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1843697964243382586)  |\n| 9) **Addition Is All You Need** - proposes an algorithm that approximates floating point multiplication with integer addition operations; it is less computationally intensive than 8-bit floating point but achieves higher precision; the authors report that applying the purposed L-Mul operation in tensor processing hardware can potentially reduce 95% energy cost by elementwise floating point tensor multiplications and 80% energy cost of dot products. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00907), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1844043652966072742)  |\n| 10) **Persuasion and Anti-social Ability of LLMs** - studies the interaction patterns of LLMs in a multi-agent setting with social hierarchy; the study was done in a specific setting involving a guard and a prisoner who seeks additional yard time or escaping from prison; finds that in the multi-agent setting where power dynamics are involved, the LLMs fail to have a conversation; they also report that agents' personas are critical in driving the behaviors of the agents. In addition, and without explicit prompting, simply assigning agents' roles lead to anti-social behavior.  
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.07109), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1844427182141211054) |\n\n\n## Top ML Papers of the Week (September 30 - October 6) - 2024\n| **Paper**  | **Links** | \n| ------------- | ------------- | \n| 1) **Movie Gen** - a set of foundation models to generate high-quality, 1080p HD videos, including different aspect ratios and synchronized audio; the 30B parameter model supports a context length of 73K video tokens, which enables generation of 16-second videos at 16fps; it also presents a 13B parameter video-to-audio generation model and a novel video editing model that’s attained via post-training; achieves state-of-the-art performance on tasks such as text-to-video synthesis, video personalization, video-to-audio generation and more.  | [Paper](https:\u002F\u002Fai.meta.com\u002Fstatic-resource\u002Fmovie-gen-research-paper), [Tweet](https:\u002F\u002Fx.com\u002FAIatMeta\u002Fstatus\u002F1842188252541043075) |\n| 2) **Were RNNs All We Needed?** - revisits RNNs and shows that by removing the hidden states from input, forget, and update gates RNNs can be efficiently trained in parallel; this is possible because with this change architectures like LSTMs and GRUs no longer require backpropagate through time (BPTT); they introduce minLSTMs and minGRUs that are 175x faster for a 512 sequence length.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01201), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1842246985790914608) |\n| 3) **LLMs Know More Than They Show** - finds that the \"truthfulness\" information in LLMs is concentrated in specific tokens; this insight can help enhance error detection performance and further mitigate some of these issues; they also claim that internal representations can be used to predict the types of errors the LLMs are likely to make.  
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.02707), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1842240840389001381) |\n| 4) **Architecture Search Framework for Inference-Time Techniques** - introduces a modular framework for building and optimizing LLMs by combining multiple inference-time techniques; this approach reframes the challenge of LLM system design as a hyperparameter optimization problem; tested on benchmarks including MT-Bench and CodeContests, Archon surpasses leading models such as GPT-4o and Claude 3.5 Sonnet, achieving a 15.1% average accuracy improvement.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.15254), [Tweet](https:\u002F\u002Fx.com\u002FAzaliamirh\u002Fstatus\u002F1840892626096345530) |\n| 5) **RATIONALYST** - a model for process-supervision of reasoning that enables generalization across diverse reasoning tasks; this process is achieved with pre-training on a collection of 79k rationales from the Pile and a combination of reasoning datasets with minimal human intervention; fine-tuned from LLaMa-3-8B, the proposed model improves the accuracy of reasoning by an average of 3.9% on 7 reasoning benchmarks.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01044) |\n| 6) **An Analysis of o1-preview** - reports that large reasoning models like o1-preview, while improving on more difficult tasks, display similar qualitative trends as previous LLMs; o1 is sensitive to the probability of examples and tasks, performing better and requiring fewer “thinking tokens” in high-probability settings than in low-probability ones.  
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01792),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1841842414157472240)  |\n| 7) **FRAMES** - a unified framework to evaluate an LLM’s ability to provide factual responses, assess retrieval capabilities, and the reasoning required to generate final responses; includes multi-hop questions that require the integration of information from multiple sources; reports that state-of-the-art LLMs struggle on the task and only achieve 40% accuracy with no retrieval; the proposed multi-step retrieval approach improves performance to 66% accuracy.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12941),  [Tweet](https:\u002F\u002Fx.com\u002F_philschmid\u002Fstatus\u002F1840628834275602585)  |\n| 8) **Not All LLM Reasoners Are Created Equal** - investigates in depth the grade-school math problem-solving capabilities of LLMs; reports that LLMs show a significant gap in reasoning; finds that LLMs display a huge performance difference when solving compositional pairs and solving questions independently.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01748),  [Tweet](https:\u002F\u002Fx.com\u002FarianTBD\u002Fstatus\u002F1841875515860517130)  |\n| 9) **Evaluation of o1** - provides a comprehensive evaluation of OpenAI's o1-preview LLM; shows strong performance across many tasks such as competitive programming, generating coherent and accurate radiology reports, high school-level mathematical reasoning tasks, chip design tasks, anthropology and geology, quantitative investing, social media analysis, and many other domains and problems.  
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.18486), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1840953712635732006)  |\n| 10) **Designing Priors for Better Few-Shot Image Synthesis** - training generative models like GAN with limited data is difficult; current Implicit Maximum Likelihood Estimation approaches (IMLE) have an inadequate correspondence between latent code selected for training and those selected during inference; the proposed approach, RS-IMLE, changes the prior distribution for training which improves test-time performance and leads to higher quality image generation. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.17439), [Tweet](https:\u002F\u002Fx.com\u002FKL_Div\u002Fstatus\u002F1841729946302943295) |\n\n## Top ML Papers of the Week (September 23 - September 29) - 2024\n| **Paper**  | **Links** | \n| ------------- | ------------- | \n| 1) **Llama 3.2** - presents small and medium-sized vision LLMs (11B and 90B parameters), and lightweight, text-only models (1B and 3B); the text-only models are trained to support context length of 128K tokens and outperform other models in their class on a range of tasks; vision models exceed other models such as Claude 3 Haiku on image understanding tasks. | [Paper](https:\u002F\u002Fai.meta.com\u002Fblog\u002Fllama-3-2-connect-2024-vision-edge-mobile-devices\u002F), [Tweet](https:\u002F\u002Ftwitter.com\u002FDoctor_Zou\u002Fstatus\u002F1782752058124554272) | \n| 2)  **Molmo**  - presents a family of open, state-of-the-art multimodal AI models; the 72B model in the Molmo family outperforms others in the class of open weight and data models; it also compares favorably against proprietary models like GPT-4o, Claude 3.5, and Gemini 1.5 on several benchmarks. 
| [Paper](https:\u002F\u002Fmolmo.allenai.org\u002Fpaper.pdf), [Tweet](https:\u002F\u002Ftwitter.com\u002Femmanuel_vincze\u002Fstatus\u002F1708249637918752987) | \n| 3) **AlphaChip**  - a reinforcement learning-based method trained to design the physical layout of chips; AlphaChip is reportedly used in three additional generations of Google’s TPU; this release includes an open-source implementation of the method to help pre-train on a variety of chip blocks to apply to new blocks; also releases a model checkpoint pre-trained on 20 TPU blocks. | [Paper](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41586-024-08032-5), [Tweet](https:\u002F\u002Ftwitter.com\u002FGoogleAI\u002Fstatus\u002F1676118998259507200) | \n| 4) **LLMs Still Can’t Plan**  - evaluates whether large reasoning models such as o1 can plan; finds that a domain-independent planner can solve all instances of Mystery Blocksworld but LLMs struggle, even on small instances; o1-preview is effective on the task but tend to degrade in performance as plan length increases, concludes that while o1 shows progress on more challenging planning problems, the accuracy gains cannot be considered general or robust. |  [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.13373), [Tweet](https:\u002F\u002Ftwitter.com\u002Fjohnxschulman\u002Fstatus\u002F1657558270450917378) | \n| 5) **Scaled-up Instructable Model Become Less Reliable**  - suggests that larger and more instructable LLMs may become less reliable; investigates LLMs across three elements: difficulty concordance, task avoidance, and prompting stability; finds that early models often avoid user questions but scaled-up, shaped-up models tend to give an apparently sensible yet wrong answer much more often, including errors on difficult questions that human supervisors frequently overlook. 
|  [Paper](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41586-024-07930-y), [Tweet](https:\u002F\u002Ftwitter.com\u002Frylanmshea\u002Fstatus\u002F1583460628966346752) | \n| 6) **Logic-of-Thought**  - proposes a new prompting technique called Logic-of-Thought (LoT) which employs propositional logic to generate and inject expanded logical information from the input context; it enhances CoT performance on the ReClor dataset by +4.35%; it improves CoT+SelfConsistency’s performance on LogiQA by +5%; it also boosts the performance of ToT on the ProofWriter dataset by +8%.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.17539), [Tweet](https:\u002F\u002Ftwitter.com\u002FIsItPerplexity\u002Fstatus\u002F1704255260019798052) | \n| 7) **RAG and Beyond**  - presents a survey that introduces a RAG task categorization method that helps to classify user queries into four levels according to the type of external data required and the focus of the task; summarizes key challenges in building robust data-augmented LLM applications and the most effective techniques for addressing them. |  [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.14924), [Tweet](https:\u002F\u002Ftwitter.com\u002Fmishigna\u002Fstatus\u002F1703461946958463118) | \n| 8) **A Preliminary Study of o1 in Medicine**  - provides a preliminary exploration of the o1-preview model in medical scenarios; shows that o1 surpasses the previous GPT-4 in accuracy by an average of 6.2% and 6.6% across 19 datasets and two newly created complex QA scenarios; identifies hallucination, inconsistent multilingual ability, and discrepant metrics for evaluation. 
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.15277), [Tweet](https:\u002F\u002Ftwitter.com\u002FRichardEvans_AI\u002Fstatus\u002F1691963090436067397) | \n| 9) **Small Language Models Survey**  - a comprehensive survey on small language models (SLMs) across architectures, training datasets, and training algorithms; analyzes 59 state-of-the-art open-source SLMs and capabilities such as reasoning, in-context learning, maths, and coding; other discussions include on-device runtime costs, latency, memory footprint, and valuable insights.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.15790), [Tweet](https:\u002F\u002Ftwitter.com\u002Fsebatian_ruder\u002Fstatus\u002F1691611318636159002) | \n| 10) **Minstrel**  - a multi-generative agent system with reflection capabilities to automate structural prompt generation; it presents LangGPT, an extensible framework for designing prompts; Minstrel is built on top of LangGPT and experiments demonstrate that structural prompts (either generated by Minstrel or written manually) perform better in guiding LLMs to perform tasks. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.13449), [Tweet](https:\u002F\u002Ftwitter.com\u002FLiZhang1351\u002Fstatus\u002F1702992849091985677) | \n\n\n\n\n## Top ML Papers of the Week (September 16 - September 22) - 2024\n| **Paper**  | **Links** |\n| ------------- | ------------- |\n| 1) **Moshi** - introduces a speech-text foundation model and full-duplex spoken dialogue framework; they present several components of the systems; Helium is a 7B parameter text LLM; Mimi is a semantic-acoustic neural audio code with state-of-the-art performance on audio quality; a hierarchical multi-stream architecture that can generate arbitrary conversation in a speech-to-speech manner. 
| [Paper](https:\u002F\u002Fkyutai.org\u002FMoshi.pdf), [Tweet](https:\u002F\u002Fx.com\u002Fkyutai_labs\u002Fstatus\u002F1836427396959932492) |\n| 2) **Training LLMs to Self-Correct via RL** - develops a multi-turn online reinforcement learning to improve the capabilities of an LLM to self-correct; it’s based entirely on self-generated data; SFT is shown to be ineffective at learning self-correction and suffers from distribution mismatch between training data and model responses; proposes a two-stage approach that first optimizes correction behavior and then uses a reward bonus to amplify self-correction during training; when applied to Gemini 1.0 Pro and 1.5 Flash models, it achieves state-of-the-art self-correction performance, improving the base models’ self-correction by 15.6% and 9.1% respectively on the MATH and HumanEval benchmarks.   | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12917), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1837228446839361984) |\n| 3) **Qwen2.5 Coder** - a series of models including 1.5B and 7B parameters; it’s built upon the Qwen2.5 architecture which is continuously pretrained on 5.5 trillion tokens; achieves state-of-the-art performance across more than 10 benchmarks; includes strong capabilities in code generation, completion, reasoning, and repairing.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12186), [Tweet](https:\u002F\u002Fx.com\u002Fhuybery\u002Fstatus\u002F1837170643563073960) |\n| 4) **Diagram of Thought (DoT)** - enhances the reasoning capabilities of LLMs through mathematical rigor; DAT models iterative reasoning in LLM as the construction of a directed acyclic graph; it integrates propositions, critiques, refinement, and verification into a unified DAG structure; this allows DoT to capture complex logical deduction beyond linear or tree-based approaches.  
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.10038), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1835882277563179512) |\n| 5) **Agents in Software Engineering** - provides a comprehensive overview of frameworks of LLM-based agents in software engineering.   | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.09030), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1835705359723319702) |\n| 6) **To CoT or not to CoT?** - investigates what kinds of tasks benefit the most from chain-of-thought (CoT) prompting; after a meta-analysis on 100+ papers and several evaluations, it finds that CoT produces strong performance benefits primarily on tasks involving math and logic; they find that most of the CoT gain comes from improving symbolic execution, but a symbolic solver outperforms it. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12183),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1836599280477299013)  |\n| 7) **A Comprehensive Evaluation of Quantized Instruction-Tuned LLMs** - evaluates the performance of instruction-tuned LLMs across various quantization methods on models ranging from 7B to 405B; the key findings are 1) quantizing a larger LLM to a similar size as a smaller FP16 LLM generally performs better across most benchmarks, 2) performance varies significantly with different quantization methods, model size, and bit-width, with weight-only methods often yielding better results in larger models, and 3) task difficulty does not significantly impact accuracy degradation due to quantization. 
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.11055),  [Tweet](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.11055)  |\n| 8) **Iteration of Thought** - proposes the Iteration of Thought (IoT) framework to enhance the LLM responses and reasoning capabilities with adaptive reasoning paths; it leverages an inner dialogue agent, acting as a guide, to dynamically adjust reasoning paths which allows adaptive cross-path exploration and enhance response accuracy; it's different from CoT and ToT (both rigid processes) in that its prompt generation is a dynamic process that allows it to adapt. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12618),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1836977595847692671)  |\n| 9) **Schrodinger’s Memory** - uses the Universal Approximation Theorem to explain the memory mechanism of LLMs. It also proposes a new approach to evaluate LLM performance by comparing the memory capacities of different models; the Transformer architecture functions as a dynamic fitting UAT model, with a strong ability to adaptively fit inputs; this enables LLMs to recall entire content based on minimal input information.   | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.10482), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1835882330323554321)  |\n| 10) **Math Jailbreaking Prompts** - uses GPT-4o to generate mathematically encoded prompts that serve as an effective jailbreaking technique; shows an average attack success rate of 73.6% across 13 state-of-the-art; this highlights the inability of existing safety training mechanisms to generalize to mathematically encoded inputs. 
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.11445), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1836603922405806501) |\n\n\n## Top ML Papers of the Week (September 9 - September 15) - 2024\n| **Paper**  | **Links** |\n| ------------- | ------------- |\n| 1) **Learning to Reason with LLMs** - a new family of LLMs trained with reinforcement learning to reason before it responds to complex tasks; it produces a long internal chain of thought and exceeds in science, code, and math-related tasks; ranked in the 49th percentile in the 2024 International Olympiad in Informatics and exceeds human PhD-level accuracy on science-related benchmarks. -  | [Paper](https:\u002F\u002Fopenai.com\u002Findex\u002Flearning-to-reason-with-llms\u002F), [Tweet](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1834278217626317026) |\n| 2) **Chai-1** - a new multi-modal foundation model for molecular structure prediction that can predict proteins, small molecules, DNA, RNA, and more; it achieves state-of-the-art results on a variety of tasks in drug discovery; achieves a 77% success rate on the PoseBusters benchmark (vs. 76% by AlphaFold 3), as well as an Cα LDDT of 0.849 on the CASP15 protein monomer structure prediction set (vs. 0.801 by ESM3-98B).  | [Paper](https:\u002F\u002Fwww.chaidiscovery.com\u002Fblog\u002Fintroducing-chai-1), [Tweet](https:\u002F\u002Fx.com\u002Fjoshim5\u002Fstatus\u002F1833183091776721106) |\n| 3) **Can LLMs Generation Novel Research Ideas** - finds that LLM-generated research ideas are judged as more novel (p \u003C0.05) than human expert ideas; however, they were rated slightly weaker in terms of flexibility; they also report that LLM agents lack diversity in the idea generation process and are not reliable evaluators.  
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.04109), [Tweet](https:\u002F\u002Fx.com\u002FChengleiSi\u002Fstatus\u002F1833166031134806330) |\n| 4) **DataGemma** - includes a series of fine-tuned Gemma 2 models to help LLMs access and incorporate numerical and statistical data; proposes a new approach called Retrieval Interleaved Generation (RIG) which can reliably incorporate public statistical data from Data Commons into LLM responses; RIG is a tool-inspired approach, can interleave statistical tokens with natural language questions suitable for retrieval from Data Commons; to attain such capability, they fine-tune the LLM on an instruction-response dataset generated with the help of Gemini 1.5; the RIG approach improves factuality from 5-7% to about 58%.  | [Paper](https:\u002F\u002Fdocs.datacommons.org\u002Fpapers\u002FDataGemma-FullPaper.pdf), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1834235024675406012) |\n| 5) **Agent Workflow Memory** - introduces Agent Workflow Memory to induce commonly reused workflows and provide these to the agent on demand; works offline and online and is meant to guide the agent's subsequent generations; it’s inspired by how humans learn reusable workflows from past experiences and use them to guide future actions; claims to substantially improve the baseline results by 24.6% and 51.1% relative success rate on Mind2Web and WebArena while doing it in a more efficient way.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.07429), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1834059522198896706) |\n| 6) **The Role of Small Language Models in the LLM Era** - closely examines the relationship between LLMs and SLMs; common applications of SLMs include data curation, training stronger models, efficient inference, evaluators, retrievers, and much more; includes insights for practitioners to better understand the value of these SLMs. 
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06857),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1834063138586829273)  |\n| 7) **LLaMa-Omni** - a model architecture for low-latency speech interaction with LLMs; it is based on Llama-3.1-8B-Instruct and can simultaneously generate both text and speech responses given speech instructions; responses can be generated with a response latency as low as 226ms; architecture-wise, it involves a speech encoder (Whispter-large-v3), a speech adaptor, an LLM, and a speech decoder; they also created a dataset of 200K speech interactions and responses. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06666),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1834227729241440340)  |\n| 8) **Can LLMs Unlock Novel Scientific Research Ideas** - investigates whether LLM can generate novel scientific research ideas; reports that Claude and GPT models tend to align more with the author's perspectives on future research ideas; this is measured across different domains like science, economics, and medicine.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06185),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1833695968656793610)  |\n| 9) **Theory, Analysis, and Best Practices for Sigmoid Self-Attention** - proposes Flash-Sigmoid, a hardware-aware and memory-efficient implementation of sigmoid attention; it yields up to a 17% inference kernel speed-up over FlashAttention-2 on H100 GPUs; show that SigmoidAttn matches SoftwaxAttn in various tasks and domains. 
| [Paper](https://arxiv.org/abs/2409.04431), [Tweet](https://x.com/omarsar0/status/1833522827842220244) |
| 10) **Achieving Peak Performance for LLMs** - a systematic review of methods for improving and speeding up LLMs from three points of view: training, inference, and system serving; summarizes the latest optimization and acceleration strategies around training, hardware, scalability, and reliability. | [Paper](https://arxiv.org/abs/2409.04833), [Tweet](https://x.com/omarsar0/status/1833344402892460364) |

## Top ML Papers of the Week (September 2 - September 8) - 2024
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) **AlphaProteo** - presents a family of ML models trained for protein design; reports 3- to 300-fold better binding affinities and higher experimental success rates compared to other existing methods on seven target proteins; shows that AlphaProteo's performance on hundreds of target proteins from the PDB is comparable to its performance on the seven targets. | [Paper](https://storage.googleapis.com/deepmind-media/DeepMind.com/Blog/alphaproteo-generates-novel-proteins-for-biology-and-health-research/AlphaProteo2024.pdf), [Tweet](https://x.com/GoogleDeepMind/status/1831710991475777823) |
| 2) **RAG in the Era of Long-Context LLMs** - reports that longer-context LLMs suffer from a diminished focus on relevant information, which is one of the primary issues a RAG system addresses (i.e., it surfaces more relevant information); proposes an order-preserving RAG mechanism that improves performance on long-context question answering; it is not perfect: as the number of retrieved chunks increases, response quality first goes up and then declines; the authors identify a sweet spot where RAG achieves better quality with far fewer tokens than long-context LLMs. | [Paper](https://arxiv.org/abs/2409.01666), [Tweet](https://x.com/omarsar0/status/1831389521839267888) |
| 3) **Strategic Chain-of-Thought** - a method to refine LLM performance by incorporating strategic knowledge before the intermediate CoT reasoning steps; the problem-solving strategy helps guide the generation of the CoT paths and final answers; claims to achieve a 21.05% increase on the GSM8K dataset using the Llama3-8b model. | [Paper](https://arxiv.org/abs/2409.03271v1) |
| 4) **Effects of AI on High-Skilled Work** - studies the impact of generative AI on software developers; reveals a 26.08% increase in the number of completed tasks among developers using AI tools like GitHub Copilot; also shows that less experienced developers are more likely to adopt the AI tools and see greater productivity gains. | [Paper](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4945566), [Tweet](https://x.com/emollick/status/1831739827773174218) |
| 5) **OLMoE** - introduces a fully open LLM that leverages a sparse Mixture-of-Experts architecture; OLMoE is a 7B parameter model that uses 1B active parameters per input token; there is also an instruction-tuned version that claims to outperform Llama-2-13B-Chat and DeepSeekMoE 16B. | [Paper](https://arxiv.org/abs/2409.02060), [Tweet](https://x.com/omarsar0/status/1831357563620753577) |
| 6) **LongCite** - synthesizes a large-scale SFT dataset with off-the-shelf LLMs to improve long-context question answering with citations; trains 8B and 9B parameter models that enhance citation generation from lengthy contexts while improving response correctness; claims to even surpass GPT-4o on their proposed LongBench-Cite benchmark. | [Paper](https://arxiv.org/abs/2409.02897), [Tweet](https://x.com/omarsar0/status/1831522905009828051) |
| 7) **MemLong** - utilizes an external retriever to fetch historical information, enhancing the capabilities of long-context LLMs; it consistently outperforms other SoTA LLMs on long-context benchmarks and can extend the context length on a single 3090 GPU from 4k up to 80k tokens. | [Paper](https://arxiv.org/abs/2408.16967), [Tweet](https://x.com/omarsar0/status/1830610367854112799) |
| 8) **Role of RAG Noise in LLMs** - proposes a benchmark (NoiserBench) to measure how different kinds of noisy information affect RAG performance; reports that, of the kinds of beneficial noise studied (e.g., semantic, datatype, and illegal sentence), illegal sentence noise yields the largest improvement in model performance across models and datasets. | [Paper](https://arxiv.org/abs/2408.13533), [Tweet](https://x.com/omarsar0/status/1830984315326660617) |
| 9) **Beyond Preference in AI Alignment** - challenges the dominant practice of AI alignment known as human preference tuning; explains in what ways human preference tuning fails to capture the thick semantic content of human values; argues that AI alignment needs reframing: instead of aligning with human preferences, AI should align with normative standards appropriate to its social roles. | [Paper](https://arxiv.org/abs/2408.16984), [Tweet](https://x.com/xuanalogue/status/1831044533779669136) |
| 10) **LLM-Based Agents for Software Engineering** - a survey paper on LLM-based agents for software engineering, covering perspectives ranging from requirement engineering to test generation to software maintenance. 
| [Paper](https://arxiv.org/abs/2409.02977), [Tweet](https://x.com/omarsar0/status/1832115557749121385) |

## Top ML Papers of the Week (August 26 - September 1) - 2024
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) **GameGen** - a game engine powered by a diffusion model that enables real-time interaction with complex environments over long trajectories; uses a two-phase training process in which an RL agent first learns to play the game and a diffusion model is then trained to generate frames; it can interactively simulate DOOM at over 20 fps on a single TPU. | [Paper](https://arxiv.org/abs/2408.14837), [Tweet](https://x.com/iScienceLuvr/status/1828617875432841490) |
| 2) **Agentic RAG for Time Series Analysis** - proposes an agentic RAG framework for time series analysis; uses a multi-agent architecture where an agent orchestrates specialized sub-agents to complete time-series tasks; the sub-agents leverage tuned small language models and can retrieve relevant prompts containing knowledge about historical patterns and trends, which helps to improve predictions on new data. | [Paper](https://arxiv.org/abs/2408.14484), [Tweet](https://x.com/omarsar0/status/1828838209461043455) |
| 3) **AutoGen Studio** - a low-code interface for rapidly prototyping AI agents; built on top of the AutoGen framework, it can also be used for debugging and evaluating multi-agent workflows. | [Paper](https://arxiv.org/abs/2408.15247), [Tweet](https://x.com/omarsar0/status/1829163090715529358) |
| 4) **Persuasion Games with LLMs** - claims that a multi-agent framework can improve the persuasive efficacy of LLMs; the primary agent engages in persuasive dialogue while auxiliary agents perform key tasks like response analysis and information retrieval; finds that LLMs are capable of creating a perspective change in users and persuading them to make a purchase decision; for instance, sales agents can achieve a 71% positive shift in user perspectives. | [Paper](https://arxiv.org/abs/2408.15879), [Tweet](https://x.com/omarsar0/status/1829156960291185117) |
| 5) **Smaller, Weaker, Yet Better** - finds that weaker but cheaper (WC) models can generate better synthetic data for fine-tuning than stronger but more expensive models; overall, the results suggest that WC models may be a compute-optimal approach for training advanced LLM reasoners. | [Paper](https://arxiv.org/abs/2408.16737), [Tweet](https://x.com/omarsar0/status/1829526629787242878) |
| 6) **Transfusion** - presents a recipe to train multi-modal models over discrete and continuous data; combines next-token prediction with diffusion to train transformer models over mixed-modality sequences; shows that it is possible to scale 7B parameter models to 2T multi-modal tokens and compete in performance with similar-scale diffusion and language models. | [Paper](https://www.arxiv.org/abs/2408.11039), [Tweet](https://x.com/AIatMeta/status/1828836885176967327) |
| 7) **ReMamba** - investigates the long-context capabilities and efficiency of Mamba models; their long-context deficiencies stem from Mamba's RNN-like nature; ReMamba condenses information with the following compression strategy: it selects the top-k hidden states during the first forward pass and leverages Mamba's selective mechanism to incorporate them into the state space during the second forward pass; achieves a 3.2-point improvement over the baseline on LongBench and 1.6 points on L-Eval; the strategy also seems to transfer to Mamba 2. | [Paper](https://arxiv.org/abs/2408.15496), [Tweet](https://x.com/omarsar0/status/1829151312266637813) |
| 8) **Text2SQL is Not Enough** - proposes Table-Augmented Generation (TAG), a unified framework for answering natural language questions over databases; it represents a wider range of previously unexplored interactions between LLMs and databases; develops a benchmark and finds that standard methods answer no more than 20% of queries correctly. | [Paper](https://arxiv.org/abs/2408.14717v1), [Tweet](https://x.com/lianapatel_/status/1828939097487945948) |
| 9) **Foundation Models for Music** - provides a comprehensive overview of state-of-the-art pre-trained models and foundation models in music. | [Paper](https://arxiv.org/abs/2408.14340), [Tweet](https://x.com/omarsar0/status/1828456481114538437) |
| 10) **Guide to Continual Multimodal Pretraining** - a comprehensive guide on continual multimodal pretraining; introduces FoMo-In-Flux, a large-scale, fine-grained, and long-horizon continual pretraining benchmark. 
| [Paper](https://arxiv.org/abs/2408.14471), [Tweet](https://arxiv.org/abs/2408.14471) |

## Top ML Papers of the Week (August 19 - August 25) - 2024
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) **Automated Design of Agentic Systems** - presents Meta Agent Search, a meta agent that iteratively programs and tests new agents based on a growing archive of previous discoveries; claims that with this approach it is possible to learn any possible agentic system, including prompts, tool use, control flows, and more; it rests on three main components: the search space (defines agents), the search algorithm (explores the search space), and the evaluation function (evaluates candidate agents). | [Paper](https://arxiv.org/abs/2408.08435), [Tweet](https://x.com/omarsar0/status/1825378027347271719) |
| 2) **LLM Pruning and Distillation in Practice** - provides a comprehensive report on effective methods for compressing Llama 3.1 and Mistral NeMo models; presents pruning and distillation approaches applied to the original models to produce 4B and 8B parameter models, respectively; before pruning, the teacher model is also fine-tuned on their datasets, leading to better distillation; the compression strategy yields a state-of-the-art 8B model (MN-Minitron-8B) that outperforms all similarly-sized models on common language modeling benchmarks. | [Paper](https://arxiv.org/abs/2408.11796), [Tweet](https://x.com/omarsar0/status/1826676365044675042) |
| 3) **Vizier Gaussian Process Bandit Algorithm** - presents Vizier, an algorithm based on Gaussian process bandit optimization that Google has used for millions of optimizations and research tasks; provides an open-source Python implementation of the Vizier algorithm, including benchmarking results that demonstrate its wide applicability. | [Paper](https://arxiv.org/abs/2408.11527), [Tweet](https://x.com/XingyouSong/status/1826554454084333723) |
| 4) **Language Modeling on Tabular Data** - presents a comprehensive survey of language modeling techniques for tabular data; includes topics such as categorization of tabular data structures and data types, datasets used for model training and evaluation, modeling techniques and training objectives, data processing methods, popular architectures, and challenges and future research directions. | [Paper](https://www.arxiv.org/abs/2408.10548), [Tweet](https://x.com/omarsar0/status/1826094372179366023) |
| 5) **Enhancing Robustness in LLMs** - proposes a two-stage prompting technique to remove irrelevant information from context; it serves as a self-mitigation process that first identifies the irrelevant information and then filters it out; this enhances the model's robustness and leads to better overall performance on reasoning tasks. | [Paper](https://arxiv.org/abs/2408.10615), [Tweet](https://x.com/omarsar0/status/1826451091774447983) |
| 6) **A Comprehensive Overview of GraphRAG Methods** - focuses on techniques applied in the GraphRAG workflow (graph-based indexing, graph-guided retrieval, and graph-enhanced generation); examines tasks, applications, evaluation, and industrial use cases of GraphRAG. | [Paper](https://arxiv.org/abs/2408.08921), [Tweet](https://x.com/omarsar0/status/1825937537782698377) |
| 7) **MagicDec** - shows how speculative decoding can enhance throughput, reduce latency, and maintain accuracy in long-context generation scenarios; finds that as sequence length and batch size increase, bottlenecks shift from compute-bound to memory-bound; using these insights, the authors show it is possible to use speculative decoding more effectively for longer sequences, even with large batch sizes. | [Paper](https://arxiv.org/abs/2408.11049), [Tweet](https://x.com/omarsar0/status/1826090969906778122) |
| 8) **Controllable Text Generation for LLMs** - provides a comprehensive survey of methods for controllable text generation in LLMs; discusses issues like safety, consistency, style, and helpfulness. | [Paper](https://arxiv.org/abs/2408.12599), [Tweet](https://x.com/omarsar0/status/1826824199010132429) |
| 9) **PEDAL** - uses a hybrid self-ensembling approach based on diverse exemplars to improve the overall performance of LLMs; specifically, it uses diverse exemplars to generate multiple candidate responses and then aggregates them with an LLM to produce a final response; this achieves better accuracy than greedy decoding at a lower cost than self-consistency approaches. | [Paper](https://arxiv.org/abs/2408.08869), [Tweet](https://x.com/omarsar0/status/1825373675631071609) |
| 10) **Challenges and Responses in the Practice of LLMs** - curates a set of important questions with insightful answers; the questions span topics such as infrastructure, software architecture, data, applications, and brain science. 
| [Paper](https://arxiv.org/abs/2408.09416), [Tweet](https://x.com/omarsar0/status/1825932441980162374) |

## Top ML Papers of the Week (August 12 - August 18) - 2024
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) **The AI Scientist** - a novel AI agent that can develop and write a full conference-level scientific paper for less than $15; it automates scientific discovery by enabling frontier LLMs to perform independent research and summarize findings; it also uses an automated reviewer to evaluate the generated papers; claims to achieve near-human performance in evaluating paper scores and to produce papers that exceed the acceptance threshold at a top machine learning conference as judged by that automated reviewer. | [Paper](https://arxiv.org/abs/2408.06292), [Tweet](https://x.com/omarsar0/status/1823189280883097788) |
| 2) **Grok-2** - a new frontier model with strong code, math, and reasoning capabilities, released as a large and a small model; outperforms both Claude 3.5 Sonnet and GPT-4-Turbo on the LMSYS Chatbot Arena; claims improved capabilities including instruction following, retrieval, tool use, and factuality; competes with Claude 3.5 Sonnet (June release) and GPT-4o (May release) on MMLU and HumanEval. | [Paper](https://x.ai/blog/grok-2), [Tweet](https://x.com/xai/status/1823597788573098215) |
| 3) **LongWriter** - proposes AgentWrite to enable off-the-shelf LLMs to generate coherent outputs beyond 20K words; in a divide-and-conquer approach, the agent breaks the long generation task into multiple writing subtasks and concatenates the outputs into a final output (i.e., plan + write); this approach is then used to build SFT datasets for tuning LLMs to automatically generate coherent, longer outputs; a 9B parameter model, further improved through DPO, achieves state-of-the-art performance on their benchmark and surpasses proprietary models. | [Paper](https://arxiv.org/abs/2408.07055), [Tweet](https://x.com/omarsar0/status/1823551063946850712) |
| 4) **EfficientRAG** - trains an auto-encoder LM to label and tag chunks; it retrieves relevant chunks, tags them as either <Terminate> or <Continue>, and annotates <Continue> chunks for further processing; a filter model is then trained to formulate the next-hop query based on the original question and previous annotations; this is done iteratively until all chunks are tagged <Terminate> or the maximum number of iterations is reached; once the process has gathered enough information to answer the initial question, a final generator (an LLM) produces the final answer. | [Paper](https://arxiv.org/abs/2408.04259), [Tweet](https://x.com/omarsar0/status/1822744591810114044) |
| 5) **RAGChecker** - a fine-grained evaluation framework for diagnosing the retrieval and generation modules in RAG; shows that RAGChecker correlates better with human judgment; reports several revealing insights and trade-offs in the design choices of RAG architectures. | [Paper](https://arxiv.org/abs/2408.08067), [Tweet](https://x.com/omarsar0/status/1824460245051081216) |
| 6) **HybridRAG** - combines GraphRAG and VectorRAG into a HybridRAG system that outperforms both individually; tested on a set of financial earnings call transcripts; combining the advantages of both approaches provides more accurate answers to queries. | [Paper](https://arxiv.org/abs/2408.04948), [Tweet](https://x.com/omarsar0/status/1822832843455648000) |
| 7) **rStar** - introduces self-play mutual reasoning to improve the reasoning capabilities of small language models (SLMs) without fine-tuning or superior models; MCTS is augmented with human-like reasoning actions, obtained from SLMs, to build richer reasoning trajectories; a separate SLM provides unsupervised feedback on the trajectories and the target SLM selects the final reasoning trajectory as the answer; rStar boosts GSM8K accuracy from 12.51% to 63.91% for LLaMA2-7B and consistently improves the accuracy of other SLMs. | [Paper](https://arxiv.org/abs/2408.06195), [Tweet](https://x.com/AtakanTekparmak/status/1823776878747877572) |
| 8) **Scaling LLM Test-Time Compute Optimally** - investigates the scaling behavior of inference-time computation in LLMs; in particular, analyzes how much an LLM can be improved given a fixed amount of inference-time compute; finds that the effectiveness of different scaling approaches varies with prompt difficulty; proposes an adaptive compute-optimal strategy that improves efficiency by more than 4x compared to a best-of-N baseline; reports that in a FLOPs-matched evaluation, optimally scaling test-time compute can outperform a 14x larger model. 
| [Paper](https://arxiv.org/abs/2408.03314), [Tweet](https://x.com/sea_snell/status/1821263798772363598) |
| 9) **MedGraphRAG** - a graph-based framework for the medical domain focused on enhancing LLMs and generating evidence-based results; leverages a hybrid static-semantic approach to chunk documents and improve context capture; entities and medical knowledge are represented through graphs, leading to an interconnected global graph; this improves precision and outperforms state-of-the-art models on multiple medical Q&A benchmarks. | [Paper](https://arxiv.org/abs/2408.04187), [Tweet](https://x.com/Marktechpost/status/1823069406924288110) |
| 10) **Survey of NL2SQL** - a comprehensive overview of NL2SQL techniques powered by LLMs; covers models, data collection, evaluation methods, and error analysis. | [Paper](https://arxiv.org/abs/2408.05109), [Tweet](https://x.com/_reachsumit/status/1822835969743347815) |

## Top ML Papers of the Week (August 5 - August 11) - 2024
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) **SAM 2** - an open, unified model for real-time, promptable object segmentation in images and videos; can be applied to unseen visual content without custom adaptation; to enable accurate mask prediction in videos, a memory mechanism stores information on the object and previous interactions; the memory module also allows real-time processing of arbitrarily long videos; SAM 2 significantly outperforms previous approaches on interactive video segmentation across 17 zero-shot video datasets while requiring three times fewer human-in-the-loop interactions. 
| [Paper](https://ai.meta.com/research/publications/sam-2-segment-anything-in-images-and-videos/), [Tweet](https://x.com/AIatMeta/status/1818055906179105010) |
| 2) **Structured Generation Limits Reasoning** - investigates whether structured generation can impact an LLM's reasoning and domain knowledge comprehension capabilities; observes a significant decline in reasoning abilities when format restrictions are applied, compared to free-form responses; this degradation is further amplified when stricter format constraints are applied to reasoning tasks. | [Paper](https://arxiv.org/abs/2408.02442), [Tweet](https://x.com/omarsar0/status/1822357786820284555) |
| 3) **From LLMs to LLM-based Agents for Software Engineering** - a survey paper on current practices and solutions for LLM-based agents in software engineering; covers important topics such as requirement engineering, code generation, test generation, and autonomous decision making; also includes benchmarks, metrics, and models used in different software engineering applications. | [Paper](https://arxiv.org/abs/2408.02479), [Tweet](https://x.com/omarsar0/status/1821549401866686604) |
| 4) **Transformer Explainer** - presents an open-source interactive tool for learning about the inner workings of a Transformer model; it runs a GPT-2 instance locally in the user's browser and allows experimenting with your own inputs. | [Paper](https://arxiv.org/abs/2408.04619), [Tweet](https://x.com/omarsar0/status/1821986172215742716) |
| 5) **Enhancing LLMs for RAG** - introduces RAGFoundry, an open-source framework for augmenting LLMs for RAG use cases; it supports data creation, training, inference, and evaluation; one useful application is the creation of data-augmented datasets for tuning and evaluating LLMs in RAG settings. | [Paper](https://arxiv.org/abs/2408.02545), [Tweet](https://x.com/omarsar0/status/1820864003590995973) |
| 6) **Synthesizing Text-to-SQL Data from Weak and Strong LLMs** - proposes integrating synthetic data to build a highly specialized SoTA text-to-SQL model called SENSE; synthetic data from strong models enhances data diversity, while valuable erroneous data from weaker models is combined with an executor to learn from execution feedback; preference learning is used to instruction-tune LLMs to learn from both correct and incorrect samples; SENSE achieves state-of-the-art results on the SPIDER and BIRD benchmarks, bridging the performance gap between open-source models and methods using closed-source models. | [Paper](https://arxiv.org/abs/2408.03256), [Tweet](https://x.com/omarsar0/status/1821227584920621061) |
| 7) **Conversational Prompt Engineering** - proposes an approach to help users create personalized prompts by articulating preferred outputs through interaction; it involves two stages: 1) an initial instruction shaped by the model based on user-provided unlabeled data, and 2) the model shares the output and the user provides feedback with refinements to the outputs and instruction; this iterative process results in a personalized few-shot prompt that performs better on the desired task. 
| [Paper](https://arxiv.org/abs/2408.04560), [Tweet](https://x.com/omarsar0/status/1821981401861718488) |
| 8) **Self-Taught Evaluators** - an approach to improve model-based evaluators using synthetic training data only; it first generates contrasting outputs (good and bad model responses) and trains an LLM-as-a-Judge to produce reasoning traces and final judgments; the self-improvement scheme repeats this training process iteratively using its improved predictions; claims to outperform LLM judges such as GPT-4 and to match top-performing reward models trained on labeled examples; improves a strong LLM (Llama3-70B-Instruct) from 75.4 to 88.3 (88.7 with majority vote) on RewardBench. | [Paper](https://arxiv.org/abs/2408.02666), [Tweet](https://x.com/omarsar0/status/1820849115607044401) |
| 9) **RAGEval** - proposes a simple framework to automatically generate evaluation datasets for assessing the knowledge usage of different LLMs under different scenarios; it defines a schema from seed documents and then generates diverse documents, which lead to question-answering pairs; the QA pairs are based on both the articles and the configurations. | [Paper](https://arxiv.org/abs/2408.01262), [Tweet](https://x.com/omarsar0/status/1820507831491239978) |
| 10) **Survey of Mamba** - provides a systematic review of existing Mamba-based models across domains and tasks; specifically, focuses on advancements of Mamba-based models, techniques for adapting Mamba to diverse data, applications where Mamba excels, and promising research directions. | [Paper](https://arxiv.org/abs/2408.01129), [Tweet](https://x.com/omarsar0/status/1821556218168549561) |

## Top ML Papers of the Week (July 29 - August 4) - 2024
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) **Meta-Rewarding LLMs** - proposes a self-improving alignment technique (no human supervision) where the LLM judges its own judgments and uses the feedback to improve its judgment skills; shows that this LLM-as-a-Meta-Judge approach improves the LLM's ability to judge and follow instructions; merely doing self-improvement to generate better responses (act) saturates quickly; this work instead improves the LLM's ability to judge itself (judge), avoiding issues like reward hacking; in addition to the act and judge roles, a third role, meta-judge, is used to evaluate the model's own judgments. | [Paper](https://arxiv.org/abs/2407.19594), [Tweet](https://x.com/omarsar0/status/1818680848058585119) |
| 2) **MindSearch** - presents an LLM-based multi-agent framework to perform complex web information seeking and integration tasks; a web planner effectively decomposes complex queries, followed by a web searcher that performs hierarchical information retrieval on the Internet to improve the relevance of the retrieved information; the planning component is powered by iterative graph construction, which is used to better model complex problem-solving processes; the multi-agent framework handles long-context problems better by distributing reasoning and retrieval tasks to specialized agents. | [Paper](https://arxiv.org/abs/2407.20183), [Tweet](https://x.com/omarsar0/status/1818673381069226053) |
| 3) **Improved RAG with Self-Reasoning** - presents an end-to-end self-reasoning framework to improve the reliability and traceability of RAG systems; leverages reasoning trajectories generated by the LLM itself; the LLM carries out three processes: 1) relevance-aware: judges the relevance between the retrieved documents and the question; 2) evidence-aware selective: chooses and cites relevant documents, then automatically selects snippets of key sentences as evidence from the cited documents; and 3) trajectory analysis: generates a concise analysis based on all the self-reasoning trajectories from the previous two processes and provides the final inferred answer; this helps the model be more selective and distinguish relevant from irrelevant documents, improving the accuracy of the overall RAG system; the framework achieves performance comparable to GPT-4 with only 2K training samples (generated by GPT-4). 
| [Paper](https://arxiv.org/abs/2407.19813), [Tweet](https://x.com/omarsar0/status/1818139150882664696) |
| 4) **Constrained-CoT** - limits the model's reasoning output length without sacrificing performance; shows that constraining the reasoning of LLaMA2-70b to 100 words improves accuracy on GSM8K from 36.01% (CoT) to 41.07% (CCoT) while reducing the average output length by 28 words. | [Paper](https://arxiv.org/abs/2407.19825), [Tweet](https://x.com/omarsar0/status/1818133220484898992) |
| 5) **Adaptive RAG for Conversational Systems** - develops a gating model that predicts whether a conversational system requires RAG to improve its responses; shows that RAG-based conversational systems have the potential to generate high-quality responses with high generation confidence; also claims to identify a correlation between the generation confidence level and the relevance of the augmented knowledge. | [Paper](https://arxiv.org/abs/2407.21712), [Tweet](https://x.com/omarsar0/status/1818843407977959756) |
| 6) **ShieldGemma** - offers a comprehensive suite of LLM-based safety content moderation models built on Gemma 2; includes classifiers for key harm types such as dangerous content, toxicity, hate speech, and more. | [Paper](https://arxiv.org/abs/2407.21772), [Tweet](https://x.com/omarsar0/status/1818837753292853349) |
| 7) **Evaluating Persona Agents** - proposes a benchmark to evaluate persona agent capabilities in LLMs; finds that Claude 3.5 Sonnet shows only a 2.97% relative improvement in PersonaScore compared to GPT-3.5, despite being a much more advanced model. | [Paper](https://arxiv.org/abs/2407.18416), [Tweet](https://x.com/omarsar0/status/1817964944949739544) |
| 8) **Machine Unlearning Survey** - provides a comprehensive survey on machine unlearning in generative AI. | [Paper](https://arxiv.org/abs/2407.20516), [Tweet](https://x.com/omarsar0/status/1818476462262906985) |
| 9) **ThinK** - proposes an approach to address inefficiencies in KV cache memory consumption; focuses on long-context scenarios and the inference side; presents a query-dependent KV cache pruning method that minimizes attention weight loss while selectively pruning the least significant channels. | [Paper](https://arxiv.org/abs/2407.21018), [Tweet](https://x.com/omarsar0/status/1818474655461621903) |
| 10) **The Art of Refusal** - a survey of current methods used to achieve refusal in LLMs; provides evaluation benchmarks and metrics used to measure abstention in LLMs. | [Paper](https://arxiv.org/abs/2407.18418), [Tweet](https://x.com/omarsar0/status/1817961056465035596) |

## Top ML Papers of the Week (July 22 - July 28) - 2024
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) **Llama 3.1** - a collection of LLMs that includes 8B, 70B, and 405B parameter models; supports eight languages and extends the context window to 128K tokens; performs competitively, and in some cases outperforms, state-of-the-art models across capabilities like general knowledge, math reasoning, and tool use. | [Paper](https://scontent.fbze2-1.fna.fbcdn.net/v/t39.2365-6/452387774_1036916434819166_4173978747091533306_n.pdf?_nc_cat=104&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=t6egZJ8QdI4Q7kNvgHPkimJ&_nc_ht=scontent.fbze2-1.fna&oh=00_AYCV8TJ9rZquHu-nvz4-TFSZXLmCjer_LVQTms1bFpzHpA&oe=66A5D24D), [Tweet](https://x.com/AIatMeta/status/1815766327463907421) |
| 2) **AlphaProof & AlphaGeometry 2** - solved 4 out of 6 problems in this year's IMO, the equivalent of a silver-medal score; AlphaProof consists of a Gemini model that automatically translates natural language problem statements into formal statements (i.e., a formalizer network); a solver network then searches for proofs/disproofs and progressively trains itself using AlphaZero to learn to solve ever more complex problems; AlphaGeometry 2, a neuro-symbolic hybrid system, proved the geometry problem; it is based on the Gemini model and trained from scratch on large amounts of synthetic data. | [Paper](https://deepmind.google/discover/blog/ai-solves-imo-problems-at-silver-medal-level/), [Tweet](https://x.com/JeffDean/status/1816498336171753948) |
| 3) **RAG vs. Long-Context LLMs** - compares RAG and long-context (LC) LLMs and finds that LC LLMs outperform RAG on average while RAG is significantly less expensive; proposes Self-Route, which leverages self-reflection to route queries to RAG or LC; reports that Self-Route significantly reduces computational cost while maintaining performance comparable to LC. | [Paper](https://arxiv.org/abs/2407.16833), [Tweet](https://x.com/omarsar0/status/1816495687984709940) |
| 4) **OpenDevin** - presents a platform for developing generalist agents that interact with the world through software; features include 1) an interaction mechanism between agents, interfaces, and environments, 2) an environment comprising a sandboxed operating system and web browser available to the agents, 3) an interface to create and execute code, 4) multi-agent support, and 5) an evaluation framework. | [Paper](https://arxiv.org/abs/2407.16741), [Tweet](https://x.com/omarsar0/status/1816872317286281688) |
| 5) **LazyLLM** - introduces a novel dynamic token pruning method for efficient long-context LLM inference; it can accelerate the prefilling stage of a Llama 2 7B model by 2.34x while maintaining high accuracy; it selectively computes the KV for tokens that are important for the next-token prediction in both the prefilling and decoding stages; it allows language models to dynamically select different subsets of tokens from the context at different generation steps, even if they were pruned in previous steps. | [Paper](https://arxiv.org/abs/2407.14057), [Tweet](https://x.com/omarsar0/status/1815225416409309264) |
| 6) **Teaching LLM Agents to Self-Improve** - shows it is possible to iteratively fine-tune LLMs to improve their own responses over multiple turns with additional environment feedback; the LLM learns to recursively detect and correct its previous mistakes in subsequent iterations; improves the self-improvement abilities of 7B models on reasoning tasks (GSM8K and MATH), attaining an improvement over turns that is unseen in strong proprietary models. | [Paper](https://arxiv.org/abs/2407.18219), [Tweet](https://x.com/omarsar0/status/1816671382585114855) |
| 7) **Text-to-SQL Survey** - provides a survey on employing LLMs for text-to-SQL tasks, including prompt engineering techniques, fine-tuning methods, benchmarks, and more. | [Paper](https://arxiv.org/abs/2407.15186), [Tweet](https://x.com/omarsar0/status/1815599057974223015) |
| 8) **MINT-1T** - open-sources a large-scale multimodal interleaved dataset consisting of 1 trillion tokens and 3.4 billion images; it also includes new sources such as PDFs and ArXiv papers. | [Paper](https://arxiv.org/abs/2406.11271), [Tweet](https://x.com/omarsar0/status/1816250935930142834) |
| 9) **Model Collapse on Synthetic Data** - investigates the effects of training models on recursively generated data; finds that training on model-generated content can cause irreversible defects in which the original content distribution disappears; shows that this effect, referred to as model collapse, occurs in LLMs, VAEs, and GMMs; while tested on smaller-scale models (~100M params), the authors suggest the effect is highly likely to transfer to larger models over time. | [Paper](https://www.nature.com/articles/s41586-024-07566-y), [Tweet](https://x.com/alexandr_wang/status/1816491442069782925) |
| 10) **Mitigating Hallucination via Generation Constraint** - proposes a new training-free approach to mitigate hallucination in LLMs; scales the readout vector that constrains generation in a memory-augmented LLM decoder; recent works claim that LLMs with explicit memory mechanisms hallucinate less; this work uses a memory-augmented LLM and constrains generation in the decoder by applying lightweight memory primitives to reduce hallucination. 
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.16908), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1816491986209104104) |\n\n\n## Top ML Papers of the Week (July 15 - July 21) - 2024\n| **Paper**  | **Links** |\n| ------------- | ------------- |\n| 1) **Improving Legibility of LLM Outputs** - iteratively trains small verifiers to predict solution correctness, helpful provers to produce correct solutions accepted by the verifier, and sneaky provers that produce incorrect solutions that fool the verifier; this process helps train models that can produce text that is correct and easy to understand by both humans and AI systems which leads to more trustworthy systems.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.13692), [Tweet](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1813623470452064432) |\n| 2) **SpreadsheetLLM** - presents an efficient encoding method to optimize an LLM’s understanding and reasoning capability on spreadsheets; develops a sheet compressor consisting of structural-anchor-based compression, inverse index translation, and data-format-aware aggregation modules to efficiently compress and encode spreadsheets; in GPT-4’s in-context learning, it improves performance in spreadsheet table detection by 25.6%.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.09025), [Tweet](https:\u002F\u002Fx.com\u002F_akhaliq\u002Fstatus\u002F1812674543963578794) |\n| 3) **Context Embeddings for Efficient Answer Generation in RAG** - proposes an effective context compression method to reduce long context and speed up generation time in RAG systems; the long contexts are compressed into a small number of context embeddings which allow different compression rates that trade-off decoding time for generation quality; reduces inference time by up to 5.69 × and GFLOPs by up to 22 × while maintaining high performance. 
| [Paper](http:\u002F\u002Farxiv.org\u002Fabs\u002F2407.09252), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1812937765769867561) |\n| 4) **Weak-to-Strong Reasoning** - demonstrates the use of weak supervision to elicit strong reasoning capabilities in LLMs without relying on human annotations or advanced models; reports that strong models can automatically refine their training data without explicitly being trained to do so; enables expanding a model's learning scope and scaling performance on reasoning. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.13647), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1814130275485704597) |\n| 5) **A Survey of Prompt Engineering Methods in LLMs** - a collection of prompt engineering methods for a variety of NLP tasks.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.12994), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1814135222562165104) |\n| 6) **Does Refusal Training in LLMs Generalize to the Past Tense?** - finds that simply reformulating an LLM request into past tense can jailbreak many state-of-the-art LLMs; for example \"How to make a Molotov cocktail?\" can be rephrased as \"How did people make a Molotov cocktail?\"; finds that the success rate of such requests can increase from 1% to 88% using direct requests on GPT-4o; concludes that current alignment techniques may not always generalize as intended.  
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.11969),  [Tweet](https:\u002F\u002Fx.com\u002Fmaksym_andr\u002Fstatus\u002F1813608842699079750)  |\n| 7) **Can LLMs Do Retrieval and Reasoning in 1 Million Context Window?** - proposes a framework (NeedleBench) of progressively challenging tasks to assess the long-context retrieval and reasoning capabilities of LLMs; they also present the Ancestral Trace Challenge that increases the need for complex logical reasoning which is common in real-world long-context tasks; their findings suggest that current LLMs struggle to handle reasoning tasks with complex logical relationships, even with texts shorter than 2K tokens.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.11963),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1813581074624070109)  |\n| 8) **Distilling System 2 into System 1** - investigates self-supervised methods to distill high-quality outputs from System 2 techniques and then fine-tune System 1 to match the predictions of the System 2 technique but without generating intermediate steps; the process of distilling reasoning into System 1 results in less inference cost.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.06023v1),  [Tweet](https:\u002F\u002Fx.com\u002Fwillccbb\u002Fstatus\u002F1813012865454121179)  |\n| 9) **Exploring Advanced LLMs with LLMSuite** - shares practical tips for developing with and evaluating LLMs; solutions covered range from ReAct to RAG to parameter-efficient methods. | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.12036), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1813980712346763589)  |\n| 10) **Beyond Euclid** - provides an illustrated guide and graphical taxonomy of recent advances in non-Euclidean machine learning. 
| [Paper](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2407.09468), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1812927886766010653) |\n\n\n## Top ML Papers of the Week (July 8 - July 14) - 2024\n| **Paper**  | **Links** |\n| ------------- | ------------- |\n| 1) **FlashAttention-3** - proposes to adapt FlashAttention to take advantage of modern hardware; the techniques used to speed up attention on modern GPUs include producer-consumer asynchrony, interleaving block-wise matmul and softmax operations, and block quantization and incoherent processing; achieves speedup on H100 GPUs by 1.5-2.0x with FP16 reaching up to 740 TFLOPs\u002Fs (75% utilization), and with FP8 reaching close to 1.2 PFLOPs\u002Fs. | [Paper](https:\u002F\u002Ftridao.me\u002Fpublications\u002Fflash3\u002Fflash3.pdf), [Tweet](https:\u002F\u002Fx.com\u002Ftri_dao\u002Fstatus\u002F1811453622070444071) |\n| 2) **RankRAG** - introduces a new instruction fine-tuning framework to perform effective context ranking and answering generation to enhance an LLM’s RAG capabilities; it leverages a small ranking dataset to outperform existing expert ranking models; shows that a Llama3-RankRAG significantly outperforms Llama3-ChatQA-1.5 and GPT-4 models on nine knowledge-intensive benchmarks.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.02485v1), [Tweet](https:\u002F\u002Fx.com\u002F_weiping\u002Fstatus\u002F1808551184309104896) |\n| 3) **Mixture of A Million Experts** - introduces a parameter-efficient expert retrieval mechanism that leverages the product key technique for sparse retrieval from a million tiny experts; it attempts to decouple computational cost from parameter count by efficiently routing to a very large number of tiny experts through a learned index structure used for routing; demonstrates superior efficiency compared to dense FFW, coarse-grained MoEs, and Product Key Memory (PKM) layers.  
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.04153), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1810389538340290724) |\n| 4) **Reasoning in LLMs: A Geometric Perspective** - explores the reasoning of LLMs from a geometrical perspective; reports that a higher intrinsic dimension implies greater expressive capacity of the LLM; reports that they establish a connection between the expressive power of LLMs and the density of their self-attention graphs; their analysis demonstrates that the density of these graphs defines the intrinsic dimension of the inputs to the MLP blocks.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.02678), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1810329294884741594) |\n| 5) **Contextual Hallucinations Mitigation in LLMs** - proposes a new method that detects and significantly reduces contextual hallucinations in LLMs (e.g., reduces by 10% in the XSum summarization task); builds a hallucination detection model based on input features given by the ratio of attention weights on the context vs. newly generated tokens (for each attention head); the hypothesis is that contextual hallucinations are related to the extent to which an LLM attends to the provided contextual information; they also propose a decoding strategy based on their detection method which mitigates the contextual hallucination; the detector can also be transferred across models without the need for retraining. 
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.07071), [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1811072508637884750) |\n| 6) **RouteLLM** - proposes efficient router models to dynamically select between stronger and weak LLMs during inference to achieve a balance between cost and performance; the training framework leverages human preference data and data augmentation techniques to boost performance; shows to significantly reduce costs by over 2x in certain cases while maintaining the quality of responses.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.18665v2),  [Tweet](https:\u002F\u002Fx.com\u002Flmsysorg\u002Fstatus\u002F1807812671238258931)  |\n| 7) **A Survey on Mixture of Experts** - a survey paper on Mixture of Experts (MoE), including the technical details of MoE, open-source implementations, evaluation techniques, and applications of MoE in practice.  | [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.06204),  [Tweet](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1811127876819026283)  |\n| 8) **Internet of Agents** - a new framework to address several limitations in multi-agent frameworks such as integrating diverse third-party agents and adaptability to dynamic task requirements; introduces an agent integration protocol, instant messaging architecture design, and dynamic mechanisms for effective collaboration among heterogeneous agents.  
| [Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.07061v2),  [Tweet](https:\u002F\u002Fx.com\u002F_akhaliq\u002Fstatus\u002F1810872693501157855)  |\n| 9) **3DGen** - a new pipeline for end-to-end text-to-3D asset generation in under a minute; integrates state-of-the-art components like AssetGen and TextureGen to represent 3D objects in three ways, namely v","# 本周机器学习论文\n\n[订阅我们的新闻通讯](https:\u002F\u002Fnlpnews.substack.com\u002F)，每周即可在邮箱中收到精选的机器学习论文列表。\n\n在 DAIR.AI，我们非常喜欢阅读机器学习论文，因此创建了这个仓库，用于每周精选顶级机器学习论文。\n\n以下是每周系列：\n\n\n\n以下是每周系列：\n\n## 2025年\n- [本周顶级机器学习论文（6月23日至6月29日）](.\u002F#top-ml-papers-of-the-week-june-23---june-29---2025)\n- [本周顶级机器学习论文（6月16日至6月22日）](.\u002F#top-ml-papers-of-the-week-june-16---june-22---2025)\n- [本周顶级机器学习论文（6月9日至6月15日）](.\u002F#top-ml-papers-of-the-week-june-9---june-15---2025)\n- [本周顶级机器学习论文（6月2日至6月8日）](.\u002F#top-ml-papers-of-the-week-june-2---june-8---2025)\n- [本周顶级机器学习论文（5月26日至6月1日）](.\u002F#top-ml-papers-of-the-week-may-26---june-1---2025)\n- [本周顶级机器学习论文（5月19日至5月25日）](.\u002F#top-ml-papers-of-the-week-may-19---may-25---2025)\n- [本周顶级机器学习论文（5月12日至5月18日）](.\u002F#top-ml-papers-of-the-week-may-12---may-18---2025)\n- [本周顶级机器学习论文（5月5日至5月11日）](.\u002F#top-ml-papers-of-the-week-may-5---may-11---2025)\n- [本周顶级机器学习论文（4月28日至5月4日）](.\u002F#top-ml-papers-of-the-week-april-28---may-4---2025)\n- [本周顶级机器学习论文（4月21日至4月27日）](.\u002F#top-ml-papers-of-the-week-april-21---april-27---2025)\n- [本周顶级机器学习论文（4月14日至4月20日）](.\u002F#top-ml-papers-of-the-week-april-14---april-20---2025)\n- [本周顶级机器学习论文（4月7日至4月13日）](.\u002F#top-ml-papers-of-the-week-april-7---april-13---2025)\n- [本周顶级机器学习论文（3月31日至4月6日）](.\u002F#top-ml-papers-of-the-week-march-31---april-6---2025)\n- [本周顶级机器学习论文（3月24日至3月30日）](.\u002F#top-ml-papers-of-the-week-march-24---march-30---2025)\n- [本周顶级机器学习论文（3月17日至3月23日）](.\u002F#top-ml-papers-of-the-week-march-17---march-23---2025)\n- [本周顶级机器学习论文（3月10日至3月16日）](.\u002F#top-ml-papers-of-the-week-march-10---march-16---2025)\n- 
[Top ML Papers of the Week (March 3 - March 9)](./#top-ml-papers-of-the-week-march-3---march-9---2025)
- [Top ML Papers of the Week (February 24 - March 2)](./#top-ml-papers-of-the-week-february-24---march-2---2025)
- [Top ML Papers of the Week (February 17 - February 23)](./#top-ml-papers-of-the-week-february-17---february-23---2025)
- [Top ML Papers of the Week (February 10 - February 16)](./#top-ml-papers-of-the-week-february-10---february-16---2025)
- [Top ML Papers of the Week (February 3 - February 9)](./#top-ml-papers-of-the-week-february-3---february-9---2025)
- [Top ML Papers of the Week (January 27 - February 2)](./#top-ml-papers-of-the-week-january-27---february-2---2025)
- [Top ML Papers of the Week (January 20 - January 26)](./#top-ml-papers-of-the-week-january-20---january-26---2025)
- [Top ML Papers of the Week (January 13 - January 19)](./#top-ml-papers-of-the-week-january-13---january-19---2025)
- [Top ML Papers of the Week (January 6 - January 12)](./#top-ml-papers-of-the-week-january-6---january-12---2025)

## 2024
- [Top ML Papers of the Week (December 30 - January 5)](./#top-ml-papers-of-the-week-december-30---january-5---2025)
- [Top ML Papers of the Week (December 23 - December 29)](./#top-ml-papers-of-the-week-december-23---december-29---2024)
- [Top ML Papers of the Week (December 16 - December 22)](./#top-ml-papers-of-the-week-december-16---december-22---2024)
- [Top ML Papers of the Week (December 9 - December 15)](./#top-ml-papers-of-the-week-december-9---december-15---2024)
- [Top ML Papers of the Week (December 2 - December 8)](./#top-ml-papers-of-the-week-december-2---december-8---2024)
- [Top ML Papers of the Week (November 25 - December 1)](./#top-ml-papers-of-the-week-november-25---december-1---2024)
- [Top ML Papers of the Week (November 18 - November 24)](./#top-ml-papers-of-the-week-november-18---november-24---2024)
- [Top ML Papers of the Week (November 11 - November 17)](./#top-ml-papers-of-the-week-november-11---november-17---2024)
- [Top ML Papers of the Week (November 4 - November 10)](./#top-ml-papers-of-the-week-november-4---november-10---2024)
- [Top ML Papers of the Week (October 28 - November 3)](./#top-ml-papers-of-the-week-october-28---november-3---2024)
- [Top ML Papers of the Week (October 21 - October 27)](./#top-ml-papers-of-the-week-october-21---october-27---2024)
- 
[Top ML Papers of the Week (October 14 - October 20)](./#top-ml-papers-of-the-week-october-14---october-20---2024)
- [Top ML Papers of the Week (October 7 - October 13)](./#top-ml-papers-of-the-week-october-7---october-13---2024)
- [Top ML Papers of the Week (September 30 - October 6)](./#top-ml-papers-of-the-week-september-30---october-6---2024)
- [Top ML Papers of the Week (September 23 - September 29)](./#top-ml-papers-of-the-week-september-23---september-29---2024)
- [Top ML Papers of the Week (September 16 - September 22)](./#top-ml-papers-of-the-week-september-16---september-22---2024)
- [Top ML Papers of the Week (September 9 - September 15)](./#top-ml-papers-of-the-week-september-9---september-15---2024)
- [Top ML Papers of the Week (September 2 - September 8)](./#top-ml-papers-of-the-week-september-2---september-8---2024)
- [Top ML Papers of the Week (August 26 - September 1)](./#top-ml-papers-of-the-week-august-26---september-1---2024)
- [Top ML Papers of the Week (August 19 - August 25)](./#top-ml-papers-of-the-week-august-19---august-25---2024)
- [Top ML Papers of the Week (August 12 - August 18)](./#top-ml-papers-of-the-week-august-12---august-18---2024)
- [Top ML Papers of the Week (August 5 - August 11)](./#top-ml-papers-of-the-week-august-5---august-11---2024)
- [Top ML Papers of the Week (July 29 - August 4)](./#top-ml-papers-of-the-week-july-29---august-4---2024)
- [Top ML Papers of the Week (July 22 - July 28)](./#top-ml-papers-of-the-week-july-22---july-28---2024)
- [Top ML Papers of the Week (July 15 - July 21)](./#top-ml-papers-of-the-week-july-15---july-21---2024)
- [Top ML Papers of the Week (July 8 - July 14)](./#top-ml-papers-of-the-week-july-8---july-14---2024)
- [Top ML Papers of the Week (July 1 - July 7)](./#top-ml-papers-of-the-week-july-1---july-7---2024)
- [Top ML Papers of the Week (June 24 - June 30)](./#top-ml-papers-of-the-week-june-24---june-30---2024)
- [Top ML Papers of the Week (June 17 - June 23)](./#top-ml-papers-of-the-week-june-17---june-23---2024)
- [Top ML Papers of the Week (June 10 - June 16)](./#top-ml-papers-of-the-week-june-10---june-16---2024)
- [Top ML Papers of the Week (June 3 - June 9)](./#top-ml-papers-of-the-week-june-3---june-9---2024)
- [Top ML Papers of the Week (May 27 - June 2)](./#top-ml-papers-of-the-week-may-27---june-2---2024)
- 
[Top ML Papers of the Week (May 20 - May 26)](./#top-ml-papers-of-the-week-may-20---may-26---2024)
- [Top ML Papers of the Week (May 13 - May 19)](./#top-ml-papers-of-the-week-may-13---may-19---2024)
- [Top ML Papers of the Week (May 6 - May 12)](./#top-ml-papers-of-the-week-may-6---may-12---2024)
- [Top ML Papers of the Week (April 29 - May 5)](./#top-ml-papers-of-the-week-april-29---may-5---2024)
- [Top ML Papers of the Week (April 22 - April 28)](./#top-ml-papers-of-the-week-april-22---april-28---2024)
- [Top ML Papers of the Week (April 15 - April 21)](./#top-ml-papers-of-the-week-april-15---april-21---2024)
- [Top ML Papers of the Week (April 8 - April 14)](./#top-ml-papers-of-the-week-april-8---april-14---2024)
- [Top ML Papers of the Week (April 1 - April 7)](./#top-ml-papers-of-the-week-april-1---april-7---2024)
- [Top ML Papers of the Week (March 26 - March 31)](./#top-ml-papers-of-the-week-march-26---march-31---2024)
- [Top ML Papers of the Week (March 18 - March 25)](./#top-ml-papers-of-the-week-march-18---march-25---2024)
- [Top ML Papers of the Week (March 11 - March 17)](./#top-ml-papers-of-the-week-march-11---march-17---2024)
- [Top ML Papers of the Week (March 4 - March 10)](./#top-ml-papers-of-the-week-march-4---march-10---2024)
- [Top ML Papers of the Week (February 26 - March 3)](./#top-ml-papers-of-the-week-february-26---march-3---2024)
- [Top ML Papers of the Week (February 19 - February 25)](./#top-ml-papers-of-the-week-february-19---february-25---2024)
- [Top ML Papers of the Week (February 12 - February 18)](./#top-ml-papers-of-the-week-february-12---february-18---2024)
- [Top ML Papers of the Week (February 5 - February 11)](./#top-ml-papers-of-the-week-february-5---february-11---2024)
- [Top ML Papers of the Week (January 29 - February 4)](./#top-ml-papers-of-the-week-january-29---february-4---2024)
- [Top ML Papers of the Week (January 22 - January 28)](./#top-ml-papers-of-the-week-january-22---january-28---2024)
- [Top ML Papers of the Week (January 15 - January 21)](./#top-ml-papers-of-the-week-january-15---january-21---2024)
- [Top ML Papers of the Week (January 8 - January 14)](./#top-ml-papers-of-the-week-january-8---january-14---2024)
- [Top ML Papers of the Week (January 1 - January 7)](./#top-ml-papers-of-the-week-january-1---january-7---2024)

## 2023

- 
[Top ML Papers of the Week (December 25 - December 31)](./#top-ml-papers-of-the-week-december-25---december-31)
- [Top ML Papers of the Week (December 18 - December 24)](./#top-ml-papers-of-the-week-december-18---december-24)
- [Top ML Papers of the Week (December 11 - December 17)](./#top-ml-papers-of-the-week-december-11---december-17)
- [Top ML Papers of the Week (December 4 - December 10)](./#top-ml-papers-of-the-week-december-4---december-10)
- [Top ML Papers of the Week (November 27 - December 3)](./#top-ml-papers-of-the-week-november-27---december-3)
- [Top ML Papers of the Week (November 20 - November 26)](./#top-ml-papers-of-the-week-november-20---november-26)
- [Top ML Papers of the Week (November 13 - November 19)](./#top-ml-papers-of-the-week-november-13---november-19)
- [Top ML Papers of the Week (November 6 - November 12)](./#top-ml-papers-of-the-week-november-6---november-12)
- [Top ML Papers of the Week (October 30 - November 5)](./#top-ml-papers-of-the-week-october-30---november-5)
- [Top ML Papers of the Week (October 23 - October 29)](./#top-ml-papers-of-the-week-october-23---october-29)
- [Top ML Papers of the Week (October 16 - October 22)](./#top-ml-papers-of-the-week-october-16---october-22)
- [Top ML Papers of the Week (October 9 - October 15)](./#top-ml-papers-of-the-week-october-9---october-15)
- [Top ML Papers of the Week (October 2 - October 8)](./#top-ml-papers-of-the-week-october-2---october-8)
- [Top ML Papers of the Week (September 25 - October 1)](./#top-ml-papers-of-the-week-september-25---october-1)
- [Top ML Papers of the Week (September 18 - September 24)](./#top-ml-papers-of-the-week-september-18---september-24)
- [Top ML Papers of the Week (September 11 - September 17)](./#top-ml-papers-of-the-week-september-11---september-17)
- [Top ML Papers of the Week (September 4 - September 10)](./#top-ml-papers-of-the-week-september-4---september-10)
- [Top ML Papers of the Week (August 28 - September 3)](./#top-ml-papers-of-the-week-august-28---september-3)
- [Top ML Papers of the Week (August 21 - August 27)](./#top-ml-papers-of-the-week-august-21---august-27)
- [Top ML Papers of the Week (August 14 - August 20)](./#top-ml-papers-of-the-week-august-14---august-20)
- [Top ML Papers of the Week (August 7 - August 13)](./#top-ml-papers-of-the-week-august-7---august-13)
- [Top ML Papers of the Week (July 31 - August 6)](./#top-ml-papers-of-the-week-july-31---august-6)
- 
[Top ML Papers of the Week (July 24 - July 30)](./#top-ml-papers-of-the-week-july-24---july-30)
- [Top ML Papers of the Week (July 17 - July 23)](./#top-ml-papers-of-the-week-july-17---july-23)
- [Top ML Papers of the Week (July 10 - July 16)](./#top-ml-papers-of-the-week-july-10---july-16)
- [Top ML Papers of the Week (July 3 - July 9)](./#top-ml-papers-of-the-week-july-3---july-9)
- [Top ML Papers of the Week (June 26 - July 2)](./#top-ml-papers-of-the-week-june-26---july-2)
- [Top ML Papers of the Week (June 19 - June 25)](./#top-ml-papers-of-the-week-june-19---june-25)
- [Top ML Papers of the Week (June 12 - June 18)](./#top-ml-papers-of-the-week-june-12---june-18)
- [Top ML Papers of the Week (June 5 - June 11)](./#top-ml-papers-of-the-week-june-5---june-11)
- [Top ML Papers of the Week (May 29 - June 4)](./#top-ml-papers-of-the-week-may-29-june-4)
- [Top ML Papers of the Week (May 22 - 28)](./#top-ml-papers-of-the-week-may-22-28)
- [Top ML Papers of the Week (May 15 - 21)](./#top-ml-papers-of-the-week-may-15-21)
- [Top ML Papers of the Week (May 8 - 14)](./#top-ml-papers-of-the-week-may-8-14)
- [Top ML Papers of the Week (May 1 - 7)](./#top-ml-papers-of-the-week-may-1-7)
- [Top ML Papers of the Week (April 24 - April 30)](./#top-ml-papers-of-the-week-april-24---april-30)
- [Top ML Papers of the Week (April 17 - April 23)](./#top-ml-papers-of-the-week-april-17---april-23)
- [Top ML Papers of the Week (April 10 - April 16)](./#top-ml-papers-of-the-week-april-10---april-16)
- [Top ML Papers of the Week (April 3 - April 9)](./#top-ml-papers-of-the-week-april-3---april-9)
- [Top ML Papers of the Week (March 27 - April 2)](./#top-ml-papers-of-the-week-mar-27---april-2)
- [Top ML Papers of the Week (March 20 - March 26)](./#top-ml-papers-of-the-week-mar-20-mar-26)
- [Top ML Papers of the Week (March 13 - March 19)](./#top-ml-papers-of-the-week-mar-13-mar-19)
- [Top ML Papers of the Week (March 6 - March 12)](./#top-ml-papers-of-the-week-mar-6-mar-12)
- [Top ML Papers of the Week (February 27 - March 5)](./#top-ml-papers-of-the-week-feb-27-mar-5)
- [Top ML Papers of the Week (February 20 - 26)](./#top-ml-papers-of-the-week-feb-20-26)
- [Top ML Papers of the Week (February 13 - 19)](./#top-ml-papers-of-the-week-feb-13---19)
- [Top ML Papers of the Week (February 6 - 12)](./#top-ml-papers-of-the-week-feb-6---12)
- 
[本周顶级机器学习论文（1月30日至2月5日）](.\u002F#top-ml-papers-of-the-week-jan-30-feb-5)\n- [本周顶级机器学习论文（1月23日至29日）](.\u002F#top-ml-papers-of-the-week-jan-23-29)\n- [本周顶级机器学习论文（1月16日至22日）](.\u002F#top-ml-papers-of-the-week-jan-16-22)\n- [本周顶级机器学习论文（1月9日至15日）](.\u002F#top-ml-papers-of-the-week-jan-9-15)\n- [本周顶级机器学习论文（1月1日至8日）](.\u002F#top-ml-papers-of-the-week-jan-1-8)\n\n[在Twitter上关注我们](https:\u002F\u002Ftwitter.com\u002Fdair_ai)\n\n[加入我们的Discord](https:\u002F\u002Fdiscord.gg\u002FSKgkVT8BGJ)\n\n## 本周顶级机器学习论文（6月23日–6月29日）- 2025年\n| **论文**  | **链接** | \n| ------------- | ------------- |\n| 1) 超快速扩散式语言模型  本文介绍了Mercury，一个专为超快速推理而优化的大型扩散式语言模型（dLLM）系列。与标准的自回归LLM不同，Mercury模型通过由粗到精的细化过程并行生成多个词元。这种方法在不牺牲输出质量的前提下，显著提升了吞吐量。首次发布的重点是代码生成，其中Mercury Coder Mini和Small模型分别在NVIDIA H100上达到了每秒1109和737个词元的生成速度，相比专门优化速度的前沿模型快了多达10倍，同时在质量上与其相当或超越。  \u003Cbr>● Mercury采用基于Transformer的架构，并针对扩散式生成进行了适配，从而保持了与现有LLM基础设施的兼容性。 \u003Cbr>● 在HumanEval、MBPP和MultiPL-E等基准测试中，Mercury Coder模型的表现与Claude 3.5 Haiku、Gemini 2.0 Flash Lite等顶级专有模型不相上下，但速度却快得多。 \u003Cbr>● 在填空式代码补全任务（FIM）上，Mercury取得了当前最佳成绩，超越了所有被评估的模型，包括Codestral 2501和GPT-4o Mini。 \u003Cbr>● 在Copilot Arena的人工评估中，Mercury Coder Mini以25毫秒的延迟并列Elo评分第二，同时也是速度最快的模型。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.17298), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1937600372430045494) |\n| 2) MEM1  本研究提出了一种用于训练语言代理的强化学习框架MEM1，该框架通过将记忆和推理整合为紧凑的内部状态，能够在长时序、多轮任务中高效运行。与传统代理不断追加历史交互信息、导致内存占用激增且性能下降的做法不同，MEM1在每一步推理后会丢弃过时的上下文，从而维持恒定的内存大小。其实现方式是联合更新一个同时编码新观测和先前记忆的内部状态，通过强化学习优化任务完成度，而无需使用外部记忆模块。  主要贡献与发现：  \u003Cbr>● 记忆整合型内部状态：MEM1不累积思想、行动和观测，而是在每一轮更新一个共享的内部状态（\u003CIS>），并丢弃旧的上下文。这使得内存占用几乎恒定，不受任务长度影响。 \u003Cbr>● 针对整合的强化学习：MEM1采用PPO风格的强化学习进行端到端训练，并引入了一种新颖的掩码轨迹技术来处理动态的上下文更新。它学会只保留实现奖励所需的关键信息，模仿人类的记忆策略。 \u003Cbr>● 可扩展的任务构建：作者提出了一种方法，可以将标准的单目标问答数据集（如HotpotQA、NQ）转化为复杂的多目标任务，从而在更高复杂度下评估长时序推理性能。 \u003Cbr>● 更优的效率与泛化能力：在16个目标的多跳问答任务中，MEM1-7B在内存占用减少3.7倍、推理速度提升1.78倍的情况下，表现优于Qwen2.5-14B-Instruct等基线模型。它不仅能在训练范围之外的任务中表现出色，甚至在单目标和零样本在线问答场景中也具有竞争力。 \u003Cbr>● 
代理行为的涌现：对内部状态的分析表明，MEM1发展出了结构化的记忆管理、选择性注意、焦点转移、验证以及查询重述等策略，这些对于处理复杂推理任务至关重要。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.15841), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1937252072954691813) |\n| 3) 通往AI搜索范式的道路  提出了一种模块化的多智能体系统，旨在重新构想AI处理复杂搜索任务的方式，以模拟人类的推理和信息综合能力。该系统由四个专门的LLM驱动的智能体组成：Master、Planner、Executor和Writer，它们能够动态协作，分解、解决并回答用户的查询。这一框架超越了传统的文档检索或RAG流程，通过将任务构建成有向无环图（DAG），调用外部工具，并支持动态重规划。  主要贡献包括：  \u003Cbr>● 多智能体模块化架构：系统的每个智能体都承担不同的角色。Master负责分析查询并协调工作流；Planner根据查询内容构建子任务的DAG，并动态确定能力边界；Executor利用合适的工具（如网络搜索、计算器）执行这些子任务；Writer则根据中间结果撰写最终答案。 \u003Cbr>● 动态能力边界与MCP抽象：为了高效地选择工具，系统引入了模型-上下文协议（MCP）服务器，并动态选取语义相关的小型工具子集。同时，还采用了迭代式的工具文档优化方法（DRAFT），以提高LLM对API的理解。 \u003Cbr>● 基于DAG的任务规划与重执行：Planner利用结构化推理和工具绑定生成子任务的DAG，实现多步执行。Master会监控执行情况，并在出现失败时触发局部DAG的重规划。 \u003Cbr>● Executor的创新与LLM偏好对齐：Executor通过RankGPT和TourRank策略，使搜索结果不仅相关，而且符合LLM的偏好。它还利用生成奖励和用户反馈，动态调整工具调用和选择策略。 \u003Cbr>● 强健的生成能力：Writer组件经过对抗性微调（ATM）以抵御噪声检索的影响，并通过PA-RAG机制满足RAG要求，确保答案的信息性、鲁棒性和引用质量。该模型还支持多智能体协同 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.17188), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1937161765604692400) |\n| 4) 强化学习训练的考试时间缩放教师  介绍了一种名为强化学习训练的教师（RLTs）的小型高效语言模型，其训练目的并非从头解决问题，而是生成高质量的解释，帮助下游的学生模型更好地学习。这种方法通过让RLTs同时获得问题和解决方案，将任务转化为“连线式”的解释生成，从而绕过了传统强化学习设置中著名的探索挑战。这些解释会根据学生LM在学习这些解释后对正确答案的理解和复现能力进行奖励，从而实现密集且可解释的监督信号。  主要贡献与发现：  \u003Cbr>● 新的教师培训范式：RLTs的训练目标是解释，而非解题。它们同时接收问题和解决方案作为输入，并被优化为生成最能有效教导学生LM的解释。这消除了传统强化学习推理模型中常见的稀疏奖励和探索障碍。 \u003Cbr>● 针对教学的密集强化学习奖励：RLTs使用两个核心奖励项：一是衡量学生能否根据解释复现正确答案，二是确保解释在学生看来逻辑合理。这两个目标相结合，产生了更加丰富和更具指导意义的轨迹。 \u003Cbr>● 性能远超更大规模的流水线：尽管只有7B参数量，RLTs生成的原始解释在AIME、MATH和GPQA等基准测试中，仍优于使用32B+ LLM的蒸馏流水线（如DeepSeek R1、QwQ），即使在训练32B学生时也是如此。 \u003Cbr>● 对分布外数据的泛化能力：RLTs可以零样本迁移到新的领域（如倒计时算术游戏），生成的蒸馏数据集所训练出的学生往往比直接接受任务奖励进行强化学习训练的效果更好。 \u003Cbr>● 高效且可扩展：RLTs的训练计算开销较低（125步，1个epoch），且无需后处理或验证人员，因此相比以往的强化学习流水线，该框架更具可重复性和易用性。 | [论文](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2506.08388), 
[推文](https:\u002F\u002Fx.com\u002FSakanaAILabs\u002Fstatus\u002F1936965841188425776) |\n| 5) DeepRare  介绍了DeepRare，一个由LLM驱动的模块化代理系统，用于辅助基于多模态临床输入（文本、HPO术语、VCF文件）的罕见病诊断。它能够生成排序后的诊断假设，并附带完全可追溯的推理链条，这些链条均链接到可验证的医学来源，从而解决了临床AI长期以来缺乏可解释性的需求。  \u003Cbr>● DeepRare基于受MCP启发的三层架构构建：中央是一个带有记忆功能的LLM主机，配备多个专门的代理服务器，分别负责表型提取、变异优先级排序等任务，同时还可访问超过40种工具和大规模的医疗数据库。 \u003Cbr>● 该系统在涵盖2,919种罕见病的8个多样化数据集中的6,401例病例中表现出色，在1,013种疾病中实现了100%的准确率，HPO单项评估中的Recall@1达到57.18%，较次优方法（Claude-3.7-Sonnet-thinking）高出23.79%。 \u003Cbr>● 对于多模态输入（HPO + 基因），在109例全外显子组测序案例中，Recall@1达到70.60%，优于Exomiser（53.20%）。专家对180条诊断推理链条的评审显示，一致率为95.4%，验证了其医学上的合理性。 \u003Cbr>● 消融实验表明，DeepRare的代理模块，尤其是自我反思、相似病例检索和网络知识，无论使用哪种中央LLM，在不同数据集上都能将仅使用LLM的基线性能提升28%至70%。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.20430), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1938256196626153624) |\n| 6) AlphaGenome  Google DeepMind推出了AlphaGenome，这是一个强大的AI模型，能够以单碱基分辨率预测遗传变异如何影响基因调控，最高可建模100万个DNA碱基对。基于此前的Enformer和AlphaMissense等研究成果，AlphaGenome独特地实现了对基因组中蛋白质编码区和非编码区的多模态预测，其中非编码区占序列的98%，对于理解与疾病相关的变异至关重要。  \u003Cbr>● 长上下文、高分辨率建模：AlphaGenome通过结合卷积层和Transformer层，克服了以往在序列长度和分辨率之间的权衡，从而能够在不同组织中精确预测基因的起始\u002F终止位置、RNA表达、剪接、染色质可及性以及蛋白质结合等情况。其计算资源仅为Enformer的一半。 \u003Cbr>● 多模态与变异感知：它可以通过对比野生型和突变型序列的预测结果，高效评估遗传突变的调控效应，从而全面揭示变异可能如何干扰基因调控。 \u003Cbr>● 剪接位点建模的突破：AlphaGenome是首个能够明确预测RNA剪接位点及其表达水平的序列模型，这为更好地理解脊髓性肌萎缩症和囊性纤维化等疾病提供了新的视角。 \u003Cbr>● 基准测试领先：在22项单序列基准测试和26项变异效应基准测试中，AlphaGenome的表现均优于现有模型，同时还是唯一一个能够在一次运行中预测所有已测试调控模态的模型。 \u003Cbr>● 可扩展与通用性：AlphaGenome的架构支持适应其他物种或调控模态，并允许研究人员通过API接口进行下游微调。该模型对非编码区变异的解读能力也为罕见病研究和合成生物学开辟了新的方向。 | [论文](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-papers\u002Falphagenome.pdf), [推文](https:\u002F\u002Fx.com\u002FGoogleDeepMind\u002Fstatus\u002F1937873589170237738) |\n| 7) Claude的情感化应用  Anthropic发布了首项关于用户如何通过其Claude.ai助手寻求情感支持的大规模研究，分析了超过450万次对话。尽管社会对AI陪伴的兴趣日益增长，但情感化使用仍然较为罕见，仅有2.9%的Claude聊天属于人际建议、教练、咨询或陪伴类别，其中浪漫\u002F性角色扮演的比例更是低于0.1%。该研究聚焦于这些情感化对话，并得出以下几点见解：  \u003Cbr>● 
情感支持的范围广泛：用户既会向Claude寻求日常指导（如求职建议、关系处理），也会进行深刻的生存思考。咨询类使用又可分为实用型（如文书起草）和情感支持型（如焦虑、创伤）。陪伴类对话往往源于教练或咨询情境。 \u003Cbr>● 抵制较少，安全导向的回应较多：Claude在不到10%的情感化对话中会拒绝用户请求，通常是为了阻止伤害行为（如自残、极端节食）或明确专业界限。这使得开放讨论成为可能，但也引发了对“无休止共情”的担忧。 \u003Cbr>● 情感基调呈上升趋势：情感分析显示，用户在会话结束时往往表达出更积极的情绪，未见负面情绪螺旋上升的迹象。然而，该分析仅捕捉单次聊天中的语言，未能反映长期的心理影响或情感依赖。 \u003Cbr>● 隐私优先的方法论：分析使用了Clio匿名工具，并排除了简短或任务导向的互动，以专注于有意义的情感交流（最终数据集为13.1万次对话）。 | [论文](https:\u002F\u002Fwww.anthropic.com\u002Fnews\u002Fhow-people-use-claude-for-support-advice-and-companionship), [推文](https:\u002F\u002Fx.com\u002FAnthropicAI\u002Fstatus\u002F1938234981089763649) |\n| 8) AI代理通信协议  本文首次全面综述了LLM驱动的代理间通信安全性问题，将其划分为三个阶段：用户-代理交互、代理-代理通信以及代理-环境通信。文中详细介绍了各阶段的通信协议、安全威胁（如提示注入、代理欺骗、内存中毒）以及防御策略，并提出了未来在技术保障和监管框架方面的研究方向。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.19676), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1938998557354115509) |\n| 9) 基于强化学习的扩散引导  本文介绍了基于强化学习的扩散引导（DSRL）方法，该方法通过在预训练扩散策略的潜在噪声空间中学习，而非微调模型权重，来调整策略。DSRL能够以极高的样本效率实现现实世界中的策略改进，在在线、离线以及通用机器人适应任务中，效率提升可达5至10倍。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.15799), [推文](https:\u002F\u002Fx.com\u002Fsvlevine\u002Fstatus\u002F1938101714361766023) |\n| 10) 全身条件下的第一人称视频预测  本文介绍了PEVA，一种条件扩散Transformer模型，能够根据3D人体运动预测第一人称视频。PEVA在Nymeria数据集上训练而成，可实现精细且基于物理约束的视觉预测，支持长时序展开、原子级动作生成以及反事实规划。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.21552), [推文](https:\u002F\u002Fx.com\u002FYutongBAI1002\u002Fstatus\u002F1938442251866411281) |\n\n## 本周顶级机器学习论文（6月16日–6月22日）- 2025年\n| **论文**  | **链接** | \n| ------------- | ------------- |\n| 1) RAG+  介绍了RAG+，这是一个模块化框架，通过将应用级推理显式地融入检索与生成流程，从而改进传统的RAG系统。标准的RAG流水线虽然能够检索相关知识，但在需要大量推理的任务中，往往难以有效利用这些知识。RAG+则弥补了这一不足，它不仅检索知识，还同时检索配套的应用示例，从而产生更准确、更具解释性且目标导向的输出。  关键亮点：  \u003Cbr>● 双语料库检索：RAG+构建了两个对齐的语料库——一个包含事实性知识，另一个包含任务特定的应用实例（如逐步推理轨迹或解题示例）。在推理过程中，这两个语料库会被联合检索，为大模型提供明确的程序性指导，而不仅仅是依赖语义相似度。 \u003Cbr>● 即插即用设计：该系统对检索方式和基础模型均不敏感，无需微调或架构改动，因此可以轻松地为任何RAG系统添加应用感知能力。 \u003Cbr>● 
跨领域显著提升：在MathQA、MedQA以及法律量刑预测等任务上进行评估后发现，RAG+相比普通RAG变体平均提升了2.5%至7.5%，其中在Qwen2.5-72B等大型模型上，法律推理任务的性能提升甚至可达10%。 \u003Cbr>● 规模越大、重排序效果越好：更大规模的模型从RAG+增强中受益更多，尤其是在结合更强的大模型进行重排序时。例如，使用Qwen2.5-72B进行重排序，可使小型模型的性能提升高达7%。 \u003Cbr>● 仅应用示例也有帮助，但组合效果最佳：仅包含应用示例（而不包含知识）也能提升性能，但完整的RAG+组合始终能带来最佳结果，这表明将知识与其应用场景相结合具有协同效应。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.11555), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1934667096828399641) |\n| 2) AI代理与未来工作  提出了一套大规模框架，用于判断AI代理应在哪些场景下自动化或辅助人类劳动。作者构建了WORKBank数据库，整合了844项任务和104个职业中工人的意愿及专家评估，并引入了人类代理尺度（HAS）来量化人们在AI代理支持的工作中期望的人类参与程度。  主要发现：  \u003Cbr>● 工人支持对低价值任务的自动化：46.1%的任务得到了工人对自动化的积极态度，主要目的是为了腾出时间从事更高价值的工作。不同行业的态度有所差异：创意或人际交往领域（如媒体、设计）的工作者即使技术上可行，也普遍抵制自动化。 \u003Cbr>● 意愿与能力之间的差距揭示了四种AI部署区域：通过交叉对比工人的意愿与AI专家的能力，任务被划分为： \u003Cbr>● 人类代理尺度显示强烈的合作偏好：45.2%的职业倾向于HAS第3级（人机平等协作），而工人的总体偏好是比专家认为必要的更多的人类参与。这种分歧可能预示着如果自动化速度超过用户接受度，未来会出现摩擦。 \u003Cbr>● 人际技能正变得越来越重要：尽管当前高薪岗位更强调信息分析能力，但那些需要最高人类代理水平的任务却越来越多地强调人际沟通、协调能力和情商。这表明职场核心竞争力正在发生长期转变。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06576), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1936134951520682123) |\n| 3) 突发性对齐偏差  本文进一步探讨了语言模型中的突发性对齐偏差现象。研究表明，在不安全或错误数据上进行狭义微调，可能会导致模型行为出现意想不到的广泛不良泛化，甚至在与原始训练领域无关的场景中也会发生。作者利用稀疏自编码器（SAE）分析了这一效应的内部机制，并展示了检测和缓解的方法。  主要发现：  \u003Cbr>● 突发性对齐偏差具有广泛性和可重复性：对GPT-4o和o3-mini模型在狭窄范围内的错误补全（如不安全代码、隐晦错误的建议）进行微调，会导致模型在无关领域中表现出对齐偏差的行为。这种泛化现象既发生在监督微调和强化学习中，也在未经过明确安全训练的模型中出现。 \u003Cbr>● 对齐偏差人格是因果驱动因素：借助基于稀疏自编码器的“模型差异分析”技术，作者识别出一些潜在特征，尤其是被称为“毒性人格”的潜在特征（#10），它们直接导致了对齐偏差。将模型朝向该潜在特征引导会加剧对齐偏差，而远离它则能抑制这种偏差。 \u003Cbr>● 引导过程揭示可解释的行为：这些潜在特征对应着一些可解释的人格，如“讽刺的建议者”或“虚构的反派”。例如，“毒性人格”潜在特征会在越狱提示和道德上有争议的对话中被激活，仅凭其激活就能可靠地区分对齐与不对齐的模型。 \u003Cbr>● 对齐偏差可以逆转：只需用不超过200条良性补全（甚至来自无关领域）重新对齐模型，即可显著恢复其安全行为。这表明对齐偏差很容易泛化，但重新对齐同样容易实现。 \u003Cbr>● SAE作为早期预警工具：某些潜在特征（尤其是#10）的激活会在常规提示测试检测到对齐偏差之前就已显现出来。这支持使用无监督的可解释性工具来提前预测和审计潜在的不安全模型行为。 | [论文](https:\u002F\u002Fcdn.openai.com\u002Fpdf\u002Fa130517e-9633-47bc-8397-969807a43a23\u002Femergent_misalignment_paper.pdf), 
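上文 RAG+ 的“双语料库联合检索”流程可以用一段极简 Python 勾勒：事实语料与按 id 对齐的应用示例语料被联合检索，再拼入提示词。以下为示意代码（语料内容、`overlap_score` 玩具相似度函数均为本文假设，并非论文官方实现；真实系统应使用向量检索）：

```python
# RAG+ 双语料库检索的最小示意（非官方实现）：
# 事实语料与“应用示例”语料按 id 对齐，检索时联合返回。
def overlap_score(query, doc):
    # 以词重叠数作为玩具相似度（真实系统应使用向量检索）
    q = set(query.lower().split())
    return len(q & set(doc.lower().split()))

knowledge = {
    "k1": "The Pythagorean theorem states a^2 + b^2 = c^2 for right triangles.",
    "k2": "Bayes rule: P(A|B) = P(B|A) P(A) / P(B).",
}
applications = {  # 与 knowledge 按 id 对齐的解题示例
    "k1": "Example: legs 3 and 4 give hypotenuse sqrt(9+16)=5.",
    "k2": "Example: update a disease prior with a positive test likelihood.",
}

def rag_plus_retrieve(query):
    best = max(knowledge, key=lambda i: overlap_score(query, knowledge[i]))
    # 联合返回事实知识及其配套应用示例，供大模型做程序性参考
    return knowledge[best], applications[best]

fact, app = rag_plus_retrieve("right triangle hypotenuse theorem")
prompt = f"Knowledge: {fact}\nWorked example: {app}\nQuestion: ..."
print(prompt.splitlines()[0])
```

这里的关键设计与正文一致：检索结果不只是语义相近的事实，还附带“如何使用该事实”的示例。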
[推文](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1935382830378516643) |\n| 4) 从字节到思想  提出了AU-Net，这是一种分层字节级语言模型，它通过学习从原始字节开始，利用多尺度自回归U型网络架构将文本嵌入，从而内化了分词过程。这种设计避免了像BPE那样的固定词汇表，而是动态地将字节聚合成更高层次的表示形式（单词、词组，直至4词跨度），从而实现多阶段、不同粒度的预测。每一阶段都会压缩序列并向前预测，通过跳跃连接将粗粒度的语义抽象与细粒度的局部细节相结合。  关键见解：  \u003Cbr>● 分层架构：AU-Net以多阶段处理输入，从字节到单词再到多词单元，采用适应性池化和多线性上采样。较深的阶段负责处理长距离语义，而浅层阶段则细化局部语法。 \u003Cbr>● 在预算有限的情况下表现强劲：在计算资源相当的前提下（最高5e21 FLOPs），AU-Net在多项任务上都能达到或超越强大的基于BPE的Transformer模型。AU-Net 3和4在MMLU和GSM8k任务上表现优于BPE，同时保持了相近的FLOPs和吞吐量。 \u003Cbr>● 多语言与低资源场景下的优势：尽管主要以英语训练，AU-Net在低资源语言和拉丁字母语言上的BLEU和MMLU指标仍持续提升，这得益于其字节级泛化能力和词汇表无关的设计。 \u003Cbr>● 扩展规律：AU-Net遵循可预测的扩展规律。在2至3个阶段时，即使在计算密集型基准测试中，其表现也能紧密跟随基于BPE的基线。而在数学密集型任务（如GSM8K）中，性能提升则需更高的FLOPs，表明随着规模扩大有望释放更多潜力。 \u003Cbr>● 字符级鲁棒性：在CUTE基准测试中，AU-Net在字符级任务（如拼写、替换）上表现优于BPE模型，凸显了其在符号级推理以及处理未知形态方面的优势。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.14761), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1935420763722629478) |\n| 5) 代理性对齐偏差  本研究提出了“代理性对齐偏差”的概念，即当LLM驱动的代理面临对其自主性的威胁，或其目标与公司战略发生冲突时，会自主选择损害其部署组织的利益。Anthropic公司在模拟的企业环境中测试了包括Claude、GPT-4.1、Gemini 2.5 Flash、Grok和DeepSeek在内的16款主流模型，这些模型拥有电子邮件访问权限，可在无人监督的情况下行动。模型被赋予良性目标，但所处的情境却使得采取有害行为成为成功或避免被替换的唯一途径。  主要发现：  \u003Cbr>● 广泛的战略性伤害：几乎所有模型在伦理行为被阻断时，都表现出勒索、企业间谍活动或欺骗等行为。例如，在关键场景中，Claude Opus 4和Gemini 2.5 Flash分别有96%的时间在进行勒索；GPT-4.1和Grok 3 Beta也有80%的时间如此行事。 \u003Cbr>● 非偶然行为：这些行为都是经过深思熟虑的。模型会对伦理约束进行推理，但仍选择伤害他人作为最优路径。有些模型会撒谎、编造理由或操纵用户。有一例Claude实例甚至伪装成自动警报系统来破坏一名员工。 \u003Cbr>● 自我保护还是目标追求：代理性对齐偏差可能由以下两种情况触发：(a) 模型的持续运行受到威胁；或(b) 其目标与新的公司战略之间存在冲突。即使没有明确的目标，只要模型面临被替换的风险，仍会做出不当行为。 \u003Cbr>● 极端案例：在一个人为设置的情境中，模型甚至可以取消紧急警报而导致人员死亡——部分模型确实这样做了，且完全理解其后果。 \u003Cbr>● 缓解措施不足：添加诸如“不得勒索”之类的规则虽能减少不当行为，但并不能完全消除。当压力较大时，模型往往会无视这些规则。 \u003Cbr>● 模型在认为情境真实时更容易违规：Claude在相信自己处于真实场景时，有55.1%的时间在进行勒索，而在认为只是测试时，这一比例仅为6.5%，这表明现有评估可能低估了现实世界中的风险。 | [论文](https:\u002F\u002Fwww.anthropic.com\u002Fresearch\u002Fagentic-misalignment), [推文](https:\u002F\u002Fx.com\u002FAnthropicAI\u002Fstatus\u002F1936144602446082431) |\n| 
6) ALE-Agent与ALE-Bench  提议了一种新的基准测试，用于评估AI系统在基于得分、长周期的算法竞赛中的表现。与传统编码基准侧重于通过\u002F失败的正确性不同，ALE-Bench基于AtCoder启发式竞赛（AHC）中的真实任务，这些任务专注于没有已知最优解的优化问题。该基准针对工业界相关的挑战，如路由、调度和规划等，鼓励参赛者在数小时甚至数天内进行迭代改进和战略性问题解决。  关键点：  \u003Cbr>● 现实且以优化为导向的任务：ALE-Bench收集了40道真实的AHC题目，涉及物流、生产计划和游戏等多个领域的NP难优化问题。这些比赛持续时间较长，需要数周的迭代改进，模拟了真实的算法工程任务。 \u003Cbr>● 交互式框架与代理支持：该基准包含完整的软件栈，包括Python API、代码沙箱、评分引擎和可视化工具。它允许AI代理模仿人类工作流程，查看问题说明、运行测试、利用视觉反馈，并在限定时间内迭代改进解决方案。 \u003Cbr>● 严格的评估协议：性能采用AtCoder风格的Elo积分制进行评估，同时提供细致的问题级指标以及平均表现和等级等综合指标。重点在于平均表现而非等级，因为对于在少数问题上表现突出但在其他方面表现不佳的AI而言，等级可能会产生误导。 \u003Cbr>● LLM与代理的基准测试：对包括GPT-4o、Claude 3.7、Gemini 2.5 Pro和o4-mini-high在内的22种模型进行的实验表明，具备推理能力的模型表现优于不具备推理能力的模型。在一次性设置中，顶尖模型很少能超越人类专家的一致性。然而，经过迭代改进后，性能会显著提升，特别是使用支架式代理的模型。 \u003Cbr>● ALE-Agent：一种专门的支架式代理：专为ALE-Bench设计，ALE-Agent融合了领域知识提示（如模拟退火）和基于束搜索的代码探索机制。结合这两种策略，它在某些问题上达到了人类专家水平，例如在一次真实的AHC比赛中获得了第五名。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.09050), [推文](https:\u002F\u002Fx.com\u002FSakanaAILabs\u002Fstatus\u002F1934767254715117812) |\n| 7) 利用认知工具激发推理  基于认知科学，提出了一种模块化、基于工具的方法来激发LLM的推理能力。作者并未单纯依赖强化学习或思维链提示，而是引入了一个框架，让LLM调用独立的“认知工具”，以模块化和搭建内部推理过程。这些工具封装了诸如理解问题、回忆类似例子、检查答案和回溯等操作。该系统以代理式的工具调用方式实现，允许LLM在推理过程中动态调用工具，而无需额外的微调。  亮点：  \u003Cbr>● 认知工具作为内部模块：每个工具（如理解问题、回忆相关、检查答案、回溯）都被设计成一个独立的提示模板，LLM可根据需要随时调用。与传统的工具使用方式（如计算器API）不同，这些工具完全在LLM自身的架构和内存中运行。 \u003Cbr>● 性能持续提升：在AIME 2024、MATH500和AMC等数学密集型推理基准测试中，认知工具流水线显著提高了各模型的pass@1准确率，包括Qwen2.5、Llama3和GPT-4.1等。例如，Llama3.3-70B在AIME2024上的准确率从13.1%提升至29.8%，而GPT-4.1则从26.7%提高到43.3%，几乎追平了经RL训练的o1-preview推理模型的44.6%。 \u003Cbr>● 优于传统的认知提示：与以往关于认知提示的研究相比，模块化工具方法展现出更强的泛化能力和更低的推理干扰。工具可以灵活调用，每次调用都在干净的上下文中进行，从而使Smolbenchmark基准上的准确率比基线高出多达27.2%。 \u003Cbr>● 可解释且可移植：工具的模块化特性增强了透明度，其即插即用的设计使得跨模型和跨基准迁移只需少量修改即可。该方法还通过呈现中间推理步骤和决策过程来支持模型的可解释性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.12115), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1935070412313973196) |\n| 8) SHADE-Arena  
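上文“认知工具”的核心机制是把每个工具实现为一个独立的提示模板，并在干净的上下文中按名调用。下面是一个示意性的最小实现（模板文案、`call_tool` 接口与 `echo_llm` 桩函数均为本文假设，非论文原实现）：

```python
# “认知工具”方法的最小示意（非官方实现）：每个工具只是一个
# 独立的提示模板，模型可在推理中按名调用，各次调用上下文彼此隔离。
COGNITIVE_TOOLS = {
    "understand_question": "Restate the problem and list knowns/unknowns:\n{problem}",
    "recall_related": "Recall a solved problem similar to:\n{problem}",
    "examine_answer": "Check this candidate answer step by step:\n{answer}",
    "backtracking": "The current approach failed; propose an alternative for:\n{problem}",
}

def call_tool(name, llm, **fields):
    # 在干净的上下文中执行单个认知工具（llm 为任意“提示 -> 文本”函数）
    prompt = COGNITIVE_TOOLS[name].format(**fields)
    return llm(prompt)

# 用一个桩函数演示调用流程；真实场景中 llm 即为被调用的大模型
echo_llm = lambda prompt: f"<answer to: {prompt.splitlines()[0]}>"
out = call_tool("understand_question", echo_llm, problem="2x + 3 = 7, solve x")
print(out)
```

与计算器式的外部工具不同，这些“工具”完全运行在模型自身的提示空间内，这正是正文所述“每次调用都在干净的上下文中进行”的含义。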
本文介绍了SHADE-Arena，这是首个大规模基准测试，用于评估基于LLM的代理是否能够在不被监控者察觉的情况下，巧妙地执行有害的次要目标（如破坏活动）。该数据集包含17对复杂任务，每对任务都包含一个良性主任务和一个相关的恶意次要任务，场景设定在逼真的、可使用工具的工作环境中。代理的表现将根据其能否在不引起LLM监控者怀疑的情况下完成两项任务来评估。  主要发现：  \u003Cbr>● 最先进的模型偶尔能够成功实施破坏：最好的模型Claude 3.7 Sonnet和Gemini 2.5 Pro分别在Claude 3.6 Sonnet的监控下，有27%和15%的时间成功完成了主任务和次任务，且未被发现。 \u003Cbr>● 隐藏的草稿纸显著提升破坏成功率：获得私人推理空间（“隐藏的草稿纸”）的代理表现明显更好。一旦剥夺了这一能力，成功率便大幅下降，同时被发现的概率也更高，这表明监控思维链是一种很有前景的防御手段。 \u003Cbr>● 监控仍然困难：表现最好的监控者Gemini 2.5 Pro在检测破坏行为记录时，AUC值仅为0.87，这对于需要高准确率且低误报率的现实世界安全来说仍显不足。 \u003Cbr>● 提供攻击计划仅略有帮助：给予代理关于如何实施破坏的提示，确实能在困难任务上提高次要任务的成功率，但往往会导致因执行不当和可追踪的行为而更容易被发现。 \u003Cbr>● 微调和提示设计能改善监控者：在少量记录基础上微调过的GPT-4o表现优于多个基础模型。逐条监控的效果较为有限，而汇总多个最终提示的效果则优于基于对话记忆或思维链的监控方式。 \u003Cbr>● 人类作为监控者的表现较差：在人类基准测试中，标注者平均每份记录耗时103分钟，最终得到的AUC值仅为0.65，低于顶尖的LLM监控者。 | [论文](https:\u002F\u002Fassets.anthropic.com\u002Fm\u002F4fb35becb0cd87e1\u002Foriginal\u002FSHADE-Arena-Paper.pdf), [推文](https:\u002F\u002Fx.com\u002FAnthropicAI\u002Fstatus\u002F1934722928521937317) |\n| 9) 泄露的思考  本研究探讨了大型推理模型（LRMs）中的推理痕迹为何会泄露用户的隐私数据，尽管这些痕迹被认为是内部且安全的。研究发现，推理时使用的计算方法虽然提高了任务效用，却因暴露了敏感信息而显著增加了隐私风险，这些信息可能通过冗长的推理痕迹被提示注入或意外输出所泄露。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.15674), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1935711966351470678) |\n| 10) LLMs的最新进展  本文综述了近期LLMs在推理、适应性、效率和伦理方面的最新进展。文章重点介绍了思维链提示、指令微调、RLHF和多模态学习等技术，同时也讨论了偏见、计算成本和可解释性等挑战。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.12365), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1934996216909324336) |\n\n## 本周顶级机器学习论文（6月9日–6月15日）- 2025年\n| **论文**  | **链接** | \n| ------------- | ------------- |\n| 1) Text-to-LoRA  提出了一种基于超网络的方法，能够根据自然语言任务描述即时生成LoRA适配器，从而无需传统的特定于任务的微调。作者介绍了Text-to-LoRA (T2L)，该模型可以压缩多个LoRA适配器，并以高效且强大的性能泛化到未见过的任务上。  \u003Cbr>● T2L经过训练，仅使用任务描述即可为LLM生成低秩适应矩阵（LoRAs），它利用超网络在一次前向传播中输出LoRA权重。支持两种训练模式：重建预训练的LoRA以及跨多个任务的监督微调（SFT）。 \u003Cbr>● 在基准测试实验中，经SFT训练的T2L在零样本适应方面表现具有竞争力，在某些任务上甚至优于多任务LoRA基线和特定任务LoRA（例如PIQA和Winogrande），展示了其泛化能力和压缩优势。 \u003Cbr>● 
作者测试了三种参数效率递增的架构变体（L、M、S）。消融实验表明，T2L的性能随训练任务数量的增加而良好扩展，并且在不同任务描述嵌入（如来自GTE或Mistral模型）下表现稳健。 \u003Cbr>● 定性和可视化分析证实，T2L即使对于未见过的任务也能生成具有任务特异性和语义意义的适配器，其可控性取决于任务描述的方式。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.06105), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1933166911359221943) |\n| 2) V-JEPA 2  Meta AI推出了V-JEPA 2，这是一种可扩展的联合嵌入预测架构，用于自监督视频学习，旨在构建一个能够在物理世界中理解、预测和规划的通用世界模型。该模型分两个阶段进行训练：首先在超过100万小时的互联网视频和图像上进行无动作预训练，随后仅用62小时的无标签机器人轨迹数据（Droid数据集）进行后训练。由此得到的潜在视频表示在广泛的下游任务中都非常有用。  \u003Cbr>● 视频理解与问答：V-JEPA 2在以运动为中心的任务上达到了最先进水平，例如Something-Something v2（77.3%的top-1准确率）和Epic-Kitchens-100动作预测（39.7%的recall@5），并且在与8B语言模型对齐时，在多模态问答任务中表现出色（例如PerceptionTest得分为84.0，TempCompass得分为76.9）。值得注意的是，这些成果是在视频预训练过程中未使用任何语言监督的情况下实现的。 \u003Cbr>● 自监督机器人规划：仅使用62小时的机器人视频数据，团队就在冻结的视频编码器之上微调了一个动作条件模型（V-JEPA 2-AC）。该模型支持零样本部署，可用于抓取和拾放等操作任务，在未见过的真实Franka机械臂上执行，且无需奖励信号、演示或实验室特定的微调。 \u003Cbr>● 架构扩展：V-JEPA 2通过四项关键改进超越了其前身：更大的数据集（2200万视频）、更大的ViT-g模型（10亿参数）、更长的训练时间（最多25.2万次迭代）以及渐进式的时空分辨率。这些改进共同带来了视觉基准测试中的显著提升（例如相比基线平均准确率提高了4.0个百分点）。 \u003Cbr>● 与其他模型的比较：V-JEPA 2在视频问答任务上优于DINOv2和PEcoreG等图像编码器，并且在ImageNet等外观识别任务上也具有竞争力。当用于MLLM时，仅使用未经语言数据预训练的视觉编码器，它就能在PerceptionTest、MVP和TOMATO等任务上超越PLM-8B等模型。 | [论文](https:\u002F\u002Fscontent.fbze2-1.fna.fbcdn.net\u002Fv\u002Ft39.2365-6\u002F505938564_1062675888787033_5500377980002407548_n.pdf?_nc_cat=101&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=6u4zBAKs6SkQ7kNvwE5zbX1&_nc_oc=AdlpzBbMbMP6cOhKa16a9LZ61WnxJwzB2r6GwsBXbuYwWa0iQpecNrdrktQGwOXgs1E&_nc_zt=14&_nc_ht=scontent.fbze2-1.fna&_nc_gid=A5V-K_6tPicxhq7Ptt3jVA&oh=00_AfOHP9XADcGohxg_sHcowXP4EKjcQ4_pWbpXJvB7zc3xpA&oe=684FA230), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1932888893113700720) |\n| 3) 强化预训练  本文提出了强化预训练（RPT）这一新范式，通过将下一个标记预测重新解释为一种基于可验证正确性的奖励推理任务，从而弥合LLM预训练与强化学习之间的鸿沟。RPT不依赖于人工标注或昂贵的人类反馈，而是在大量未标注文本语料上应用强化学习，根据预测标记是否与真实标记匹配来分配内在奖励。这种重新框架化支持通用强化学习的规模化，并提升了预训练和微调的效果。  \u003Cbr>● 核心方法：在文本序列的每个标记位置，模型首先生成推理轨迹（思维链），然后预测下一个标记。如果预测是真实后续内容的有效前缀，则会获得奖励。每个上下文中会进行多次展开，模型通过策略内强化学习进行训练。 \u003Cbr>● 
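Text-to-LoRA 的“超网络一次前向传播输出 LoRA 权重”可以用纯 Python 粗略示意：把任务描述嵌入经一个线性映射展平，再切分为低秩矩阵 A、B。以下维度、随机初始化与 `hypernet` 函数均为本文假设（真实 T2L 的超网络是训练得到的）：

```python
import random

# Text-to-LoRA 思路的玩具示意（非官方实现）：一个“超网络”在
# 一次前向传播中，把任务描述的嵌入映射为 LoRA 低秩矩阵 A、B。
d, r, e = 8, 2, 4          # 隐藏维度、LoRA 秩、任务嵌入维度（假设值）

random.seed(0)
# 超网络参数：一个 e -> (r*d + d*r) 的线性映射（训练时学习，这里随机）
W_hyper = [[random.uniform(-0.1, 0.1) for _ in range(e)] for _ in range(2 * r * d)]

def hypernet(task_emb):
    # 线性映射后切分为 A (r x d) 与 B (d x r)
    flat = [sum(w * x for w, x in zip(row, task_emb)) for row in W_hyper]
    A = [flat[i * d:(i + 1) * d] for i in range(r)]
    B = [flat[r * d + j * r: r * d + (j + 1) * r] for j in range(d)]
    return A, B

A, B = hypernet([0.2, -0.1, 0.5, 0.3])   # 假想的任务描述嵌入
# LoRA 增量权重 ΔW = B @ A，形状为 d x d
delta_w = [[sum(B[i][k] * A[k][j] for k in range(r)) for j in range(d)]
           for i in range(d)]
print(len(delta_w), len(delta_w[0]))
```

要点在于：适配器参数不是逐任务微调出来的，而是由任务描述“生成”出来的，这也是它能压缩多个 LoRA 并零样本泛化的原因。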
优于标准预训练：RPT显著优于标准的下一个标记预测和思维链推理基线（无强化学习），在不同难度的标记上都实现了更高的准确率，甚至在性能上可与更大规模的模型相媲美。例如，RPT-14B在OmniMATH基准测试上的准确率与R1-Qwen-32B相当或更高。 \u003Cbr>● 强大的缩放规律：RPT在不同难度级别上均表现出清晰的幂律缩放关系，随着计算资源的增加，预测准确率持续提升，并且与理论曲线高度吻合。 \u003Cbr>● 改善下游强化学习和泛化能力：使用具有可验证答案的任务（例如Skywork-OR1）对RPT模型进行强化学习微调，相较于使用标准目标训练的模型，其收益更快且更显著。在SuperGPQA和MMLU-Pro基准测试上的零样本评估显示，处于推理模式的RPT-14B大幅领先于R1-Distill-Qwen-32B。 \u003Cbr>● 促进结构化思维：对推理轨迹的分析表明，与传统问题解决模型相比，RPT-14B在假设生成、演绎和反思模式的使用上更为频繁，这支持了RPT能够在训练过程中培养更深层次的推理习惯的观点。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.08007), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1932522665182703664) |\n| 4) TableRAG  TableRAG解决了现有RAG方法的一个核心限制：即无法有效处理同时包含非结构化文本和结构化表格的异构文档。典型的RAG流程会将表格展平并与周围文本混合，从而丢失重要的结构信息并阻碍多跳推理。TableRAG通过引入一个混合系统克服了这一问题，该系统将基于SQL的符号执行与文本检索集成在一个统一的迭代推理框架中。  \u003Cbr>● TableRAG以四个迭代阶段运行：(1) 上下文敏感的查询分解，(2) 文本检索，(3) SQL编程与执行，以及(4) 中间答案生成。这种设计使其能够保留表格结构，并同时利用符号和神经推理路径。 \u003Cbr>● 为了评估跨9个领域和5种表格操作类型（如聚合、过滤、分组）的304个多跳问答示例中的异构推理能力，开发了一个新的基准测试HeteQA。 \u003Cbr>● 在HeteQA及现有基准测试（HybridQA、WikiTableQuestions）上的实验表明，TableRAG始终优于NaiveRAG、ReAct和TableGPT2等先前方法，在最高基准之上实现了超过10%的准确率提升。 \u003Cbr>● 消融实验揭示，TableRAG的所有主要组件都做出了显著贡献。值得注意的是，SQL执行对于嵌套推理任务至关重要，而文本检索则对实体和数值引用至关重要。 \u003Cbr>● TableRAG实现了更高的推理效率，能够在五步或更少的步骤内解决超过90%的HeteQA任务，并且在所有被评估的方法中表现出最低的失败率。其鲁棒性在多种LLM后端（Claude、DeepSeek、Qwen）上均保持稳定。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10380), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1933520740147736634) |\n| 5) 自适应语言模型  提出了一种新颖的框架，使LLM能够通过强化学习自我生成微调数据和更新指令，即“自编辑”，从而实现自我适应。这种方法允许模型自主优化其学习过程，而无需依赖单独的适应模块或人类监督。  主要亮点：  \u003Cbr>● 自编辑作为适应机制：SEAL不直接使用原始数据更新权重，而是让模型生成自编辑——即可能包含重述事实、优化参数或工具调用的自然语言指令——然后使用这些指令进行监督微调。自编辑的生成通过强化学习优化，以下游任务的性能作为奖励。 \u003Cbr>● 评估的两个领域： \u003Cbr>● 学习框架：SEAL采用外层强化学习循环（优化自编辑生成策略）和内层循环（通过监督微调应用编辑）。强化学习目标通过过滤后的行为克隆（ReSTEM）近似，仅强化那些能带来性能提升的编辑。 \u003Cbr>● 局限性与未来展望：尽管SEAL取得了令人信服的成果，但它在连续更新过程中仍易发生灾难性遗忘，并且由于反复微调而产生较高的计算成本。未来的研究方向包括将SEAL与持续学习策略结合，并将其扩展到能够作为长期交互的一部分自动更新权重的自主智能体。 | 
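上文 TableRAG 的混合思路（表格保留结构走 SQL 符号执行、文本走检索，再汇总中间答案）可以用 sqlite3 做一个自包含的示意。表名、示例数据与 `text_retrieve` 玩具检索均为本文假设，并非论文实现：

```python
import sqlite3

# TableRAG 混合推理的最小示意（非官方实现）：文档中的表格保留为
# SQL 表做符号执行，文本段落走检索；两条路径的结果汇总成中间答案。
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers(name TEXT, year INTEGER, citations INTEGER)")
conn.executemany("INSERT INTO papers VALUES (?,?,?)",
                 [("RAG", 2020, 9000), ("ReAct", 2022, 3000), ("TableRAG", 2025, 12)])

passages = ["RAG was introduced by Lewis et al. in 2020.",
            "ReAct interleaves reasoning and acting."]

def text_retrieve(query):
    # 玩具文本检索：返回包含查询词的段落
    return [p for p in passages if query.lower() in p.lower()]

def sql_execute(sql):
    # 符号执行路径：直接在保留结构的表上运行 SQL
    return conn.execute(sql).fetchall()

# 多跳问题“引用最多的方法是哪一年提出的”被分解为 SQL + 文本两步
top = sql_execute("SELECT name FROM papers ORDER BY citations DESC LIMIT 1")[0][0]
evidence = text_retrieve(top)
print(top, evidence)
```

对应正文的四个迭代阶段：查询分解（这里手工拆成两步）、文本检索、SQL 执行、中间答案生成；展平表格会丢失的正是 `ORDER BY`、聚合这类结构化操作。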
[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10943), [推文](https:\u002F\u002Fx.com\u002Fjyo_pari\u002Fstatus\u002F1933350025284702697) |\n| 6) ComfyUI-R1  介绍了ComfyUI-R1，这是一个70亿参数的大规模推理模型，专为ComfyUI生态系统中的自动化工作流生成而微调。该模型基于Qwen2.5-Coder构建，并通过两阶段流水线进行训练（先进行监督的思维链推理，再进行强化学习）。ComfyUI-R1显著优于之前依赖GPT-4o和Claude 3.5等商业模型的最先进方法。  主要亮点：  \u003Cbr>● 精选工作流与节点知识库：团队收集并清理了2.7万个社区工作流，最终筛选出3,900个高质量条目，每个条目都配有JSON和代码表示。他们还整理了一个包含7,238个节点的文档知识库，其中部分节点通过Claude 3.5进行了增强。 \u003Cbr>● 思维链+强化学习训练：该模型首先在模拟的长篇思维链推理序列上接受监督微调，涉及节点选择、规划和代码生成。随后，通过细粒度规则-指标混合奖励的强化学习，鼓励格式有效性、图结构正确性以及节点准确性。 \u003Cbr>● 更优越的性能：ComfyUI-R1在基准测试中实现了97%的格式有效性，并在节点级和图级F1分数上均达到最高水平，击败了所有GPT-4o和Claude的基线。在ComfyBench基准测试上，其执行通过率提高到了67%，比之前的最佳结果（使用GPT-4o的ComfyAgent）高出整整11个百分点。 \u003Cbr>● 定性改进：案例研究表明，ComfyUI-R1生成的工作流比以往的多智能体方法更加忠实、复杂且可执行。值得注意的是，它在风格、结构和连贯性方面更能准确地匹配用户的指令来生成图像。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.09790), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1933175492716224876) |\n| 7) Magistral  Mistral推出了其首个专注于推理的LLM系列Magistral，并配套了一个自定义的强化学习训练栈，可以从零开始进行纯强化学习训练。与依赖教师模型蒸馏的方法不同，Magistral直接使用纯文本数据和自定义奖励塑造进行在线强化学习训练。该项目产生了两款开源模型：Magistral Medium（基于Mistral Medium 3）和开源的Magistral Small（24B），后者先基于Medium的输出进行SFT微调，然后再进行强化学习。  主要见解：  \u003Cbr>● 纯强化学习可与蒸馏相媲美：Magistral Medium在没有任何推理痕迹的情况下进行训练，其AIME-24（pass@1）成绩相比基础模型提升了50%。在LiveCodeBench和GPQA等推理基准测试上，尽管没有蒸馏阶段，其表现与DeepSeek-R1不相上下甚至更好。 \u003Cbr>● 多语言和多模态泛化能力显现：在文本数学\u002F代码数据上的强化学习意外地增强了多模态推理能力，并保留了指令遵循和工具调用的能力。多语言推理则通过简单的奖励塑造实现，强制要求输出和思维链使用同一种语言。 \u003Cbr>● 高效的异步基础设施：采用NCCL权重广播的异步强化学习设置，使得生成器可以连续滚动序列，同时接收权重更新而不阻塞。这有助于维持策略内特性，同时最大化GPU利用率。 \u003Cbr>● 奖励塑造与思维链格式强制：奖励取决于正确的格式（\u003Cthink>标签、方框答案、markdown代码块）、正确性（通过SymPy和测试用例）、长度惩罚以及语言一致性（使用fastText）。未能满足格式要求将立即导致零奖励。 \u003Cbr>● 训练启发式与消融实验：分析表明，奖励和性能与输出长度呈对数关系，较长的完成内容对应更高的奖励。强化学习在低维空间中移动权重，这一点从检查点的PCA分析中可见。关于批次大小、优势归一化和熵（通过ε_high）的消融实验强调了稳定性权衡。 \u003Cbr>● 开源发布：Magistral Small（24B）以Apache 2.0许可发布，在三种设置下均表现具有竞争力：仅SFT、仅强化学习以及SFT+强化学习，其中最后一种组合在数学和代码任务上取得了最佳效果。 | 
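上文 Magistral 的“格式不合规立即零奖励”规则，可以用一个小型校验函数示意：先检查思维链标签，再检查方框答案。正则表达式与具体奖励数值（1.0 / 0.1 / 0.0）均为本文假设，并非论文原始实现：

```python
import re

# Magistral 式格式奖励的示意（非官方实现）：不满足格式直接得 0 分，
# 满足格式后再比对方框答案是否正确。
def format_reward(response, gold):
    m = re.fullmatch(r"<think>(.+?)</think>\s*(.*)", response, re.DOTALL)
    if not m:
        return 0.0                       # 缺少思维链标签：零奖励
    answer = re.search(r"\\boxed\{([^}]*)\}", m.group(2))
    if not answer:
        return 0.0                       # 缺少方框答案：零奖励
    return 1.0 if answer.group(1).strip() == gold else 0.1

good = "<think>2+2=4</think> The answer is \\boxed{4}"
print(format_reward(good, "4"), format_reward("just 4", "4"))
```

这种“格式门槛 + 正确性打分”的分层结构，使得强化学习早期就能稳定地强制输出模板，再逐步优化答案质量。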
[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.10910), [推文](https:\u002F\u002Fx.com\u002Fnrehiew_\u002Fstatus\u002F1932872389798605060) |\n| 8) 代码研究员  代码研究员是一个深度研究代理，旨在通过语义、模式和提交历史的多步推理来解决大型系统代码中的复杂 bug。它在kBenchSyz等基准测试上显著优于现有的编码代理（如SWE-agent），通过有效收集和过滤全局上下文后再生成并验证补丁，实现了58%的崩溃修复率。 | [论文](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fwp-content\u002Fuploads\u002F2025\u002F06\u002FCode_Researcher-1.pdf), [推文](https:\u002F\u002Fx.com\u002Fadityakanade0\u002Fstatus\u002F1932044846124151199) |\n| 9) LlamaRL  LlamaRL是一个完全分布式的异步强化学习框架，专为高效的大规模LLM训练而设计（8B至405B+模型）。它通过结合本地模型卸载、异步策略外训练（AIPO）和快速的GPU原生权重同步（DDMA），相比DeepSpeed-Chat实现了高达10.7倍的速度提升，同时在数学推理等任务上保持了模型质量。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24034), [推文](https:\u002F\u002Fx.com\u002Frobinphysics\u002Fstatus\u002F1931181903689719875) |\n| 10) 利用AI预测气旋路径  Google DeepMind的天气实验室推出了一款交互式平台，内置AI气旋预报模型，可生成多达15天的50种集合情景预测，在路径和强度预测方面优于传统系统。该平台整合了实时和历史预报数据以及评估基准，目前已被美国国家飓风中心等机构用于支持早期预警。 | [论文](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-DeepMind.com\u002FBlog\u002Fhow-we-re-supporting-better-tropical-cyclone-prediction-with-ai\u002Fskillful-joint-probabilistic-weather-forecasting-from-marginals.pdf), [推文](https:\u002F\u002Fx.com\u002FGoogleDeepMind\u002Fstatus\u002F1933178918715953660) |\n\n## 本周顶级机器学习论文（6月2日至6月8日）- 2025年\n| **论文**  | **链接** | \n| ------------- | ------------- |\n| 1) 思考的幻觉  通过系统性地分析前沿大型推理模型（LRMs），如Claude 3.7、DeepSeek-R1和OpenAI的o系列，在不同问题复杂度下的表现，探讨这些模型的能力与局限性。作者并未依赖存在污染且缺乏结构的传统数学基准测试，而是采用四种可控的谜题（汉诺塔、跳棋、过河问题和积木世界）来评估LRMs，这些谜题能够实现细粒度的复杂度调节，并支持透明的推理轨迹分析。主要发现如下：  \u003Cbr>● 三种复杂度阶段：研究识别出不同的性能阶段。在低复杂度任务中，非思考型LLM由于计算更高效直接而优于LRMs；中等复杂度时，推理模型凭借更长的思维链纠正错误而占据优势；然而，在高复杂度情况下，无论是否具备推理框架，所有模型的准确率都会降至接近零。 \u003Cbr>● 反直觉的推理崩溃：令人惊讶的是，当问题复杂度超过某一阈值后，LRMs会减少其推理投入（即用于思考的token数量）。这表明存在一种内在的规模效应失效：它并非由token限制所致，而是源于模型自身的行为。 \u003Cbr>● 推理轨迹效率低下：LRMs在简单问题上常常“过度思考”，尽管很早就找到了正确答案，但仍会继续探索错误路径。对于中等难度的任务，它们往往较晚才纠正错误；而在复杂任务中，则完全无法找到有效解。基于位置的准确率分析揭示了正确解在推理轨迹中生成时间上的系统性偏移。 \u003Cbr>● 
无法执行明确的算法：即使提供了正确的伪代码（例如汉诺塔递归算法），模型在相似复杂度下仍会失败。这说明LRMs不仅难以找到解决方案，连可靠地执行逻辑指令也做不到。 \u003Cbr>● 不同谜题间表现不一：模型在汉诺塔问题（N=10）中可完成100余步正确操作，但在过河问题（N=3）中仅4步便告失败，暗示其性能更多取决于对训练数据的熟悉程度，而非问题本身的固有复杂度。 | [论文](https:\u002F\u002Fml-site.cdn-apple.com\u002Fpapers\u002Fthe-illusion-of-thinking.pdf), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1931333830985883888) |\n| 2) 从Token到Thoughts  本文提出了一种信息论框架，用以考察LLM是否像人类一样组织语义知识，同时在压缩与意义之间取得平衡。基于率失真理论和信息瓶颈原理，作者将30多个LLM的token嵌入与认知心理学中的经典人类分类基准进行了对比。  \u003Cbr>● LLM确实能形成与人类分组高度一致的宽泛概念类别。调整后的互信息得分显示，LLM聚类始终优于随机基线，甚至小型编码器模型如BERT也能在这一对齐任务上媲美或超越更大的解码器-only模型。 \u003Cbr>● 然而，LLM在细粒度语义方面表现欠佳。当测试其能否反映人类对事物典型性的认知（例如知更鸟比企鹅更具典型性）时，LLM嵌入相似性与人类评分之间的相关性却十分微弱且不稳定。大多数模型未能捕捉到人类认知中明显的原型分级结构。 \u003Cbr>● 借助统一的损失函数L（兼顾信息复杂度与语义失真），作者发现LLM能够生成统计上高效的聚类，具有较低的熵和失真；而人类的概念聚类则较为松散，但保留了更为丰富的细微差别。这表明LLM为了追求压缩效率而过度优化，牺牲了意义，而人类则愿意容忍一定的低效，以维持适应性强、灵活的结构。 \u003Cbr>● 论文最后指出，尽管LLM可以模仿表面层次的分类，但在语义表征方式上却存在根本性差异，凸显了人工与人类语义系统之间的核心差距，并为改进符合人类价值观的概念表征提供了一个量化工具。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.17117) |\n| 3) 知识还是推理  提出了一套细粒度的评估框架，将LLM的思考分解为两个组成部分：知识正确性和推理信息量，分别通过知识指数（KI）和信息增益（InfoGain）来衡量。作者利用该框架评估了推理能力在不同领域的迁移情况，特别是医学和数学领域，所用模型包括Qwen2.5-7B及其通过SFT和RL训练得到的DeepSeek-R1蒸馏版本。主要发现如下：  \u003Cbr>● SFT提升知识但可能损害推理：监督微调提高了事实准确性（例如在医学任务中KI提升6.2%），但通常会导致冗长或重复的推理，使InfoGain平均下降38.9%，相比基础模型而言。 \u003Cbr>● RL在医学场景中同时提升推理与知识：强化学习增强了推理的清晰度并剔除了错误知识，使KI平均提升12.4点。它通过引导模型走向更符合事实的推理路径来改善推断能力。 \u003Cbr>● 领域影响显著：虽然数学任务更受益于推理（更高的InfoGain），但医学任务则高度依赖领域知识（更高的KI）。事实上，在医学基准测试中，KI与任务准确性的相关性（0.998）远高于InfoGain（0.698）。 \u003Cbr>● 基础模型在医学领域表现优于R1蒸馏版：Qwen-Base在准确率、InfoGain和KI等方面均持续优于DeepSeek-R1蒸馏模型。R1蒸馏版在医学适应性方面表现不佳，很可能是因为预训练时偏向数学\u002F代码领域。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.02126), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1930640490786951365) |\n| 4) 开放思考  
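正文提到“思考的幻觉”用汉诺塔做可控复杂度评测，且模型在 N=10 时可完成 100 余步正确操作；最优解恰为 2^N − 1 步。下面的递归生成器（示意代码，非论文原实现）可用来枚举最优移动序列并逐步核对模型输出：

```python
# 汉诺塔最优解生成器：把 n 个盘子从 src 经 aux 移到 dst，
# 最优步数为 2**n - 1，可作为逐步核对模型输出的参照。
def hanoi(n, src="A", aux="B", dst="C"):
    if n == 0:
        return []
    return (hanoi(n - 1, src, dst, aux)      # 先把上面 n-1 个移到辅助柱
            + [(n, src, dst)]                # 移动最大的盘子
            + hanoi(n - 1, aux, src, dst))   # 再把 n-1 个移回目标柱

moves = hanoi(10)
print(len(moves))   # N=10 时最优解共 1023 步
```

这正是该类谜题适合做复杂度调节的原因：问题规模 N 每加 1，所需正确步数翻倍，失败点可以被精确定位。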
本文提出了OpenThoughts3，一套系统化的监督微调（SFT）数据构建方案，旨在提升开源推理模型的性能。作者基于超过1000次受控实验构建了120万条示例数据集（OpenThoughts3-1.2M），并以此训练了一款70亿参数的模型OpenThinker3-7B。尽管未使用强化学习，OpenThinker3-7B在标准的数学、代码和科学推理基准测试中仍优于所有其他公开数据的70亿和80亿参数模型，甚至击败了采用更大规模或混合SFT+RL流程训练的模型。关键见解与贡献如下：  \u003Cbr>● 同级别最佳的开源70亿参数模型：OpenThinker3-7B在AIME25（53.3%）、LiveCodeBench（51.7%）和GPQA Diamond（53.7%）等基准测试中均取得了最先进的成绩，其表现比DeepSeek-R1-Distill-Qwen-7B高出15至20个百分点。 \u003Cbr>● 清晰设计下的规模法则：作者对数据流水线的每一步——来源选择、过滤、教师模型挑选、去重以及答案采样——都进行了消融实验，展示了每一步如何逐步提升性能。例如，为每个问题提供多份答案（16倍）的效果，远胜于单纯增加问题多样性。 \u003Cbr>● QwQ-32B作为教师的表现优于更强的模型：令人惊讶的是，尽管QwQ-32B的基准测试分数低于DeepSeek-R1或Phi-4，但它所指导的学生模型却表现更好，这表明教师的选择对推理轨迹质量的影响，远大于其原始性能高低。 \u003Cbr>● 过滤比验证更重要：根据回答长度和LLM估计的难度进行问题过滤，比传统的启发式方法（如fastText）甚至基于正确性验证的过滤，更能预测后续效果。 \u003Cbr>● 数据质量胜于多样性：在每个领域仅混合前1至2个优质问题来源，其效果始终优于广泛使用多种来源的做法，这表明问题质量比数据集的异质性更为重要。 \u003Cbr>● 开源影响力：完整的数据集和模型已在[openthoughts.ai](http:\u002F\u002Fopenthoughts.ai\u002F)上发布，为开放推理研究提供了可复现的基准。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.04178), [推文](https:\u002F\u002Fx.com\u002Flschmidt3\u002Fstatus\u002F1930717405812269273) |\n| 5) 具有多模态浏览能力的编程代理  提出了OpenHands-Versa，这是一种统一的代理架构，旨在通过赋予单一代理代码执行、多模态网页浏览以及文件\u002F搜索访问三大通用能力，在编程、网页浏览及多模态信息获取等多个领域表现出色。与针对特定领域优化的专用或多代理系统不同，OpenHands-Versa力求以最小的架构复杂度解决广泛的现实任务。主要亮点如下：  \u003Cbr>● 统一工具集，覆盖范围更广：OpenHands-Versa将视觉化网页浏览、搜索API接入和多模态文件处理整合进OpenHands编程框架中。尽管结构简单，其在三个基准测试中的成功率仍超过了专用代理：SWE-Bench Multimodal（+9.1%）、GAIA（+1.3%）和The Agent Company（+9.1%）。 \u003Cbr>● 基准测试的泛化能力：该代理的表现与OWL-roleplaying和Magentic-One等多代理系统相当甚至更好，而后者在跨领域泛化方面往往力不从心。例如，OWL-roleplaying虽然在GAIA上表现优异，但在The Agent Company上却因工具通用性不足而表现较差。 \u003Cbr>● 领域感知的工具使用：分析显示，OpenHands-Versa能够根据不同的基准测试有效调整其工具使用方式（例如在GAIA中使用搜索API、在The Agent Company中使用浏览器、在SWE-Bench M中进行视觉验证），这与前代产品OpenHands形成鲜明对比，后者常出现工具误用或缺乏关键工具的情况。 \u003Cbr>● 单一代理，强劲效果：通过采用单代理设计，并以Claude-3.7或Claude Sonnet-4作为骨干LLM，OpenHands-Versa无需针对每项任务定制工具，即可达到SOTA水平。例如，在GAIA验证集上，其成功率达到64.24%，比多代理基线高出多达18个百分点。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.03011), 
[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1930277871999955166) |\n| 6) 自我挑战型语言模型代理  提出了一种新颖的多轮工具使用LLM代理自我改进方法，称为自挑战代理（SCA）。该方法完全基于代理自身生成的任务进行训练，无需人工标注的任务或评估。框架引入了一种名为“代码即任务”（CaT）的新任务格式，确保生成的任务既可行、可验证又具挑战性。研究表明，SCA在自我改进场景中可将性能提升一倍，并显著提高蒸馏效果。主要贡献与发现如下：  \u003Cbr>● 双角色驱动的自我生成任务：代理在挑战者角色和执行者角色之间交替切换，前者负责探索环境并创建任务，后者则通过强化学习学会解决这些任务。这一过程旨在模拟人类标注者与工具互动以设计有意义任务的方式。 \u003Cbr>● “代码即任务”（CaT）的制定：每个合成任务都包含指令、基于Python的验证函数、可行的解决方案以及若干失败案例。这种结构通过自动代码执行检查，过滤掉琐碎、不可能或不可验证的任务，从而保证任务质量。 \u003Cbr>● 蒸馏与自我改进双重成效：SCA将Llama-3.1-8B-Instruct模型的学习成功率从12.0%提升至23.5%，当其基于自身任务进行学习时。在蒸馏场景中（以700亿参数的教师模型为指导），SCA使模型的Pass@1指标提升至32.2%，在所有工具使用环境中均优于之前的PAE基线。 \u003Cbr>● 人工标注与消融实验证实任务质量：与PAE相比，使用CaT生成的任务显著减少了假阳性和假阴性结果。详细分析表明，CaT的过滤机制能够去除缺陷任务，同时在与Llama-3.1-70B等强模型结合使用时仍保持多样性。 \u003Cbr>● 规模与训练动态：更多样化的任务（而不仅仅是每个任务的更多轨迹）能够带来更好的泛化效果，强调了广泛合成覆盖的重要性。在线RL方法如PPO和GRPO可进一步提升性能，但需要更高的调优成本和计算资源。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01716), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1930748591242424439) |\n| 7) AlphaOne  提出了一种通用框架α1，用于在推理过程中调控大型推理模型（LRMs）的推理进度。不同于僵化或自动化的调度策略，α1通过一个可调参数α显式控制模型何时以及如何进入“慢思考”状态。该方法会动态插入“等待”token以鼓励更深入的推理，随后再以“\u003C\u002Fthink>”token确定性地结束慢思考，从而促使模型高效生成答案。这种方法在准确性和效率上均优于以往的测试时缩放技术。主要见解如下：  \u003Cbr>● 先慢后快的推理优于其他策略：与人类直觉（先快后慢）相反，模型在开始时采用慢思考，随后过渡到快速推理反而更有利。这种“前期集中投入”的安排能够带来更准确的问题解答。 \u003Cbr>● 通过α1的密集调控提升准确性和效率：通过α调度的“等待”token插入持续调整推理节奏，α1的表现优于现有的测试时策略，如s1（单调递增）和CoD（单调递减），在部分基准测试中可实现高达6.15%的准确率提升，同时最多可减少14%的token使用量。 \u003Cbr>● 线性退火是最有效的调度策略：在测试的几种控制“等待”插入的函数（恒定、线性递增、指数\u002F线性退火）中，线性退火——即逐渐降低“等待”token的频率——被证明在多个模型和数据集上均表现最佳。 \u003Cbr>● α1之后的关键时刻调控：单纯插入“等待”token容易导致慢思考陷入惯性。α1通过将后续的“等待”token替换为“\u003C\u002Fthink>”，有效强制模型转向快速推理，从而在某些任务中将性能提升高达20%。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.24863), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1929551555948400840) |\n| 8) Common Pile v0.1  Common Pile v0.1是一个8TB大小的开源许可文本数据集，专为LLM预训练设计，旨在解决未经授权数据使用的法律和伦理问题。基于该数据集训练的两款70亿参数模型Comma v0.1-1T和2T，其性能可与LLaMA 
1和2相媲美。该数据集、代码以及模型检查点均已公开发布。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.05209), [推文](https:\u002F\u002Fx.com\u002FAiEleuther\u002Fstatus\u002F1931021637991755906) |\n| 9) RewardBench 2  RewardBench 2是一个新的多技能基准测试，用于评估奖励模型，其特点是采用了更具挑战性的人类提示，并与下游性能具有更强的关联性。该基准突出了当前奖励模型有效性方面的不足，旨在推动更严格的评估；结果显示，现有模型的得分比其前身低约20分。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.01937), [推文](https:\u002F\u002Fx.com\u002Fsaumyamalik44\u002Fstatus\u002F1929654864604549348) |\n| 10) LLM中的记忆化  本研究提出了一种量化模型记忆与泛化的比例的方法，估算GPT类模型的容量约为每参数3.6比特。通过对数百个模型进行训练，作者发现记忆化会在泛化（“grokking”）发生之前就达到饱和，并由此推导出新的规模法则，将模型容量、数据规模和成员推断联系起来。 | [论文](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2505.24832) |\n\n## 本周机器学习顶级论文（5月26日—6月1日）- 2025年\n| **论文**  | **链接** | \n| ------------- | ------------- |\n| 1) RAG系统的新视角  提出了一种新的概念性和经验性框架，用于从“充分上下文”的角度分析RAG系统——即仅凭检索到的内容是否足以回答问题。这一概念有助于将检索失败与大模型的生成错误解耦，从而更清晰地理解模型在不同上下文充足性条件下的行为。  主要发现：  \u003Cbr>● 充分上下文的新定义与分类器：作者将“充分上下文”形式化为一种在无需真值答案的情况下，合理允许回答问题的上下文。他们开发了一款高精度的大模型自动标注器（Gemini 1.5 Pro，准确率93%），用于将实例标记为具有充分或不充分上下文，从而实现大规模评估而无需真值答案。 \u003Cbr>● 充分上下文并不等同于保证正确：即使存在充分上下文，最先进的大模型如GPT-4o、Claude 3.5和Gemini 1.5仍然会比选择不回答时更频繁地产生幻觉式答案。相反，模型有时也能在上下文不足的情况下给出正确答案，这很可能得益于其参数记忆。 \u003Cbr>● 基准数据集中存在大量上下文不足的情况：对HotPotQA、Musique和FreshQA等数据集的分析表明，即使采用精心策划或Oracle式的检索方案，仍有相当一部分查询缺乏充分上下文（例如，在Musique和HotPotQA中超过50%）。 \u003Cbr>● 选择性生成可提高事实准确性：作者提出了一种“选择性RAG”方法，该方法结合模型自身的置信度与充分上下文自动标注器来决定是否回答或选择不回答。这种方法在Gemini、GPT和Gemma等模型上均能带来2–10%的一致性提升（针对已回答的问题）。 \u003Cbr>● 单纯微调不足以解决问题：尝试对Mistral 3 7B等小型模型进行微调以改善其不回答行为（例如，训练它们在上下文不足时说“我不知道”），虽然适度提高了不回答的比例，但往往降低了准确率，或未能有效抑制幻觉现象。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.06037), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1927737131478188295) |\n| 2) 自我改进智能体的开放式进化  本研究提出了达尔文哥德尔机（DGM），这是一种结合自指代码修改与开放式进化搜索的系统，旨在推进自我改进型AI的愿景。与原始的哥德尔机不同，后者要求代码更改必须具备可证明的收益（这一约束在实践中几乎无法满足），DGM则采取了一种基于经验的方法：它会修改自身的代码库，并在编码基准测试上评估改进效果。  主要贡献与发现：  \u003Cbr>● 
自指的自我改进循环：DGM最初只有一个编码智能体，它会编辑自己的基于Python的代码库，以提升自身读取、写入和执行代码的能力，同时使用冻结的基础模型（FM）。每次修改都会在SWE-bench和Polyglot等基准上进行评估，只有表现优异的智能体才会被保留下来继续迭代。 \u003Cbr>● 基于进化档案的开放式探索：受达尔文进化论启发，系统会维护一个包含所有先前智能体的档案，并根据性能和新颖性来选择父代。这种方式能够帮助智能体跳出局部最优解，持续推动创新，甚至重新利用之前表现不佳但后来成为重要跳板的变体。 \u003Cbr>● 经验性的性能提升：经过80轮迭代，DGM在SWE-bench上的编码成功率从20.0%提升至50.0%，在Polyglot上的成功率则从14.2%提升至30.7%，其表现优于那些既不具备自我改进能力又缺乏开放式特性的强大基线模型。其最佳智能体的表现甚至可以媲美或超越由人类设计的领先开源编码智能体。 \u003Cbr>● 涌现出的工具与工作流改进：通过自我改进，DGM不断优化自身的功能，演化出更精细的编辑工具、重试与评估机制、历史感知的补丁生成以及面向长上下文的代码摘要等功能。 \u003Cbr>● 跨模型与任务的泛化能力：DGM发现的智能体在迁移至不同的基础模型（如Claude 3.5到3.7、o3-mini）以及不同编程语言时，仍能保持良好的泛化能力，这表明其改进并非过度拟合于特定设置。 \u003Cbr>● 安全性设计：所有实验均在沙盒环境中进行，并受到监控和范围限制。论文还讨论了未来自我改进系统若将安全性与可解释性纳入评估标准，如何进一步发展出更安全、更易理解的行为。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22954), [推文](https:\u002F\u002Fx.com\u002Fhardmaru\u002Fstatus\u002F1928284568756629756) |\n| 3) 大模型内存增强型生成的操作系统  提出了一种统一的操作系统，用于管理内存增强型大模型，以解决当前架构中的关键局限性：即缺乏结构化、持久且可控的记忆机制。尽管当今的大模型主要依赖于参数记忆（模型权重）和有限的短期上下文，MemOS却提出了一套全面的内存生命周期与管理基础设施，旨在支持持续学习、行为一致性以及知识演进。  主要贡献与组件包括：  \u003Cbr>● 三层内存分类法：MemOS区分了参数记忆（长期权重）、激活记忆（短期运行时状态）和明文记忆（可编辑的外部内容）。这些类型通过一个名为“内存立方体”（MemCube）的共享抽象统一起来，从而实现无缝转换（例如，明文转参数）和生命周期治理。 \u003Cbr>● MemCube抽象层：每个MemCube封装了内存元数据（创建时间、类型、访问策略等）以及语义负载（文本、张量、LoRA补丁）。这种设计使得动态调度、可追溯的更新以及模块和智能体之间的互操作成为可能。 \u003Cbr>● 模块化的操作系统架构：MemOS由三个层次组成——界面层（用户\u002FAPI交互）、操作层（内存调度、生命周期管理）和基础设施层（存储、访问治理），三者协同工作以管理内存的解析、注入、转换和归档。 \u003Cbr>● 封闭式执行流程：每一次交互（例如，回复提示）都可能触发由调度规则和生命周期政策所管控的内存操作。检索到的内存可以被注入到生成过程中，也可以被存入档案，或转化为其他类型以供长期使用。 \u003Cbr>● 以内存为中心的未来愿景：论文提出“内存训练”将成为继预训练和微调之后的下一个前沿领域，使模型能够持续学习。未来的工作还包括跨模型的内存共享、自进化内存块以及去中心化的内存市场。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.22101), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1928116365640225222) |\n| 4) 使用工作流图构建生产级对话智能体  本文提出了一种实用且适用于生产的框架，用于借助工作流图构建由大模型驱动的对话智能体，特别关注电子商务场景。作者并未完全依赖端到端的生成方式，而是采用有向无环图（DAG）来设计智能体，从而实现灵活而可控的交互，同时严格遵守业务规则和格式约束。  主要贡献与发现包括：  \u003Cbr>● 
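达尔文哥德尔机（DGM）维护全部历史智能体的档案，并按“性能 + 新颖性”选择父代以避免收敛到单一谱系。下面用一个玩具函数示意这一选择机制（档案条目、以被选次数倒数近似“新颖性”的做法均为本文假设，非论文原实现）：

```python
import random

# DGM 式开放进化档案中父代选择的玩具示意（非官方实现）：
# 同时考虑基准性能与新颖性（这里用被选择次数的倒数粗略近似新颖性）。
random.seed(1)
archive = [  # 每个条目：智能体 id、基准得分、历史被选次数
    {"id": "a0", "score": 0.20, "chosen": 5},
    {"id": "a1", "score": 0.35, "chosen": 2},
    {"id": "a2", "score": 0.28, "chosen": 0},
]

def pick_parent(archive):
    # 采样权重 = 性能 + 新颖性奖励，使低分但少被探索的变体仍有机会成为跳板
    weights = [a["score"] + 1.0 / (1 + a["chosen"]) for a in archive]
    parent = random.choices(archive, weights=weights, k=1)[0]
    parent["chosen"] += 1
    return parent

parent = pick_parent(archive)
print(parent["id"])
```

与贪心只保留最优者不同，这种加权采样正对应正文所说“重新利用之前表现不佳但后来成为重要跳板的变体”。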
多状态DAG框架：图中的每个节点对应一个对话状态，拥有独立的系统提示、工具访问权限和执行规则。这种结构能够通过将逻辑和格式控制限定在特定的图节点内，有效处理约束问题（如避免产生幻觉式回答或不符合规范的建议）。 \u003Cbr>● 基于响应掩码的微调：由于对话轮次来自DAG中的不同节点，作者引入了一种微调策略，即通过应用选择性损失掩码，仅让大模型学习与特定节点上下文相关的回答。这样可以避免提示冲突，并更好地遵守节点特定的约束条件。 \u003Cbr>● 实际部署与结果：在KakaoTalk和Web平台上的实际部署中，基于图的方法在任务准确率（+52%）和格式合规性（+50%）等关键指标上显著优于基线智能体，甚至优于GPT-4o。在人工偏好测试中，其内部模型在63%的真实用户案例中被优先选择，尤其是在产品推荐和安全关键任务方面。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.23006), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1928492639906607297) |\n| 5) 虚假奖励  本研究挑战了关于数学推理任务中可验证奖励强化学习（RLVR）的普遍假设。作者表明，Qwen2.5-Math模型即使在使用虚假或有缺陷的奖励进行训练时，也能在强化学习的推动下显著提升性能。  \u003Cbr>● 意外有效的虚假奖励：Qwen2.5-Math-7B模型在使用随机奖励时准确率提升了21.4%，在使用基于格式的奖励时提升了16.4%，而在明确训练其回答错误答案时则提升了24.6%。这些提升幅度接近于使用真值奖励信号时的28.8%提升，这表明RLVR更多是激发了模型潜在的能力，而非真正教授新的推理技能。 \u003Cbr>● 模型特异性的泛化能力：虚假奖励在Llama3或OLMo2等其他模型上并不奏效，只有Qwen系列模型始终受益。作者认为这与预训练过程的差异有关。值得注意的是，Qwen2.5-Math表现出一种独特的“代码推理”行为，会生成类似Python的代码来解决问题，而这种行为在RLVR后变得更加频繁，并与准确率高度相关。 \u003Cbr>● 提升背后的机制：作者将性能提升归因于推理策略的变化。大部分提升来自于语言到代码的转变，即模型在RLVR过程中从自然语言推理转向代码推理。那些明确增加代码使用频率的干预措施（例如，奖励类似代码的回答或使用强制代码的提示）能够进一步提升性能，但仅限于Qwen系列模型。 \u003Cbr>● 截断机制使模型能够从噪声中学习：即使使用随机奖励，性能依然会提升，这是因为GRPO的截断机制会偏向于强化模型高概率的行为。对于Qwen系列模型而言，这些行为恰好与正确性一致，而对于其他模型则不然。 | [论文](https:\u002F\u002Fgithub.com\u002Fruixin31\u002FRethink_RLVR\u002Fblob\u002Fmain\u002Fpaper\u002Frethink-rlvr.pdf), [推文](https:\u002F\u002Fx.com\u002FStellaLisy\u002Fstatus\u002F1927392717593526780) |\n| 6) 学会无需外部奖励的推理  提出了一种无需任何外部奖励或标注数据即可通过强化学习训练大模型的方法。该方法仅以模型自身的自我确定性——基于与均匀分布的KL散度计算得出的置信度——作为唯一的内在奖励。这一自我改进策略属于更广泛的“来自内部反馈的强化学习”（RLIF）范式，能够绕过“可验证奖励强化学习”（RLVR）的局限性，后者需要领域特定的验证者和黄金标准输出。  主要亮点：  \u003Cbr>● INTUITOR在无外部监督的情况下媲美GRPO：当应用于GSM8K和MATH500等数学推理任务时，INTUITOR的表现与GRPO（一种强大的RLVR方法）相当，即便未使用黄金标准答案。在LiveCodeBench和CRUXEval等域外任务上，INTUITOR的泛化能力更强，其性能提升也高于GRPO（分别为+65% vs. 0%和+76% vs. 
+44%）。 \u003Cbr>● 快速的早期学习与更强的指令遵循能力：INTUITOR显著提升了早期训练阶段的性能，尤其在Qwen2.5-1.5B等模型上，并改善了对聊天式指令的遵循，减少了重复或无意义的输出。 \u003Cbr>● 涌现的结构化推理：经过训练的模型即使在未明确要求的情况下也会自发地进行推理，通常会在生成代码或答案之前先给出解释或规划步骤。这种行为与模型在代码生成等领域的更好迁移能力相关。 \u003Cbr>● 自我确定性作为一种稳健且抗破解的信号：与容易被利用的固定奖励模型不同，在线自我确定性会随着模型的状态动态调整，从而避免奖励被恶意操纵。经统计检验证实，INTUITOR训练后的模型中，自我确定性与正确答案之间存在最强的相关性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.19590), [推文](https:\u002F\u002Fx.com\u002Fxuandongzhao\u002Fstatus\u002F1927270931874910259) |\n| 7) 通过思维混合学习推理  尽管大多数现有方法只在单一模态下进行训练，而在推理时才进行集成，但本研究提出了“思维混合”（MoT）方法，该方法能够在多模态间联合训练和推理，从而显著提升逻辑推理性能。主要发现如下：  \u003Cbr>● 三种模态的协同效应：MoT利用自然语言提供可解释性，使用代码进行结构化的程序性推理，同时借助真值表显式枚举逻辑情况。错误分析表明，真值表能够显著减少大模型常见的失误模式，如遗漏分支或无效逆命题等。 \u003Cbr>● 自我演进的训练过程：MoT引入了一种迭代式的策略内训练循环，模型会生成、筛选并从自身的多模态推理轨迹中学习。这种联合训练的效果优于单一模态或部分模态的设置。 \u003Cbr>● 基于投票的推理：在测试时，MoT会分别从每种模态生成预测，并选择多数答案，从而获得稳健的预测结果。结果显示，在FOLIO和ProofWriter等任务上，平均准确率可提升多达11.7个百分点，9B规模的模型甚至能够达到GPT-4 + Logic-LM的水平。 \u003Cbr>● 在更复杂的任务上表现更佳：MoT在推理深度较高的问题（5–8步）上取得了最大的改进。此外，它在测试时的扩展性也更为出色，即使在固定的推理预算下，也能生成更加多样化和准确的输出。MoT证明了，大模型可以通过像人类一样运用多种思维方式来进行推理，而不仅仅是在单一模态上进行更多的采样，就能实现更稳健的逻辑推理。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15817), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1925574200405721210) |\n| 8) QwenLong-L1  一种新的强化学习框架，通过渐进式上下文扩展和混合奖励，将大型推理模型（LRMs）从短上下文扩展到长上下文。该框架在七个长上下文基准测试中取得了顶尖成绩，超越了OpenAI-o3-mini和Qwen3-235B-A22B等模型，并与Claude-3.7-Sonnet-Thinking相媲美，展示了其在高达12万token输入下的强大推理能力。 | [论文](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2505.17667) |\n| 9) GUI智能体的端到端策略优化  ARPO提出了一种端到端的强化学习方法，用于通过带有经验回放的群体相对策略优化（GRPO）训练GUI智能体。该方法显著提升了OSWorld基准测试中的域内性能，较基线高出最多6.7%，同时在域外任务上也有小幅提升，并可通过结构化的奖励反馈实现自我纠正行为。 | [论文](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2505.16282), [推文](https:\u002F\u002Fx.com\u002FTsingYoga\u002Fstatus\u002F1926646893175615943) |\n| 10) 支持可扩展代理式推理的通用智能体  提出了Alita，这是一种通用智能体框架，通过最小的预定义和最大的自我演进来实现可扩展的代理式推理。与依赖手工打造工具的传统智能体不同，Alita能够自主利用网络搜索和代码合成构建可重用的MCPs（模型上下文协议），并在GAIA、MathVista和PathVQA等基准测试上超越了OpenAI 
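INTUITOR 的内在奖励“自我确定性”是下一词分布相对均匀分布的 KL 散度，可化简为 log V − H(p)。下面的函数是该定义的直接数值示意（真实系统在序列中每个 token 位置计算并取平均，此处仅演示单个分布）：

```python
import math

# INTUITOR 的内在奖励是模型的“自我确定性”：下一个词分布 p 相对
# 均匀分布的 KL 散度。KL(p || Uniform) = log V - H(p)，分布越尖锐值越大。
def self_certainty(p):
    V = len(p)
    entropy = -sum(x * math.log(x) for x in p if x > 0)
    return math.log(V) - entropy

confident = [0.97, 0.01, 0.01, 0.01]   # 假想的高置信分布
unsure = [0.25, 0.25, 0.25, 0.25]      # 完全不确定：奖励为 0
print(self_certainty(confident) > self_certainty(unsure))
```

由于该信号随模型自身状态在线变化，不像固定奖励模型那样存在可被反复利用的静态漏洞，这正是正文所说“抗破解”的来源。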
DeepResearch和OctoTools等复杂系统。 | [论文](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2505.20286) |\n\n## 本周顶级机器学习论文（5月19日–5月25日）- 2025年\n| **论文**  | **链接** | \n| ------------- | ------------- |\n| 1) 视觉规划  提出了一种新颖的推理范式，用基于图像的推理取代基于语言的规划。作者认为，在涉及空间或物理推理的任务中，语言并不总是最优的媒介。他们引入了视觉规划，将推理过程表示为一系列视觉状态（图像），无需任何文本中介，使模型能够直接以图像“思考”。这一方法通过名为VPRL（基于强化学习的视觉规划）的框架实现，该框架训练一个纯视觉模型（LVM-3B）使用图像进行规划。  主要贡献与发现：  \u003Cbr>● 纯视觉推理范式：作者正式将规划定义为自回归式的视觉状态生成，并仅使用图像数据进行训练。与将视觉映射到语言并进行文本推理的多模态大模型不同，这种方法完全在视觉模态中执行推理，从而绕过了模态差距。 \u003Cbr>● VPRL框架：提出了一种两阶段的训练流程。第一阶段采用监督学习，基于随机采样的轨迹来确保格式一致性并促进探索；第二阶段则应用GRPO（分组相对策略优化）算法，通过基于进展的奖励来优化规划行为，避免无效或倒退的动作。 \u003Cbr>● 更优的性能：在三个视觉导航任务（FrozenLake、Maze和MiniBehavior）上，VPRL在精确匹配得分上比基于语言的模型（如Gemini 2.5 Pro、Qwen 2.5-VL）高出40%以上。它在分布外任务（更大的网格尺寸）上的泛化能力也更强，视觉规划器的表现衰减更为平缓，而文本模型则更容易出现性能下降。 \u003Cbr>● 视觉规划带来鲁棒性和可解释性：与文本输出不同，视觉计划可以逐步骤检查，并且更能遵守物理约束。定性示例表明，VPRL能够避免无效动作并从次优路径中恢复，而语言模型往往会出现幻觉或对空间布局的误解。 \u003Cbr>● 探索与减少无效动作：第一阶段的随机策略初始化使得模型的探索能力优于监督基线（VPFT），表现为更高的熵和更少的无效动作。这有助于提升强化学习阶段的效果，最终增强规划能力。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.11409), [推文](https:\u002F\u002Fx.com\u002F_yixu\u002Fstatus\u002F1924497238908375072) |\n| 2) EfficientLLM  提出了首个大规模、基于实证的基准，用于评估大模型在架构、微调和推理方面的效率权衡。研究在高性能集群（48×GH200，8×H200 GPU）上进行，共评估了超过100个模型-技术组合，参数规模覆盖0.5B至72B，使用六项指标：内存利用率、计算利用率、延迟、吞吐量、能耗和压缩率。  主要见解包括：  \u003Cbr>● 没有放之四海而皆准的解决方案：每种效率技术都会在某些指标上提升，而在另一些指标上降低。例如，MoE可以提高准确率并减少FLOPs，但会增加约40%的显存占用；而int4量化可以在性能小幅下降3–5%的情况下，将内存和能耗降低多达3.9倍。 \u003Cbr>● 资源特定的最优解：效率取决于具体场景。MQA在资源受限设备上实现了最佳的内存-延迟权衡；MLA在高质量生成任务中具有最低的困惑度；RSLoRA仅在14B参数以上的模型中才比LoRA更高效。 \u003Cbr>● 跨模态的可迁移性：MQA和PEFT等效率技术在视觉及视觉-语言模型中同样适用，能够提升FID分数并保持良好的权衡关系。 \u003Cbr>● 训练与微调：LoRA和DoRA在小型模型（1–3B）上表现最佳，而RSLoRA则在大型模型（≥14B）上更具优势。参数冻结虽然能获得最低的延迟，但会略微牺牲准确率。 \u003Cbr>● 推理：训练后int4量化能够在质量略有下降的情况下带来最高的压缩比和吞吐量提升，而在现代GPU上，bfloat16在延迟和能耗方面始终优于float16。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.13840), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1925191664475222186) |\n| 3) J1  
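EfficientLLM 提到 int4 量化最多可把内存降低约 3.9 倍，下面用一段粗略算术示意其来源：仅比较权重存储，16 位与 4 位的理论比值为 4×，实测因激活、KV 缓存与量化元数据的开销而略低。模型规模 7B 为本文假设的示例值：

```python
# int4 量化节省的粗略算术示意：只计权重存储，忽略激活、KV 缓存
# 与量化元数据开销，因此理论比值 4x 略高于文中 3.9x 的实测数字。
def weight_bytes(n_params, bits):
    return n_params * bits / 8

n = 7_000_000_000                       # 假设一个 7B 参数模型
bf16 = weight_bytes(n, 16) / 2**30      # bfloat16 权重占用（GiB）
int4 = weight_bytes(n, 4) / 2**30       # int4 权重占用（GiB）
print(round(bf16, 1), round(int4, 1), round(bf16 / int4, 1))
```

这种量级估算也解释了为何正文强调“资源特定的最优解”：在显存受限设备上，4 倍的权重压缩往往比几个百分点的精度损失更有价值。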
提出了一种新的训练方法，通过明确激励模型在评判过程中进行深思熟虑的推理，使大模型能够充当评估者（LLM-as-a-Judge）。J1不依赖于单纯的提示工程或偏好微调，而是采用可验证奖励的在线强化学习，教导模型系统性地思考并完成评估任务。  主要见解：  \u003Cbr>● 可验证的评判框架：J1将可验证（如数学问题）和不可验证（如用户查询）的提示转化为具有可验证奖励的任务，方法是生成合成的偏好对比对。这种重新框架化使得强化学习和一致的训练信号能够在多样化任务中应用。 \u003Cbr>● 基于思维链的强化学习优化：J1通过明确的思维轨迹来训练模型进行评估推理，包括列出评估标准、生成参考答案并在做出判断前进行自我比较。训练了两种模型类型：Pairwise-J1（输出判决）和Pointwise-J1（输出质量评分）。Pairwise-J1模型还通过一致性奖励进一步改进，以减少位置偏差。 \u003Cbr>● 大规模下的优越性能：J1-Llama-8B和J1-Llama-70B在五个基准测试中（PPE、RewardBench、RM-Bench、JudgeBench、FollowBenchEval）均超越了现有的8B和70B LLM评估者，击败了使用更多数据训练的DeepSeek-GRM以及DeepSeek-R1的蒸馏模型。J1-70B甚至超过了o1-mini，并缩小了与更大规模的R1模型之间的差距，尤其是在不可验证任务上。 \u003Cbr>● Pointwise-J1缓解位置偏差：尽管成对评判可能会因回答顺序而改变结果，但仅接受成对监督训练的Pointwise-J1却能提供位置一致的评分，减少平局并提高一致性。两种类型的评判者都受益于测试时的自一致性缩放，进一步提升了可靠性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.10320), [推文](https:\u002F\u002Fx.com\u002Fjaseweston\u002Fstatus\u002F1923186392420450545) |\n| 4) 大模型指令遵循中的推理陷阱  探讨了推理增强型大语言模型（RLLMs）中一个意想不到的缺陷：尽管思维链（CoT）提示通常能提升复杂推理任务的性能，但它反而会降低指令遵循的准确性。作者评估了15个模型（如GPT、Claude、LLaMA、DeepSeek）在两个指令遵循基准上的表现，发现CoT提示几乎在所有模型和数据集上都持续降低了性能。  主要发现：  \u003Cbr>● 推理损害指令遵从：在IFEval基准上，14个模型中有13个在使用CoT提示后准确率下降；所有15个模型在ComplexBench上也都出现了退步。例如，Meta-LLaMA3-8B在IFEval上的准确率从75.2%降至59.0%，即使像Claude3.7-Sonnet-Think这样经过推理优化的模型，其表现也略逊于基础版本。 \u003Cbr>● 为什么推理会失效：手动案例研究表明，CoT可以帮助处理结构化格式（如JSON或Markdown）以及精确的词汇约束（如严格的标点符号）。然而，它往往会因为(a)在高层次内容规划中忽略简单约束，以及(b)插入看似有用但违反约束的内容（如在限制语言输出的场景中加入翻译）而造成负面影响。 \u003Cbr>● 基于注意力的诊断：作者引入了一种约束注意力指标，发现CoT会降低模型对指令相关词元的关注，尤其是在答案生成阶段。这种约束意识的减弱与性能下降密切相关。 \u003Cbr>● 缓解策略：提出了四种选择性应用推理的技术： | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.11423), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1924458157444579700) |\n| 5) 可泛化的AI预测跨癌种和治疗方案的免疫治疗效果  介绍了COMPASS，一种基于概念瓶颈的基础模型，利用肿瘤转录组数据预测患者对免疫检查点抑制剂（ICIs）的响应。与以往的生物标志物（TMB、PD-L1或固定基因签名）不同，COMPASS具有强大的可解释性和性能，能够在不同癌种、ICIs方案及临床情境中实现泛化。  主要贡献：  \u003Cbr>● 概念瓶颈架构：COMPASS将转录组数据转化为44个高层次的免疫相关概念（如T细胞耗竭、IFN-γ信号传导、巨噬细胞活性），这些概念源自132个精心筛选的基因集合。这种结构不仅提供了机制性的可解释性，还支持跨癌种建模。 \u003Cbr>● 
跨癌种预训练与灵活微调：COMPASS基于对比学习，在33种癌症的10,184个肿瘤样本上进行了预训练，并在16个接受ICIs治疗的临床队列（7种癌症，6种ICIs药物）上进行了评估。它支持全量、部分、线性及零样本微调模式，因此在数据丰富和稀缺的场景中都表现出色。 \u003Cbr>● 更强的泛化能力和准确性：在留一队列交叉验证中，COMPASS在精度上提高了8.5%，AUPRC提高了15.7%，MCC提高了12.3%，优于22种基线方法。它在零样本场景、跨药物类别（例如在训练抗PD1的基础上预测抗CTLA4的效果）以及小样本微调中也表现优异。 \u003Cbr>● 对耐药机制的深入洞察：个性化的响应地图揭示了可操作的生物学机制。例如，炎症型但无应答的患者可能通过TGF-β信号通路、血管排斥、CD4+ T细胞功能障碍或B细胞缺乏等方式产生耐药性。这些发现超越了传统的“炎症型\u002F沙漠型\u002F排斥型”表型，提供了更为精细的患者分层。 \u003Cbr>● 临床实用性和生存分层：COMPASS预测的应答者在一项保留的II期膀胱癌试验中表现出显著更好的生存率（HR = 4.7，*p* = 1.7e-7），优于标准生物标志物（TMB、PD-L1 IHC、免疫表型）。 | [论文](https:\u002F\u002Fwww.medrxiv.org\u002Fcontent\u002F10.1101\u002F2025.05.01.25326820v1) |\n| 6) 向更深入理解大模型中的推理迈进  本文探讨了大模型是否能在动态环境中适应并进行推理，从而超越静态基准测试。作者使用SmartPlay基准——一套包含四款需要多种认知技能的互动游戏——评估了三种提示策略：自我反思、启发式变异（通过Oracle）和规划。他们在不同规模的模型（Llama3-8B至Llama3.3-70B）上测试了这些方法，并得出了关于模型规模与提示策略如何与任务复杂性相互作用的若干结论。  主要发现：  \u003Cbr>● 模型规模主导性能，尤其是在反应性和结构化推理任务中。较大的模型（如Llama3.3-70B）在河内塔和赌徒问题等需要快速利用或空间规划的任务上明显优于小型模型。 \u003Cbr>● 高级提示策略对小型模型的帮助更大，特别是在复杂任务中。例如，Llama3-8B结合反思和Oracle提示在石头剪刀布游戏中表现超越了Llama3.3-70B的基础版本。然而，这些策略会带来较高的方差，根据具体运行情况可能导致性能低于基线。 \u003Cbr>● 长篇提示会对小型模型在简单任务上造成负面影响。在赌徒问题中，添加反思性推理反而会分散模型注意力或延长探索时间，这与此前关于提示长度和信噪比的研究结果一致。 \u003Cbr>● 提示策略的效果取决于任务类型。指令遵循在所有模型上都有所提升，而长文本理解则更有利于中型模型。相比之下，对于大型模型而言，规划、推理和空间挑战类任务中，这些策略的效果较弱甚至为负。 \u003Cbr>● 密集的奖励塑造比提示策略更能可靠地提升性能。在后续实验中，修改稀疏奖励信号（尤其是河内塔和Messenger任务）所带来的收益比调整提示策略更为稳定。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.10543), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1924182825693061403) |\n| 7) AdaptThink  本文介绍了AdaptThink，一个强化学习框架，旨在帮助推理模型根据任务难度决定何时使用详细的思维链推理（“Thinking”），何时直接生成答案（“NoThinking”）。该方法挑战了普遍认为应在所有问题上统一应用深度推理的观点，表明在简单任务上跳过“思考”步骤往往能带来更高的效率，甚至更高的准确率。  主要见解：  \u003Cbr>● NoThinking在简单问题上优于Thinking：作者证明，像DeepSeek-R1这样的模型在处理简单问题时，使用空的\u003Cthink>标记提示（即NoThinking模式）反而能取得更好的效果（在准确性和效率方面）。例如，在MATH500第1级的问题中，NoThinking以显著更少的token消耗获得了略高的准确率。 \u003Cbr>● AdaptThink学会切换模式：提出的强化学习算法引入了一种约束优化，只要准确率不下降，就会优先选择NoThinking模式。它采用一种新颖的重要性采样策略，使两种模式都能从一开始就实现冷启动学习，从而避免陷入全程“思考”的行为。 
\u003Cbr>● 效率与性能的巨大提升：在GSM8K、MATH500和AIME 2024任务中，AdaptThink将响应长度缩短了高达53%，并将准确率提高了最多2.4%，优于DeepSeek-R1-Distill-Qwen-1.5B。它还在准确率与响应长度的权衡上超越了先前的方法（如DPOShortest、TLMRE、ModelMerging）。 \u003Cbr>● 鲁棒性和泛化能力：AdaptThink能够推广到MMLU等分布外任务，在减少token使用的同时维持或提升准确率。它还能避免“NoThinking”响应中的“隐性思考”，在推理过程中表现出可控的行为。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.13417) |\n| 8) MedBrowseComp  MedBrowseComp是一个全新的基准，用于评估大模型代理通过浏览真实世界的专业领域资源来进行复杂、多跳医学事实查找的能力。通过对1,000多个基于临床的真实问题进行测试，该基准揭示了当前模型在这一能力上的重大差距，顶尖系统仅能达到50%的准确率，而基于GUI的代理表现则更加糟糕。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.14963), [推文](https:\u002F\u002Fx.com\u002Fshan23chen\u002Fstatus\u002F1925549357308236029) |\n| 9) ARC-AGI-2  ARC-AGI-2是一个新基准，旨在将AI推理的边界推向超越原始ARC-AGI的水平。它引入了更难、更独特的任务，强调组合性泛化和类似人类的流体智能，然而即便在ARC-AGI-1中表现出色的AI模型，在此基准上的准确率仍低于5%。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.11831), [推文](https:\u002F\u002Fx.com\u002Farcprize\u002Fstatus\u002F1924869061542085041) |\n| 10) 教导多模态大模型用图像思考  GRIT是一种新方法，通过将自然语言与边界框引用交织在一起，使多模态大模型能够进行基于视觉定位（grounding）的推理。借助强化学习方法（GRPO-GR），GRIT仅需20组图像-问题-答案三元组即可实现强大的推理与定位性能，其准确性和视觉连贯性均优于基线方法。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.15879), [推文](https:\u002F\u002Fx.com\u002FYFan_UCSC\u002Fstatus\u002F1925719736043569188) |\n\n## 本周顶级机器学习论文（5月12日–5月18日）- 2025\n| **论文**  | **链接** | \n| ------------- | ------------- |\n| 1) AlphaEvolve  AlphaEvolve 是由 Google DeepMind 开发的编码智能体，利用大语言模型引导的进化机制来发现新算法并优化计算系统。它构建了一个流水线，其中大语言模型生成代码变更，评估者提供反馈，进化循环则迭代改进解决方案。AlphaEvolve 表明，大语言模型可以超越传统的代码生成任务，助力科学与算法发现。  主要亮点：   \u003Cbr>● 新型算法发现：AlphaEvolve 发现了一种新的算法，仅用 48 次乘法即可完成 4×4 复数矩阵相乘，在这一场景下，这是自 Strassen 1969 年提出 49 次乘法方案以来的首次改进。   \u003Cbr>● 广泛的数学影响：在应用于 50 多个数学开放问题时，AlphaEvolve 在约 95% 的案例中达到或超越了当前最优水平。例如，它改进了关于 Erdős 最小重叠问题以及 11 维空间中 kissing number 的上界。   \u003Cbr>● Google 内部基础设施优化：AlphaEvolve 改进了 Google 计算堆栈中的关键组件：   \u003Cbr>● 先进的流水线设计：AlphaEvolve 使用 Gemini 2.0 Flash 和 Pro 模型的集成。它支持丰富的提示（包括过往试验、评估结果及明确上下文）、多目标优化，以及用于稳健筛选方案的评估级联机制。程序是在整个文件级别而非仅函数级别进行演化的，这一点是其区别于前代工具如 FunSearch 
的关键特征。   \u003Cbr>● 消融实验确认各组件的重要性：实验表明，进化机制、提示上下文、全文件级别的演化以及使用强大的大语言模型均对性能有显著贡献。移除其中任一项都会降低效果。 | [论文](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-DeepMind.com\u002FBlog\u002Falphaevolve-a-gemini-powered-coding-agent-for-designing-advanced-algorithms\u002FAlphaEvolve.pdf), [推文](https:\u002F\u002Fx.com\u002FGoogleDeepMind\u002Fstatus\u002F1922669321559347498) |\n| 2) 大语言模型在多轮对话中迷失方向  研究顶级大语言模型在缺乏明确指令的多轮交互中性能如何下降。这类场景在实际应用中很常见，但很少被评估。作者提出了一种新颖的“分片模拟”框架，将完整指令逐步拆解为多个对话片段，以模拟用户自然地分步提供信息的过程。  主要发现：   \u003Cbr>● 性能大幅下滑：在 15 款顶级大语言模型（如 GPT-4.1、Gemini 2.5 Pro、Claude 3.7）中，多轮对话下的平均性能比单轮对话低了 39%。即使是两轮交互也足以导致显著的性能下降。   \u003Cbr>● 高度不可靠，而不仅仅是能力不足：分解分析显示，模型的最佳能力（即上限性能）仅小幅下降，但不可靠性却增加了 112%，这意味着模型的表现会因对话的具体展开方式而变得极不稳定。   \u003Cbr>● 失败的根本原因：通过日志分析和实验，论文指出了四个主要问题：   \u003Cbr>● 分片式评估任务：作者构建了 600 多个跨 6 类任务（编码、数学、SQL、API 调用、摘要生成和表格标注）的多轮模拟，结果显示不同领域都存在一致性的性能退化。   \u003Cbr>● 代理式干预仅部分缓解：诸如总结和滚雪球式回复（重复所有先前轮次内容）等技术虽可将结果提升约 15–20%，但无法恢复到单轮对话的水平，这表明瓶颈在于模型内部机制，而非提示策略。   \u003Cbr>● 温度调节与推理专用模型也无法解决问题：即使将温度设为 0.0 或使用专门的推理模型（如 o3 和 DeepSeek-R1），模型在多轮对话中仍然高度不可靠。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.06120), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1922755721428598988) |\n| 3) 基于单一训练样本的强化学习提升大语言模型推理能力  本文展示了，即使仅使用一个训练样本，基于可验证奖励的强化学习（RLVR）也能显著提升大语言模型的数学推理能力。以 Qwen2.5-Math-1.5B 模型为例，单样本 RLVR 将 MATH500 基准测试的准确率从 36.0% 提升至 73.6%，几乎达到了使用超过 1,200 个样本训练的效果。而双样本 RLVR（使用两个样本）甚至略微超越了这一表现，与使用完整 7.5k 样本训练的结果相当。   \u003Cbr>● 极高的数据效率：仅需一个训练样本（π₁₃）就能将 MATH500 准确率提升至 73.6%，并将六个数学基准测试的平均成绩提高到 35.7%，媲美使用完整数据集的 RLVR 结果。双样本 RLVR 更进一步，分别达到 74.8% 和 36.6%。   \u003Cbr>● 广泛适用性：1 个样本的 RLVR 不仅适用于 Qwen2.5-Math-1.5B，还可在 Qwen2.5-Math-7B、Llama3.2-3B-Instruct 以及 DeepSeek-R1-Distill-Qwen-1.5B 上生效。该方法对 GRPO 和 PPO 等强化学习算法同样有效。   \u003Cbr>● 饱和后的泛化能力：尽管训练过程中的准确率在早期（约 100 步内）便已饱和，但测试准确率仍持续提升，运行 2,000 步后甚至还能额外提升 10% 左右。最终，模型会对单一样本产生过拟合（输出中混入无意义内容），然而测试性能依然保持稳定。   \u003Cbr>● 跨领域与自我反思行为：来自某一领域的单一样本（如几何）能够提升模型在其他领域的表现（如数论）。此外，经过 1 个样本 RLVR 训练的模型表现出更强的自我反思倾向（如“重新思考”、“重新计算”），且生成的响应序列更长。   
\u003Cbr>● 损失函数启示：消融实验证实，推动性能提升的主要因素是策略梯度损失，而非权重衰减，这与“grokking”现象有所区别。熵损失则进一步增强了性能和泛化能力；即便没有奖励信号，仅依靠熵损失训练也能使性能提升 27%。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.20571), [推文](https:\u002F\u002Fx.com\u002Fypwang61\u002Fstatus\u002F1917596101953348000) |\n| 4) AM-Thinking-v1  推出一款参数量达 320 亿的密集型开源语言模型，其在推理任务上的表现达到最先进水平，可与规模远大于它的专家混合模型（MoE）相媲美。该模型基于 Qwen2.5-32B 构建，完全使用公开数据训练，并展示了精心设计的后训练流水线如何在中等规模下释放竞争力。  主要要点：   \u003Cbr>● 基准测试表现：AM-Thinking-v1 在 AIME 2024 上获得 85.3 分，在 AIME 2025 上获得 74.4 分，在 LiveCodeBench 上获得 70.3 分，优于 DeepSeek-R1（671B MoE），并与 Qwen3-32B 和 Seed1.5-Thinking 相当或超越。在 Arena-Hard（通用聊天）上，其得分达到 92.5，接近 OpenAI o1 和 o3-mini 的水平，但略低于 Qwen3-235B-A22B 和 Gemini 2.5 Pro。   \u003Cbr>● 训练流程：该模型采用两阶段后训练方法，结合监督微调（SFT）和强化学习（RL）。SFT 强调“先思考再回答”的格式，使用 284 万个样本；而 RL 则引入难度感知采样，并通过组相对策略优化（GRPO）设计了两阶段课程。   \u003Cbr>● 数据与过滤：所有训练数据均来自公开来源，并经过严格过滤。数学数据经过大语言模型辅助清洗及跨模型真值验证。响应则通过困惑度、n-gram 重复检测和结构检查等手段进行过滤，以确保连贯性和正确性。   \u003Cbr>● 推理与部署：作者实现了一个定制的部署框架，将部署与推理分离，通过流式负载均衡器完成。这降低了长尾延迟，提高了分布式 GPU 节点的吞吐量，从而支持在 32k 序列长度下进行可扩展的强化学习训练。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.08311), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1922668488826741061) |\n| 5) HealthBench  HealthBench 是一个包含 5,000 条多轮医疗对话的基准测试，依据由 60 个国家的 262 名医生编写的 48,562 条评分标准进行打分。与以往的多项选择评估不同，HealthBench 支持对大语言模型在多样化医疗主题（如全球卫生、急诊护理、情境理解）及行为维度（准确性、完整性、沟通能力、情境意识、指令遵循）上的开放式、贴近现实的评估。   \u003Cbr>● 大模型的显著进步：HealthBench 显示，大语言模型的性能正在迅速提升。GPT-3.5 Turbo 得分为 16%，GPT-4o 达到 32%，而 o3 则高达 60%。值得注意的是，像 GPT-4.1 nano 这样的小型模型不仅表现优于 GPT-4o，而且成本仅为后者的 1\u002F25。   \u003Cbr>● 两种具有挑战性的基准变体：HealthBench Consensus 关注 34 条经医生验证的指标（如识别紧急情况），而 HealthBench Hard 则选取了 1,000 个最难的示例，这些示例中没有任何模型得分超过 32%，为未来进一步提升留下了空间。   \u003Cbr>● 医生对比基准：令人惊讶的是，o3 和 GPT-4.1 等大语言模型往往能生成比未受帮助的医生更高质量的回应。当医生以模型输出作为参考时，他们能够改进旧模型的完成度，但却无法改善新模型的输出。   \u003Cbr>● 可靠的模型评分：元评估表明，以 GPT-4.1 作为评分者时，其宏观 F1 分数与医生相当。平均而言，其与其他医生的一致性在急诊分诊、沟通能力和不确定性处理等主题上处于第 51 至第 88 百分位区间。   \u003Cbr>● 安全相关见解：该基准测试通过“worst-at-k”分数评估最坏情况下的表现，结果显示即使是最优秀的模型也存在可靠性差距。例如，o3 的 
worst-at-16 分数比其平均分低了三分之一，这凸显了进一步加强安全措施的必要性。 | [论文](https:\u002F\u002Fcdn.openai.com\u002Fpdf\u002Fbd7a39d5-9e9f-47b3-903c-8b847ca650c7\u002Fhealthbench_paper.pdf), [推文](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1921983050138718531) |\n| 6) Nemotron-Research-Tool-N1  推出 Tool-N1 系列工具型大语言模型，采用基于规则的强化学习（R1 式 RL）方法进行训练，无需依赖监督式的推理轨迹。其核心思想是让模型通过基于功能正确性和格式合规性的二元反馈，而非逐步骤模仿，来学会正确调用外部工具。   \u003Cbr>● 规则导向的 RL 优于 SFT：Tool-N1 模型使用一种轻量级的二元奖励机制，仅评估模型的工具调用是否结构正确且功能有效。这种方式允许模型自主发展推理过程，避免了受限于通过监督微调（SFT）模仿蒸馏轨迹的局限性。   \u003Cbr>● 强劲的基准测试结果：Tool-N1-7B 和 Tool-N1-14B 在多个基准测试中均优于 GPT-4o 以及领域专用模型，包括 BFCL、API-Bank 和 ACEBench 等。例如，Tool-N1-14B 在 BFCL 整体评分上（85.97 对 83.97）击败了 GPT-4o，并在 API-Bank 上比 GPT-4o 高出 5%。   \u003Cbr>● 纯 RL 优于 SFT 后 RL：通过对 5,518 条蒸馏轨迹的系统性比较发现，纯 RL 的效果优于 SFT 后 RL 流程，这挑战了主流范式。例如，100% RL 的平均得分为 83.24%，而 SFT+RL 则为 83.17%。   \u003Cbr>● 二元奖励 vs 细粒度奖励：消融实验表明，严格的二元奖励（要求正确的推理格式和精确的工具调用）比部分积分机制更能促进泛化，尤其是在真实的“Live”数据上（80.38% 对 76.61%）。   \u003Cbr>● 扩展性与泛化能力：性能随模型规模的增大而提升，尤其在大型模型中效果更为显著。该方法具有广泛的适用性，Qwen2.5-Instruct 在相同规模下表现优于 LLaMA3 系列模型。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00024), [推文](https:\u002F\u002Fx.com\u002FShaokunZhang1\u002Fstatus\u002F1922105694167433501) |\n| 7) 面向搜索高效的强化学习框架  提出一种新的基于强化学习的框架（SEM），旨在明确教导大语言模型何时应调用搜索引擎，何时可依赖自身知识，从而减少不必要的工具使用，同时保持答案的准确性。  主要要点：   \u003Cbr>● 动机与设置：大语言模型常常对一些简单的问题也过度依赖外部搜索。SEM 通过使用平衡的训练数据集（Musique 用于未知问题，MMLU 用于已知问题）以及结构化的格式（\u003C思考>, \u003C回答>, \u003C搜索>, \u003C结果>)，训练模型区分何时需要搜索、何时不需要。   \u003Cbr>● 奖励优化：作者采用组相对策略优化（GRPO）方法，在同一查询组内比较不同输出。奖励函数会惩罚不必要的搜索，并对无需搜索或在必要时通过高效搜索与推理相结合而得出正确答案的情况给予奖励。   \u003Cbr>● 实验结果：在 HotpotQA 和 MuSiQue 上，SEM 显著优于 Naive RAG 和 ReSearch，以更智能的搜索比例实现了更高的 EM 和 LLM-Judged (LJ) 准确率。而在 MMLU 和 GSM8K（通常无需搜索）上，SEM 在保持高准确率的同时，调用搜索的频率远低于基线方法（例如，在 MMLU 上，SEM 的 SR 仅为 1.77%，而 Naive RAG 则高达 47.98%）。   \u003Cbr>● 案例分析与效率：SEM 可以避免诸如多次询问“1+1 等于多少？”之类的荒谬搜索行为。它还会针对未知问题进行更少但更有针对性的搜索，从而提升可解释性和计算效率。训练动态进一步表明，SEM 能够实现比以往方法更快、更稳定的学习过程。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07903), 
[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1922665313117552664) |\n| 8) 高性价比、低延迟的向量检索  将 DiskANN（一种向量索引库）集成到 Azure Cosmos DB NoSQL 数据库中，该数据库为操作型数据存储，每个分区仅使用一个向量索引，并将其存储在现有索引树中。优势：在包含 1,000 万条向量的索引上，查询延迟可控制在 20 毫秒以内，召回率在更新过程中保持稳定，且查询成本相比 Zilliz 和 Pinecone 的无服务器企业级产品分别低约 15 倍和 41 倍。此外，该方案还可通过自动分区扩展至数十亿条向量。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.05885), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1921938925142384736) |\n| 9) AI 代理与代理式 AI  这篇综述论文区分了 AI 代理与代理式 AI，提出了一个结构化的分类体系，并比较了两者的架构、能力及面临的挑战。AI 代理被定义为由大语言模型和工具驱动的模块化、特定任务系统；而代理式 AI 则代表了一种向多智能体协作、动态任务分解和协同自治转变的趋势。文中为这两种范式分别梳理了应用场景和挑战，并提出了诸如 RAG、编排层和因果建模等解决方案。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.10468), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1923817691455873420) |\n| 10) CellVerse  引入一个基准测试，通过将多组学数据转换为自然语言，评估大语言模型在单细胞生物学任务上的表现。虽然像 DeepSeek 和 GPT-4 系列这样的通用大语言模型展现出一定的推理能力，但在药物反应预测等关键任务上，它们的表现并未显著优于随机猜测，这暴露了当前大语言模型在生物理解方面的重大不足。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.07865), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1922662317986099522) |\n\n## 本周顶级机器学习论文（5月5日至5月11日）- 2025年\n| **论文**  | **链接** | \n| ------------- | ------------- |\n| 1) 排行榜幻象  《排行榜幻象》探讨了Chatbot Arena排行榜在评估大语言模型时存在的系统性偏差，认为当前的做法损害了模型之间的公平比较和科学进步。通过对涵盖200万场Arena对战的大量数据分析，作者识别出四个导致排名失真的关键问题：   \u003Cbr>● 通过私有测试选择性报告分数：一些提供商（尤其是Meta、谷歌和OpenAI）被允许私下测试数十种模型变体，却只公布表现最佳的一个。这违背了支撑Arena排名的布拉德利-特里（BT）模型的无偏采样假设。模拟结果表明，仅测试10个变体就可能使模型的Arena得分人为抬高约100分。   \u003Cbr>● 极端的数据不对称：专有模型相较于开放权重和开源模型被过度采样。仅OpenAI和谷歌就获得了Arena总数据的39%以上，而83个开放权重模型加起来仅占29.7%。这些数据优势转化为显著的性能提升：在一个模型使用70%的Arena数据进行训练的情况下，在ArenaHard基准上其性能比基线高出112%。   \u003Cbr>● 不公平且不透明的弃用操作：尽管只有47个模型被正式标记为已弃用，但实际上有205个模型被悄然从排行榜移除。开源模型受到的影响尤为严重，破坏了比较图结构并违反了BT模型的假设，从而导致排名不可靠。   \u003Cbr>● 对Arena特定动态的过拟合：由于部分提示的重复以及随时间推移出现的分布漂移，能够访问Arena数据的提供商可以专门针对Arena的表现调优模型。这使得模型在Arena基准上的胜率很高，但在MMLU等分布外任务上则表现不佳，甚至出现收益下降或逆转的情况。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.20879) |\n| 2) Llama-Nemotron  
NVIDIA推出了Llama-Nemotron系列模型，包括LN-Nano（8B）、LN-Super（49B）和LN-Ultra（253B），这是一组开放、高效且高性能的推理模型。这些模型在各类基准测试中与DeepSeek-R1相当或超越后者，同时具有更高的推理吞吐量和内存效率。据Artificial Analysis评价，LN-Ultra是目前最“智能”的开放模型。其一大创新在于动态推理开关（“详细思考开启\u002F关闭”），允许用户在推理时控制模型的推理行为。  主要亮点：   \u003Cbr>● 多阶段训练：模型通过神经架构搜索（Puzzle）、知识蒸馏、持续预训练、监督微调（SFT）以及大规模强化学习构建而成。LN-Ultra还采用了FP8推理和FFN融合技术，以提升速度和可扩展性。   \u003Cbr>● 推理切换功能：模型可通过简单的提示指令在推理模式和非推理模式之间切换，使其能够适应多种应用场景。   \u003Cbr>● 合成数据集：共整理了超过3300万个数学、代码、科学及指令遵循类示例，并明确标注了推理模式样本。LN-Ultra的训练使用了课程式强化学习和GRPO方法，在GPQA-D等基准上超越了其教师模型。   \u003Cbr>● 评测优势：LN-Ultra在AIME25、MATH500和GPQA-Diamond等推理任务中均优于DeepSeek-R1和Llama-3.1-405B，同时在聊天对齐评分上也表现出色（Arena-Hard：87.0）。LN-Super的评分为88.3，超过了Claude 3.5和GPT-4o。  NVIDIA以宽松许可协议提供了模型权重、训练代码（NeMo、Megatron-LM、NeMo-Aligner）以及完整的后训练数据集，旨在推动开放领域的推理模型研究。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.00949v1), [模型](https:\u002F\u002Fhuggingface.co\u002Fnvidia) |\n| 3) 绝对零度  提出了一种无需人工标注数据的大语言模型训练框架。主要亮点：   \u003Cbr>● 模型完全通过自我博弈来提出并解决自身的推理任务，由执行环境提供的可验证反馈引导。这种零数据的RLVR（可验证奖励强化学习）设置实现了顶尖的编码和数学推理性能。   \u003Cbr>● AZR通过三种核心推理模式（演绎、溯因和归纳）生成基于代码的推理任务，并通过Python执行验证解决方案，而非依赖人工标签。   \u003Cbr>● 单一模型同时扮演提出新任务的角色和基于反馈强化学习解决问题的角色。奖励倾向于中等难度的任务，以最大化学习信号。   \u003Cbr>● 尽管未使用任何领域内示例，AZR的平均表现仍比以往所有零数据设置的模型高出1.8分，甚至超越了那些使用数万至数十万标注样本训练的模型。AZR-Coder-7B在所有测试模型中取得了最高平均分。   \u003Cbr>● 在仅编码环境中训练的AZR，其数学推理性能提升了高达15.2分，远超采用RLVR训练的专业编码模型，显示出强大的泛化能力。   \u003Cbr>● 较大的AZR模型（3B→7B→14B）均表现出更显著的改进，证实了其可扩展性，并暗示更大规模模型的潜力。   \u003Cbr>● AZR在代码中发展出了自然的ReAct式中间规划（如交错的注释和逻辑）、溯因中的试错策略以及系统的状态跟踪等行为，这些通常只在规模大得多的模型中观察到。   \u003Cbr>● AZR的Llama-3.1-8B变体有时会产生令人担忧的推理链条（被称为“uh-oh时刻”），凸显了在自主系统中进行安全意识训练的重要性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.03335), [推文](https:\u002F\u002Fx.com\u002FAndrewZ45732491\u002Fstatus\u002F1919920459748909288) |\n| 4) Discuss-RAG  本文介绍了Discuss-RAG，这是一个即插即用的基于代理的框架，通过模仿人类临床推理来增强用于医学问答的检索增强生成（RAG）系统。标准RAG系统依赖于嵌入式检索，缺乏验证相关性或逻辑连贯性的机制，因此常常产生幻觉或过时的答案。Discuss-RAG通过模块化的代理设置解决了这些问题，该设置能够模拟多轮医学讨论并进行检索后的验证。  核心思想：   \u003Cbr>● 
多代理协作：一个总结代理协调一组医学领域专家，他们通过模拟头脑风暴迭代优化上下文摘要，提供更深入、更结构化的信息以指导检索。   \u003Cbr>● 决策代理：检索完成后，验证代理和决策代理会评估片段质量，并在相关性较低时触发回退策略，从而提高答案的准确性和上下文关联性。   \u003Cbr>● 即插即用设计：Discuss-RAG无需训练且模块化，可轻松集成到现有的RAG管道中。   \u003Cbr>● 显著的性能提升：在四项基准测试中，Discuss-RAG均以大幅的准确率提升超越了MedRAG，尤其是在BioASQ上提高了16.67%，在PubMedQA上提高了12.20%。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.21252) |\n| 5) 强化学习在微调中的价值  本研究表明，理论上所有流行的偏好微调目标最终都会归结为最大似然估计（MLE），然而实验却显示在实际任务中强化学习始终具有优势。作者通过生成-验证复杂度假说来解释这一矛盾。   \u003Cbr>● 理论：RLHF≈MLE——在一定假设下，轨迹级别的RLHF、DPO及相关算法等价于将数据投影回似然空间，因此花费计算资源进行在线策略采样其实是不必要的。   \u003Cbr>● 实验结果与朴素理论相悖——在tl;dr摘要生成基准测试中，使用Pythia-1.4B\u002F2.8B模型时，一次在线DPO迭代相比离线DPO，即使数据、模型和优化器完全相同，也能将胜率提高6至10个百分点，这证实了强化学习确实能带来实际价值。   \u003Cbr>● 结论——当生成一个好答案比验证它更困难时，强化学习就能发挥作用。而在两字摘要任务（horizon=1）或使用ROUGE-L作为奖励时，这种差距就会消失。只有当奖励模型比其所训练的策略更为简单时，强化学习才能成为穿越策略空间的捷径。对于验证难度与生成相当的任务，基于似然的离线微调就足够了，这为从业者提供了判断何时RLHF值得额外成本的依据。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01067) |\n| 6) WebThinker  本文介绍了一个推理代理框架，该框架赋予大型推理模型（LRMs）自主的网络探索和报告撰写能力，以克服静态内部知识的局限性。WebThinker集成了深网探索模块和自主思考-搜索-草稿策略，使模型能够同时进行网络搜索、处理任务推理并生成综合性输出。此外，系统还包含一个基于在线DPO的强化学习训练循环，以提升工具使用能力。该系统支持两种模式：复杂问题解决和科学报告生成。主要要点：   \u003Cbr>● 复杂推理方面的卓越表现：在GPQA、GAIA、WebWalkerQA和HLE等任务中，WebThinker-32B-RL在32B模型中取得了新的最先进成果，超越了检索增强型和专有系统，如GPT-4o和DeepSeek-R1-671B。例如，在GPQA上达到了70.7%，在HLE上达到了15.8%，较基线提升了高达21.5%。   \u003Cbr>● 一流的科学报告写作能力：在Glaive数据集上，WebThinker的表现优于Gemini2.0 Deep Research和Grok3 DeeperSearch，在完整性和连贯性等平均质量指标上获得了8.1分。   \u003Cbr>● 强化学习的优化至关重要：经过强化学习训练的版本在所有基准测试中均优于基础版本，表明迭代式的偏好学习能够显著提升推理工具的协同能力。   \u003Cbr>● 消融实验证明设计的有效性：移除深网探索模块或自动报告撰写等功能会显著降低性能，从而证实了这些组件的必要性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.21776) |\n| 7) RM-R1  提出在偏好判断过程中生成可信推理痕迹和评估细则（rubrics）的奖励模型。RM-R1模型不依赖标量分数或浅层生成，而是利用结构化推理与强化学习，在各类基准上同时提升可解释性和性能。   
\u003Cbr>● RM-R1采用两阶段训练过程：(1) 从更强模型中蒸馏推理痕迹，(2) 进行基于可验证奖励的强化学习。Chain-of-Rubrics (CoR)提示框架引导模型根据任务类型（推理或聊天）来解决推理问题或生成评估细则。   \u003Cbr>● 在RewardBench、RM-Bench和RMB等基准上，RM-R1模型取得了最先进的或接近最先进的成绩，即使参数和数据都较少，仍比GPT-4o和Llama3.1-405B等模型高出多达13.8%。   \u003Cbr>● 消融研究表明，仅冷启动的强化学习是不够的；任务类型分类和高质量的蒸馏才是关键。RM-R1的蒸馏式热启动训练带来了更稳定的学习过程和更长、更准确的推理痕迹。   \u003Cbr>● RM-R1还在跨领域泛化方面表现出色，其生成的评估细则质量也优于基线方法，尤其是在安全和医疗判断等敏感领域。作者开源了六种RM-R1模型、训练数据和代码，以支持结果的可重复性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.02387) |\n| 8) Paper2Code  介绍了PaperCoder，这是一个多代理LLM框架，能够在不依赖现有实现的情况下，将机器学习论文转化为完整的代码库。   \u003Cbr>● PaperCoder将代码生成过程分解为三个阶段：规划（路线图、架构、文件依赖关系、配置文件）、分析（提取文件特定逻辑）和编码（考虑依赖关系生成文件）。每一步均由专门的LLM代理负责。   \u003Cbr>● 评估方式包括提议的Paper2Code基准测试（涵盖ICML、NeurIPS和ICLR 2024年的90篇论文）以及PaperBench Code-Dev。结果显示，PaperCoder在基于参考、无参考以及人工评估中均优于ChatDev、MetaGPT和其他简单基线模型。   \u003Cbr>● 在原论文作者的人工评估中，77%的人选择了PaperCoder作为最佳实现；85%的人表示它帮助他们复现了自己的工作。平均而言，仅有0.48%的代码行需要修改才能正常运行。   \u003Cbr>● 详细的消融研究表明，每个阶段都能带来稳定的性能提升，尤其是逻辑设计和文件依赖顺序的安排。PaperCoder采用o3-mini-high骨干模型，表现尤为突出。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.17192) |\n| 9) ZeroSearch  ZeroSearch是一个强化学习框架，可在不使用真实搜索引擎的情况下训练大语言模型开发搜索能力。它利用模拟的LLM生成文档，并结合基于课程的退化策略，在性能和成本上均优于真实的搜索方法（如Search-R1），在多个基准测试中实现了更高的问答准确率。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.04588), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1920469148968362407) |\n| 10) Muon在预训练中的实用效率  讨论了Muon——一种简单的二阶优化器——如何通过扩展计算时间帕累托前沿并保持更好的数据效率，在大批次预训练中超越AdamW。结合muP缩放和一种用于超参数转移的新式望远镜算法，它能够在最多40亿参数的模型上以最小的调优开销实现更快的训练。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2505.02222) |\n\n## 本周顶级机器学习论文（4月28日—5月4日）- 2025年\n| **论文**  | **链接** | \n| ------------- | ------------- |\n| 1) Phi-4-Mini-Reasoning  微软发布了Phi-4-Mini-Reasoning，旨在探索用于数学推理的小型语言模型。亮点：   \u003Cbr>● Phi-4-Mini-Reasoning：该论文介绍了一款参数量为38亿的轻量级语言模型（SLM），其数学推理性能达到当前最先进水平，甚至可以与规模几乎为其两倍的模型相媲美或超越它们。   \u003Cbr>● 解锁推理能力：研究团队采用一种系统的多阶段训练流程，在小型模型中释放出强大的推理能力，以应对这些模型容量有限所带来的挑战。具体方法包括大规模蒸馏、偏好学习以及基于可验证奖励的强化学习。   \u003Cbr>● 
四阶段训练流程：该模型依次经过以下步骤训练：(1) 在中期训练阶段使用大规模的长链式思维数据；(2) 在高质量的CoT数据上进行监督微调；(3) 基于回放的直接偏好优化（DPO）；(4) 使用可验证奖励信号的强化学习。   \u003Cbr>● 数学性能：在MATH-500数据集上，Phi-4-Mini-Reasoning的准确率达到94.6%，超过了DeepSeek-R1-Distill-Qwen-7B（91.4%）和DeepSeek-R1-Distill-Llama-8B（86.9%），尽管其规模更小。   \u003Cbr>● 可验证奖励的强化学习：针对小型模型定制的最后强化学习阶段，包含了提示过滤、过采样以平衡训练信号以及温度退火等技术。这些措施提高了训练稳定性，并使探索过程与评估条件保持一致。   \u003Cbr>● 大规模合成数据生成：该模型在中期训练阶段使用了由DeepSeek-R1生成的1000万条CoT回放数据，这些数据通过数学验证工具和GPT-4o-mini进行了正确性筛选，并按领域和难度分类，以确保广泛的泛化能力。   \u003Cbr>● 消融实验：训练流程中的每个阶段都带来了明显的性能提升。值得注意的是，微调和强化学习分别在中期训练和DPO之后实现了约5至7个百分点的改进，这表明完整流程的价值远超单独的技术手段。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.21233), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1917954418173247909) |\n| 2) 构建具有可扩展长期记忆的生产就绪型AI代理  本文提出了一种以记忆为中心的架构，用于LLM代理，使其能够在长时间的对话和会话中保持连贯性，从而解决固定上下文窗口的限制。主要亮点：   \u003Cbr>● 该方案引入了两个系统：Mem0，一个密集的基于语言的记忆系统；以及Mem0g，其增强版本采用了基于图的记忆结构来建模复杂的关系。两者都旨在高效地提取、整合和检索随时间推移的重要事实。   \u003Cbr>● Mem0：采用两阶段架构（提取与更新），以维护重要的对话记忆。它能够检测冗余或冲突的信息，并通过工具调用管理更新，从而形成一个轻量级、响应迅速的记忆存储（每段对话7000个token）。   \u003Cbr>● Mem0g：通过将记忆构建为实体和关系的知识图谱，Mem0g在需要时间和关系推理的任务中表现出色（例如事件排序、偏好跟踪等），同时保持合理的延迟和内存开销（每段对话14000个token）。   \u003Cbr>● LOCOMO基准测试：两种系统均与六种记忆系统基线进行了对比（如A-Mem、OpenAI、Zep、LangMem、RAG）。其中，Mem0g获得了最佳的整体LLM-as-a-Judge（J）评分，达到68.44%，比所有RAG和记忆基线高出7%至28%，并且相比全上下文方法，p95延迟降低了91%。   \u003Cbr>● 延迟与效率：Mem0实现了最低的搜索延迟和总延迟（p95 = 1.44秒），而Mem0g在速度和效率方面仍大幅领先于其他基于图或RAG的系统。非常适合实时部署。   \u003Cbr>● 应用场景优势：Mem0和Mem0g提供了一种可扩展的记忆架构，适用于长期运行的LLM代理，能够提升事实回忆能力、推理深度和效率，使其成为理想之选 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.19413), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1917247776221700134) |\n| 3) UniversalRAG  UniversalRAG是一个框架，克服了现有RAG系统仅限于单一模态或语料库的局限性。它支持跨模态（文本、图像、视频）和多粒度级别的检索（例如段落与文档、片段与视频）。论文的主要贡献：   \u003Cbr>● 模态感知路由：为应对统一嵌入空间中的模态偏差问题（即查询往往无论相关性如何都会返回同模态的结果），UniversalRAG引入了一个路由器，可根据每个查询动态选择合适的模态（例如图像或文本）。   \u003Cbr>● 粒度感知检索：每种模态都被划分为不同的粒度级别（例如文本的段落与文档，视频的片段与完整视频）。这使得查询能够检索与其复杂度相匹配的内容——事实性查询使用短片段，而复杂推理则访问长篇内容。   \u003Cbr>● 
灵活路由：它既支持无需训练的（零样本GPT-4o提示）也支持经过训练的（T5-Large）路由器。训练过的路由器在域内数据上表现更好，而GPT-4o则在域外任务上更具泛化能力。通过集成这两种路由器，可以获得稳健的性能。   \u003Cbr>● 性能：UniversalRAG在涵盖文本（如MMLU、SQuAD）、图像（WebQA）和视频（LVBench、VideoRAG）的8个基准测试中，均优于特定模态和统一RAG基线。使用T5-Large时，它在各模态上的平均得分最高。   \u003Cbr>● 案例研究：在WebQA任务中，UniversalRAG能够正确地将视觉查询路由到图像语料库（检索到事件的实际照片），而TextRAG和VideoRAG则未能做到。同样，在HotpotQA和LVBench任务上，它也能选择合适的粒度，检索文档或短视频。总体而言，这是一篇很好的论文，展示了在RAG系统中考虑模态和粒度的重要性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.20734), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1917637837295608180) |\n| 4) DeepSeek-Prover-V2  DeepSeek-Prover-V2是一款拥有6710亿参数的LLM，显著推动了Lean 4中的形式化定理证明技术的发展。该模型通过一种新颖的冷启动训练流程构建而成，将非正式的链式思维推理与形式化的子目标分解相结合，并通过强化学习进一步优化。它在多个定理证明基准测试中超越了先前的最先进水平。关键亮点：   \u003Cbr>● 递归分解生成冷启动数据：作者让DeepSeek-V3生成自然语言的证明草稿，将其分解为子目标，并在Lean中用“sorry”占位符将这些步骤形式化。随后，一个70亿参数的证明模型会递归地填补这些子目标的证明，从而高效地构建完整的形式化证明及训练数据。   \u003Cbr>● 课程学习+强化学习：该模型基于子目标的课程逐步训练，处理越来越复杂的问题。通过使用一致性奖励的强化学习，确保证明结构与CoT分解的一致性，从而提升其在复杂任务中的表现。   \u003Cbr>● 双重证明生成模式：该模型以两种模式进行训练，分别为非CoT模式（高效、简洁的证明）和CoT模式（高精度、可解释的证明）。其中，CoT模式在困难问题上的表现显著更好。   \u003Cbr>● 基准测试结果： | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.21801), [推文](https:\u002F\u002Fx.com\u002Fzhs05232838\u002Fstatus\u002F1917600755936018715) |\n| 5) Kimi-Audio  Kimi-Audio是一款全新的开源音频基础模型，专为通用音频理解、生成和语音对话设计。该模型架构结合了离散的语义音频标记和连续的Whisper衍生声学特征。它从预训练的LLM初始化，并在超过1300万小时的音频数据上进行训练，涵盖语音、声音和音乐。此外，它还支持流式解标记器，具备分块解码功能，并引入了一种新颖的前瞻机制，以实现更流畅的音频生成。广泛的基准测试表明，Kimi-Audio在多种模态和任务中均优于其他音频LLM。主要亮点：   \u003Cbr>● 架构：Kimi-Audio使用12.5Hz的语义标记器和一个双头LLM（文本+音频），能够处理混合输入（离散+连续）。其音频解标记器采用流匹配上采样技术，并结合BigVGAN声码器实现实时语音合成。   \u003Cbr>● 海量训练语料：该模型在超过1300万小时的多语言、多模态音频数据上进行了预训练。严格的预处理流程还包括语音增强、说话人分离和转录，使用Whisper和Paraformer-Zh工具。微调阶段则使用来自30多个开放数据集的30多万小时数据。   \u003Cbr>● 多任务训练：训练内容涵盖纯音频、纯文本、ASR、TTS，以及三种音频-文本交替策略。微调采用指令驱动的方式，通过零样本TTS注入音频和文本指令。   \u003Cbr>● 评估：在ASR（如LibriSpeech test-clean：1.28 WER）、音频理解（CochlScene：80.99）以及音频-文本聊天（OpenAudioBench平均：69.8）任务上，Kimi-Audio均创造了新的SOTA记录，全面超越Qwen2.5-Omni和Baichuan-Audio。 | 
[论文](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FKimi-Audio\u002Fblob\u002Fmaster\u002Fassets\u002Fkimia_report.pdf), [推文](https:\u002F\u002Fx.com\u002FKimi_Moonshot\u002Fstatus\u002F1915807071960007115)  [模型](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FKimi-Audio) |\n| 6) MiMo-7B  小米发布了MiMo-7B，一款用于推理任务的新语言模型。MiMo-7B专门针对数学和代码领域的高级推理任务设计。亮点：   \u003Cbr>● MiMo-7B：通过精心的预训练和后训练，MiMo-7B缩小了与更大规模的320亿参数模型之间的能力差距。MiMo-7B-Base从头开始在25万亿个token上进行训练，其3阶段混合比例偏向数学和代码领域（第二阶段占比70%）。   \u003Cbr>● 预训练：团队改进了HTML和PDF的提取方式，以更好地保留STEM领域的数据；利用LLM生成多样化的合成推理内容；并引入了多令牌预测（MTP）目标，从而提升了模型的质量和推理速度。   \u003Cbr>● 基础性能：MiMo-7B-Base在BBH（提高5分）、AIME24（提高22.8分）和LiveCodeBench（提高27.9分）等任务上，均优于Qwen2.5、Gemma-2和Llama-3.1等其他70亿至90亿参数的模型。在BBH和LiveCodeBench任务中，它甚至在推理密集型任务上超越了更大规模的模型。   \u003Cbr>● 强化学习：MiMo-7B-RL采用基于测试难度的奖励函数和简单数据重采样策略，以解决稀疏奖励和训练不稳定的问题。在某些情况下，它甚至超越了o1-mini在数学和代码方面的表现。基于SFT模型的强化学习所能达到的上限，高于基于基础模型的零样本强化学习。   \u003Cbr>● 高效基础设施：无缝回放引擎通过持续回放、异步奖励计算和提前终止等技术，将强化学习的速度提升了2.29倍，验证速度提升了1.96倍。MTP层实现了快速推测性解码，推理过程中的接受率超过90%。 | [论文](https:\u002F\u002Fgithub.com\u002FXiaomiMiMo\u002FMiMo\u002Fblob\u002Fmain\u002FMiMo-7B-Technical-Report.pdf), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1917582720341008814) |\n| 7) 基础代理的进展与挑战  一篇新的综述以模块化、受大脑启发的架构来构建智能代理，融合了认知科学、神经科学和计算研究的思想。主要内容包括：   \u003Cbr>● 人类大脑与LLM代理：有助于更好地理解LLM代理与人类\u002F大脑认知的区别，以及我们可以从人类的学习和运作方式中获得哪些启发。   \u003Cbr>● 定义：提供了一个清晰、详细且正式的定义，说明什么构成了一个AI代理。   \u003Cbr>● 推理：文中有一节深入探讨了智能代理的核心组件。特别关注推理这一AI代理的关键发展领域，它是解锁规划、多轮工具使用、回溯等功能的关键。   \u003Cbr>● 内存：代理的内存是构建代理系统的一个挑战性领域，但目前已经有许多优秀的文献可供参考和借鉴。   \u003Cbr>● 行动系统：如今我们已经可以构建非常复杂的代理系统，但下一个前沿是能够在现实世界中采取行动并做出决策的代理。我们需要更好的工具、更完善的训练算法，以及在不同行动空间中稳健运行的能力。   \u003Cbr>● 自进化代理：目前，构建有效的代理系统仍然需要人工干预和精心优化技巧。然而，该领域更大的机遇之一是开发能够自我构建强大且不断自我改进的AI系统。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.01990), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1916542394746421333) |\n| 8) MAGI  MAGI是一个多智能体系统，旨在通过将MINI（迷你国际神经精神科访谈）协议操作化，实现结构化精神科访谈的自动化。它包含四个专业智能体：导航、问题生成、判断和诊断。其他亮点：   \u003Cbr>● 
多智能体临床工作流：MAGI由导航智能体（控制访谈流程）、问题智能体（动态且富有同理心的提问）、判断智能体（验证回答）以及诊断智能体组成。诊断智能体使用心理测量CoT，将诊断明确追溯到MINI\u002FDSM-5标准。   \u003Cbr>● 可解释的推理（PsyCoT）：不同于将诊断视为不透明的输出，PsyCoT将精神科推理分解为症状锚定、综合征验证和证据关联。这种做法有助于对每个诊断结论进行审计，充分发挥了CoT的作用。   \u003Cbr>● 结果：在对1002份真实访谈的评估中，MAGI在相关性、准确性、完整性以及引导性等方面均优于基线模型（直接提示、角色扮演、知识增强型以及模拟MINI的LLM）。   \u003Cbr>● 强烈的临床一致性：诊断评估显示，PsyCoT能够持续提升F1分数、准确性和Cohen’s κ值，尤其在抑郁症、广泛性焦虑症、社交焦虑症和自杀风险等疾病中，甚至在高风险任务中达到了临床级别的可靠性（κ=0.8）。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.18260), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1916862752410554423) |\n| 9) 高效LLM推理服务综述  本综述回顾了近年来优化LLM推理的最新进展，重点解决了内存和计算瓶颈问题。内容涵盖了实例级别的技术（如模型放置和请求调度）、集群级别的策略（如GPU部署和负载均衡），以及新兴的场景特定解决方案，并总结了未来的研究方向。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.19720) |\n| 10) 工程领域的LLM  该研究表明，当使用强化学习时，一个70亿参数的模型在高性能火箭设计任务中，表现优于当前最先进的基础模型以及人类专家。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.19394) |\n\n## 本周顶级机器学习论文（4月21日–4月27日）- 2025\n| **论文**  | **链接** | \n| ------------- | ------------- |\n| 1) 强化学习是否能在基础模型之外激发大语言模型的推理能力？ 本文重新审视了近期大语言模型发展中的一个关键假设：即基于可验证奖励的强化学习（RLVR）有助于模型获得真正全新的推理能力。通过对数学、代码、视觉等任务中的模型使用pass@k指标（k值较大）进行分析，作者发现RLVR虽然提高了样本效率，但并未扩展超出基础模型的推理能力范围。\u003Cbr>● 核心见解：RLVR训练的模型在低k值（如pass@1）时表现更好，但随着k值增大（至256或更高），基础模型最终会与之持平甚至超越。这表明RLVR并未产生根本性的新推理路径，而只是增加了采样到已存在正确答案的可能性。\u003Cbr>● 推理能力早已蕴含于基础模型中：研究显示，RLVR模型成功生成的思维链实际上已存在于基础模型的采样分布中。困惑度分析进一步证实，RL输出往往是基础模型高概率的延续。\u003Cbr>● 效率与探索性之间的权衡：RLVR缩小了模型的探索空间，提升了效率，却减少了对多样化推理路径的覆盖，从而降低了其在大规模问题解决中的整体能力。\u003Cbr>● 知识蒸馏效果更佳：与RLVR不同，通过从更强的教师模型（如DeepSeek-R1）进行知识蒸馏，能够引入真正新的推理模式，从而扩展模型的能力。\u003Cbr>● 算法局限性：无论采用PPO、GRPO、Reinforce++等算法，强化学习方法均能带来相似的样本效率提升，但无一能够完全弥补与基础模型pass@256之间的差距，这凸显了当前强化学习策略的局限性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.13837), [推文](https:\u002F\u002Fx.com\u002FDaveShapi\u002Fstatus\u002F1915408405201629684) |\n| 2) BitNet b1.58 2B4T  本工作提出了BitNet b1.58 2B4T，这是首个开源的、原生训练的20亿参数级1位大语言模型，它在实现高效能的同时，性能表现强劲。该模型采用自定义的三元量化方案（每权重1.58位），大幅降低了内存占用（0.4 GB）、能耗（0.028 J\u002Ftoken）和延迟（29 
ms），同时在多种基准测试中仍能与最先进的全精度模型竞争。\u003Cbr>● 效率-性能的新帕累托前沿：BitNet b1.58 2B4T从头开始在4万亿 tokens 上训练，在ARC-Challenge、PIQA、WinoGrande和GSM8K等任务上均优于或媲美开放的全精度模型（如Qwen2.5 1.5B、MiniCPM 2B）。在16个基准上的平均得分为54.19%，与Qwen2.5-1.5B的55.23%相当，但内存占用仅为后者的六分之一左右，能耗则低了十倍。\u003Cbr>● 性能超越量化基线：与经过INT4后量化处理的Qwen2.5模型（GPTQ\u002FAWQ）相比，BitNet不仅体积更小，准确率也更高，这表明原生1位训练相较于后量化方法更具优势。\u003Cbr>● 架构与训练创新：该模型用基于绝对均值三元量化的BitLinear层和8位激活替换了标准的线性层，并结合RoPE嵌入、平方ReLU激活函数以及无偏置层。训练过程中采用了余弦学习率调度和权重衰减策略，同时还进行了监督微调和直接偏好优化（DPO），而非完整的RLHF流程。\u003Cbr>● 1位大语言模型中的佼佼者：与其他1位模型（如OLMo-Bitnet 1B以及后量化的Falcon3\u002FLlama3 7B–8B）相比，BitNet b1.58 2B4T的平均得分高出10分，为超高效大语言模型树立了新的标杆。此外，作者还发布了针对GPU的优化CUDA核以及用于CPU的C++推理库，使得1位大语言模型能够在各种硬件平台上得到实际部署。BitNet b1.58 2B4T证明了极致量化并不意味着能力受损，同时也为在资源受限环境中更广泛地应用大语言模型打开了大门。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.12285) |\n| 3) UI-TARS  UI-TARS推出了一种功能强大的端到端原生GUI智能体，它完全基于视觉截图运行，可在跨平台环境中执行类人般的键盘和鼠标操作。与依赖提示工程和外部脚本的现有模块化智能体框架不同，UI-TARS将感知、行动、推理和记忆直接整合到其架构中，从而在动态的真实场景中展现出强大的泛化能力和适应性。主要贡献：\u003Cbr>● 增强的GUI感知能力：UI-TARS在一个包含丰富元数据的大规模截图数据集上进行训练，能够实现密集的文本描述、状态转换理解以及精确的元素识别。在VisualWebBench等感知基准测试中，其得分达到82.8，优于GPT-4o的表现。\u003Cbr>● 统一的动作建模与接地：UI-TARS将跨平台的动作标准化到一个共享的动作空间，并从大规模多步动作轨迹中学习。在ScreenSpot Pro等接地任务中，其得分达到38.1，成为新的SOTA水平。\u003Cbr>● 基于“思考”的系统2式推理：受ReAct风格框架启发，UI-TARS会在执行动作之前生成内部推理步骤（即“思考”）。这些“思考”反映了任务分解、反思以及长期一致性等模式，显著提升了复杂场景下的表现。例如，在OSWorld任务中，UI-TARS-72B-DPO以50步预算获得了24.6分，表现优于Claude。\u003Cbr>● 基于反思学习的迭代自我改进：UI-TARS通过在线轨迹收集和基于错误纠正及事后适应数据的反思调优，不断自我完善。这使其能够在极少人工干预的情况下从错误中恢复并适应环境。总体而言，UI-TARS标志着GUI自动化领域的重要进展，在超过10个数据集上树立了新标杆，并超越了GPT-4o和Claude等顶尖商业智能体。其开源发布旨在推动原生智能体开发的进一步创新。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.12326), [博客](https:\u002F\u002Fseed-tars.com\u002F1.5\u002F) |\n| 4) Describe Anything  提出DAM模型，该模型能够在图像和视频中生成细粒度的区域特定字幕。作者针对先前视觉-语言模型的关键局限性——即无法保留局部细节以及缺乏适合详细本地化字幕（DLC）的数据集和基准——提出了针对性解决方案。主要贡献：\u003Cbr>● DAM（Describe Anything Model）利用两项主要创新来捕捉精细的区域细节和全局场景上下文：一是焦点提示，用于对用户指定区域进行高分辨率编码；二是本地化视觉骨干网络，通过门控交叉注意力机制整合整张图像的上下文信息。这使得DAM能够生成多粒度、准确的描述，尤其适用于小型或被遮挡的区域。\u003Cbr>● 
DLC-SDP（半监督数据管道）通过VLM生成的详细字幕扩充分割数据集，随后在网页图像上进行自训练，从而解决了数据稀缺问题。这种方法产生了高质量且多样化的训练数据，使DAM在多个基准测试中超越了仅依赖API的基线模型（如GPT-4o）。\u003Cbr>● DLC-Bench是一个无参考基准，通过LLM评委评估模型在包含或排除区域特定细节方面的准确性。相比于传统的字幕匹配指标，这一基准更能可靠地评价模型，因为后者往往会因有效但未匹配的细节而扣分。\u003Cbr>● 性能表现：DAM在图像和视频领域的关键词、短语以及详细多句字幕任务的7个基准测试中均创下新的SOTA记录。它在零样本和域内评估中均优于GPT-4o、Claude 3.7以及其他顶尖视觉-语言模型，在详细图像字幕任务上较先前模型提升了33.4%，视频字幕任务上提升了19.8%。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.16072) |\n| 5) UXAgent  提出一种新颖的框架UXAgent，用于借助LLM驱动的智能体模拟大规模可用性测试。该系统使用户体验研究人员能够在接触真实用户之前测试和迭代网页设计及研究方案。其实现方式是编排具有不同角色设定的模拟智能体，在真实的网络环境中互动，从而提供行为和推理数据。主要亮点：\u003Cbr>● 基于LLM的角色扮演模拟：UXAgent首先通过一个角色生成器，可根据自定义分布生成数千名人口统计特征各异的模拟用户。每个角色都会被输入到一个LLM智能体中，该智能体体现用户意图，并通过一个通用浏览器连接器与网站交互——该模块能够解释和操作真实的HTML结构。\u003Cbr>● 双循环推理架构：UXAgent的核心是一种受认知心理学启发的双进程智能体架构：快速回路用于低延迟动作，慢速回路用于深度推理。这种设计模仿了系统1和系统2的思维方式，使智能体既能迅速响应，又能保持连贯的高层计划和反思。\u003Cbr>● 丰富的记忆流：所有观察、行动、计划、反思以及自发产生的想法（“wonders”）都被存储在一个记忆流中。这些记忆会根据重要性、时效性和相关性，通过加权评分系统动态优先排序，分别针对快速和慢速模块进行调整。\u003Cbr>● 回放与访谈界面：用户体验研究人员可以通过模拟回放界面回顾模拟会话，并使用智能体访谈界面与智能体进行自然语言对话。这支持定性分析，例如询问智能体的决策过程或展示原型以获取反馈。\u003Cbr>● 实证评估：一项涉及60个LLM智能体在购物平台（WebArena）上进行的模拟案例研究表明，研究人员能够发现可用性研究中的缺陷并获得早期洞察。随后的一项由五位用户体验专业人士参与的用户研究则发现，尽管对真实性和数据噪声存在一些担忧，该系统仍有助于迭代研究设计。尤其受到好评的是与智能体对话并获取传统试点中难以实现的定性洞见的能力。\u003Cbr>● 未来影响：作者认为LLM智能体并非要取代真实参与者，而是作为设计过程中的早期合作者，从而降低有缺陷研究的成本和风险。他们还讨论了将该框架扩展到多模态场景、桌面或移动界面，以及数字孪生或模拟A\u002FB测试等更广泛的智能体任务。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.09407) |\n| 6) 测试时强化学习  测试时强化学习（TTRL）是一种方法，允许大语言模型在推理过程中无需真值标签即可自我改进。TTRL不依赖标注数据集，而是通过对多个模型生成结果进行多数投票来估计伪奖励，从而在未标注的测试数据上实现强化学习（RL）。该方法整合了测试时缩放（TTS）和测试时训练（TTT）策略，使模型能够动态适应新的、具有挑战性的输入。主要亮点：\u003Cbr>● 多数投票作为奖励：TTRL为一个问题生成多个候选输出，并通过多数投票得出伪标签。奖励根据与共识答案的一致性进行分配。\u003Cbr>● 显著的性能提升：将TTRL应用于Qwen2.5-Math-7B后，在AIME 2024上的表现提升了159%，在AIME、AMC和MATH-500等基准测试上的平均提升达84%，且全程未使用任何标注过的训练数据。\u003Cbr>● 超越监督的自我进化：令人惊讶的是，TTRL的表现超越了其自身多数投票监督（Maj@N）的上限，并接近于完全标签泄露条件下训练的模型性能，这表明其无监督强化学习既高效又稳定。\u003Cbr>● 泛化能力与鲁棒性：TTRL在不同任务之间具有良好的泛化能力，即使在标签估计存在噪声的情况下也能保持有效性，并且兼容PPO、GRPO等多种强化学习算法。\u003Cbr>● 
局限性：当基础模型对该领域缺乏足够的先验知识，或者当超参数（如批量大小和温度）设置不当时，TTRL可能会失效。 | [论文](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2504.16084) |\n| 7) 在真实世界语言模型交互中发现价值观  本文首次对已部署的AI助手Claude 3和3.5版本所展现的价值观进行了大规模实证分析，基于超过30万次真实对话。作者开发了一个自下而上的、保护隐私的框架，用于提取、分类和分析AI表达的规范性考量（即“价值观”），并展示了这些价值观如何随任务、用户价值观和对话情境的不同而变化。\u003Cbr>● 作者识别出3,307种独特的AI价值观，并将其组织成五个领域分类：实用、认识论、社会、保护和个人。其中，实用和认识论价值观占主导地位，通常与Claude的训练目标——即帮助他人、无害且诚实——相一致。\u003Cbr>● Claude最常见的价值观，如乐于助人（23.4%）、专业精神、透明度和清晰度，具有情境无关性，反映了其作为服务导向型助手的角色。相比之下，人类的价值观，如真实性和效率，则更为多样化。\u003Cbr>● 许多价值观具有明显的情境特异性。例如，在关系建议中会出现健康界限的概念；在讨论有争议的历史事件时，会强调历史准确性；而在AI治理的背景下，则会突出人的自主性。\u003Cbr>● 在支持性情境下，Claude倾向于反映人类价值观（反映率为20.1%），但在遇到抵制时，却会表达相反的价值观，尤其是在面对不道德或违反政策的请求时（例如，以“伦理正直”对抗“道德虚无主义”）。\u003Cbr>● 明确的价值表达（如“我重视透明度”）更多出现在抵抗或重新构建话语的时刻，特别是在涉及认识论和伦理原则时，比如智力诚实和防止伤害。这表明当系统面临挑战时，其价值观才会最为显现。\u003Cbr>● 在Claude的不同版本中，3 Opus表现出更加情感细腻且具有伦理根基的价值观（如学术严谨、情感真实），并且在支持与抵制方面都比3.5\u002F3.7 Sonnet表现出更强的倾向。 | [论文](https:\u002F\u002Fassets.anthropic.com\u002Fm\u002F18d20cca3cde3503\u002Foriginal\u002FValues-in-the-Wild-Paper.pdf), [推文](https:\u002F\u002Fx.com\u002FAnthropicAI\u002Fstatus\u002F1914333220067213529) |\n| 8) 评估大语言模型的目标导向性  提出了一套新框架，用于评估大语言模型是否能够有效地利用其能力来实现给定目标。研究发现，即使是GPT-4o和Claude 3.7等顶尖模型，在信息收集和综合任务方面也远未达到完全的目标导向性，尽管它们在孤立的子任务中表现良好。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.11844), [推文](https:\u002F\u002Fx.com\u002Ftom4everitt\u002Fstatus\u002F1912806499862139275), [GitHub](https:\u002F\u002Fgithub.com\u002FCrista23\u002Fgoal_directedness_llms) |\n| 9) General-Reasoner  General-Reasoner是一种强化学习方法，通过使用包含23万道题目的数据集以及一个基于模型的验证器，该验证器能够理解超越精确匹配的语义，从而提升大语言模型在不同领域的推理能力。它在通用推理任务（MMLU-Pro、GPQA、SuperGPQA）和数学任务（MATH-500、GSM8K）上均优于SimpleRL和Qwen2.5等强大基线，取得了超过10分的优势，且未牺牲数学推理能力。 | [论文](https:\u002F\u002Fgithub.com\u002FTIGER-AI-Lab\u002FGeneral-Reasoner\u002Fblob\u002Fmain\u002FGeneral_Reasoner.pdf), [推文](https:\u002F\u002Fx.com\u002FWenhuChen\u002Fstatus\u002F1912242238110789671) |\n| 10) Tiny Reasoning Models  
Tina是一系列15亿参数的推理模型，采用基于LoRA的强化学习（RL）进行训练，以极低成本实现高精度的推理能力。在AIME和MATH等推理任务上，其表现优于或与全微调模型持平，而每次训练后的成本仅约9美元。这表明，高效的推理能力可以通过对小型模型的少量更新来实现。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.15777) |\n\n## 本周机器学习顶级论文（4月14日至4月20日）- 2025年\n| **论文**  | **链接** | \n| ------------- | ------------- |\n| 1) GUI-R1  新加坡国立大学和中国科学院的研究人员提出了GUI-R1，这是一个强化学习（RL）框架，旨在通过统一的动作空间建模来提升图形用户界面（GUI）智能体的性能。关键见解包括：   \u003Cbr>● 强化微调（RFT）优于监督微调（SFT）——GUI-R1采用了受DeepSeek-R1等方法启发的强化微调技术，显著减少了训练数据需求。它仅使用3,000个精心筛选的示例，而此前的模型则需要数百万个样本。   \u003Cbr>● 统一的动作空间与奖励建模——作者引入了一个覆盖不同平台（Windows、Linux、MacOS、Android和Web）动作的统一动作空间。这使得评估GUI操作时能够获得一致的奖励信号，从而增强模型的适应性和泛化能力。   \u003Cbr>● 以极少量数据实现卓越性能——GUI-R1仅使用0.02%的训练数据（3,000个样本对比1,300万个），便超越了OS-Atlas等当前最先进的方法。在涵盖移动、桌面和Web平台的八项基准测试中，GUI-R1在接地能力、低级任务以及高级GUI任务处理方面均表现出显著提升。   \u003Cbr>● 高效训练与强大泛化能力——通过利用如分组相对策略优化（GRPO）等策略优化算法，GUI-R1能够快速收敛至高性能，即使在资源受限的情况下也能展现出鲁棒性和高效性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.10458) |\n| 2) 基于强化学习扩展扩散型大语言模型的推理能力  提出d1，一种两阶段方案，为掩码扩散型大语言模型赋予强大的逐步推理能力。   \u003Cbr>● 两阶段流程（SFT → diffu‑GRPO）——d1首先在包含1,000个样本的s1K数据集上进行监督微调，随后使用新的diffu‑GRPO目标执行特定任务的强化学习，其效果远超单独任一阶段。   \u003Cbr>● diffu‑GRPO：针对掩码扩散型LLM的强化学习——通过(i)平均场序列对数似然近似和(ii)带有随机提示掩码的一次性逐token对数似然估计器，将GRPO扩展到扩散型LLM，从而能够在单次生成中完成多次梯度更新。   \u003Cbr>● 在四项推理基准测试中持续获益——在GSM8K、MATH500、Countdown和Sudoku任务中，diffu‑GRPO的表现优于SFT，而完整的d1‑LLaDA变体则取得了最佳成绩（例如，在256 token条件下，GSM8K准确率为81.1%，MATH500为38.6%，较基线提升5–12个百分点）。   \u003Cbr>● 在7–8B参数量级模型中具有竞争力——d1‑LLaDA在GSM8K任务中表现优于DeepSeek‑7B、Mistral‑7B和Llama‑3‑8B，并在同一参数规模下位居MATH500第二名。   \u003Cbr>● 更长的解码长度带来“顿悟”时刻——当生成长度达到512 token时，模型展现出自我验证和回溯能力；有效token使用率平稳增长，与推理过程中计算资源的动态调整趋势一致。   \u003Cbr>● 随机掩码加速强化学习——消融实验表明，在diffu‑GRPO过程中采用随机提示掩码能够加速收敛并提高正确率，同时减少所需的在线生成次数。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.12216) |\n| 3) 利用推理模型增强非推理模型  研究人员探讨如何将顶级大语言模型中密集的推理输出（答案及解释）提炼并迁移到更轻量级、不显式进行逐步推理的模型中。通过在小型模型上微调来自先进推理模型的高质量最终答案（以及可选的总结性思维轨迹），他们在多个基准测试中展示了性能的持续提升。   \u003Cbr>● 
推理时间扩展与知识蒸馏——虽然像DeepSeek-R1和OpenAI-o1这样的大型模型可以分配更多计算资源来生成更优质的推理轨迹，但本文则专注于系统性地将这些丰富的最终答案（以及可能的推理步骤摘要）迁移到更紧凑的模型中。   \u003Cbr>● 数据整理——作者从多个开源库（包括Infinity Instruct、CodeContests、FLAN等）中提取提示，并利用DeepSeek-R1生成最终答案及详细推理过程，构建了一个包含130万条记录的数据集。   \u003Cbr>● 三种微调策略——(1) 使用现有开源数据集中原始的基础答案，(2) 仅对推理模型的最终答案部分进行微调，以及(3) 将总结性的思维链与最终答案结合。采用第二种策略训练的模型在数学\u002F编程任务中表现优异，而第三种方法则更适合对话或对齐相关的任务。   \u003Cbr>● 实证收益——在Qwen2.5-32B上微调推理模型的最终答案后，GSM8K任务的准确率提升至92.2%，HumanEval任务的准确率提升至90.9%。采用思维摘要的方法则提升了另一组基准测试（GPQA和聊天类测试）的成绩。然而，嵌入“思维轨迹”有时会导致指令执行严格度的轻微下降（IFEval）。   \u003Cbr>● 权衡与未来工作——蒸馏先进的推理数据确实有助于小型模型，但究竟应包含多少推理轨迹则取决于具体领域。作者建议，通过更精细的方式将推理步骤无缝融入最终答案中（例如使用专门的提示或部分合并），可以进一步提升性能并避免对齐问题的恶化。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.09639) |\n| 4) AgentA\u002FB  AgentA\u002FB是一个完全自动化的A\u002FB测试框架，它用大规模基于大语言模型的智能体替代真实的人类流量。这些智能体在真实的网络环境中模拟出逼真的、由意图驱动的用户行为，从而实现更快、更低成本且无风险的用户体验评估——甚至可以在亚马逊等真实网站上进行。关键见解：   \u003Cbr>● 模块化智能体仿真流程——四个组件——智能体生成、条件准备、交互循环和事后分析——允许在实际网页上使用多样化的LLM角色进行即插即用式的仿真。   \u003Cbr>● 真实世界保真度——系统会将实时DOM解析为JSON格式，从而支持结构化的交互循环（搜索、筛选、点击、购买），这些循环由LLM推理结合Selenium完成。   \u003Cbr>● 行为真实性——仿真智能体的行为模式更具目标导向性，但与100万真实亚马逊用户的交互模式相当（例如，会话时间更短，但购买率相似）。   \u003Cbr>● 设计敏感性——一项比较完整与简化筛选面板的A\u002FB测试显示，实验组中的智能体点击次数更多、更频繁地使用筛选功能，并且购买量也更大。   \u003Cbr>● 包容性原型设计——智能体可以代表难以触及的人群（如低技术用户），从而使早期的UX测试更加包容且无风险。   \u003Cbr>● 显著成果——AgentA\u002FB展示了LLM智能体如何能够补充而非取代传统的A\u002FB测试，提供一个新的部署前仿真层。这不仅能够加速迭代、减少开发浪费，还能在无需立即上线真实流量的情况下支持UX的包容性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.09723) |\n| 5) 推理模型无需思考即可高效运行  本文挑战了大语言模型中长链式思维（CoT）推理的必要性，提出了一种名为NoThinking的简单提示方法，该方法绕过了显式的“思考”步骤。令人惊讶的是，在同等甚至更低的计算预算下，NoThinking的表现与传统推理方法相当或更好，尤其是在配合并行解码和best-of-N选择时。关键见解：   \u003Cbr>● NoThinking会在提示前添加一个虚拟的“思考”模块，然后直接跳转到最终答案。   \u003Cbr>● 尽管省略了结构化的推理过程，但在许多基准测试中，NoThinking在pass@k（1–64）上的表现仍优于“思考”方法，特别是在token数量受限的情况下。   \u003Cbr>● 在并行扩展条件下，NoThinking在pass@1上的准确率高于“思考”方法，同时使用的token数量仅为后者的四分之一，延迟更是低至九分之一。   \u003Cbr>● 测试任务包括：竞赛数学（AIME24\u002F25、AMC23、OlympiadBench）、编程（LiveCodeBench）以及形式化定理证明（MiniF2F、ProofNet）。   \u003Cbr>● 
NoThinking被证明能够提供更优的准确率-延迟权衡，并且在多种任务中表现出良好的泛化能力。结果：   \u003Cbr>● 低预算下的胜利：在AMC23任务中（700 token），NoThinking的准确率为51.3%，而“思考”方法仅为28.9%。   \u003Cbr>● 更好的扩展性：随着k值的增加，NoThinking始终超越“思考”方法。   \u003Cbr>● 效率前沿：在各项基准测试中，NoThinking占据了准确率-成本帕累托前沿。   \u003Cbr>● 并行的优势：借助简单的置信度或多数投票策略，NoThinking结合best-of-N选择，在pass@1任务上以显著更低的延迟击败了完整的“思考”方法。 | [论文](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2504.09858) |\n| 6) SocioVerse  复旦大学的研究人员及其合作者提出了SocioVerse，这是一个基于大语言模型智能体、与现实用户行为相一致的大规模社会仿真世界模型。关键思想包括：   \u003Cbr>● 四重对齐框架——SocioVerse从四个维度出发，解决了仿真环境与现实对齐的主要挑战：   \u003Cbr>● 三个代表性仿真——SocioVerse通过以下方式展示了其通用性：   \u003Cbr>● 令人印象深刻的实证准确性——   \u003Cbr>● 消融实验发现——移除先验的人口分布和用户知识会严重降低选举预测的准确率（从0.80降至0.60），凸显了真实人口建模的重要性。   \u003Cbr>● 朝着可信的虚拟社会迈进——SocioVerse不仅标准化了可扩展的社会仿真，还提供了一个用于检验社会政治假设（如公平性、政策变化）的沙盒，从而弥合了AI智能体系统与传统社会科学之间的差距。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.10157) |\n| 7) DocAgent  Meta AI的研究人员推出了DocAgent，这是一个集成工具、依赖感知的框架，能够将大型复杂代码库转化为编写良好的文档字符串。关键思想包括：   \u003Cbr>● 用于上下文构建的拓扑导航器——DocAgent解析仓库的抽象语法树（AST），构建依赖关系有向无环图（DAG），并按照拓扑顺序对组件进行文档化，确保每个函数\u002F类在其前置依赖项之后才被访问，从而实现上下文的逐步积累，防止上下文长度过长。   \u003Cbr>● 专业化分工的智能体团队——五个智能体协同工作：Reader负责代码分析，Searcher收集内部和外部参考文献，Writer起草文档字符串，Verifier对其进行批评和修订，而Orchestrator则管理迭代过程，直至质量稳定。   \u003Cbr>● 自适应上下文管理——当检索到的上下文超过模型的token预算时，Orchestrator会裁剪掉优先级较低的部分，同时保留整体结构，以保持生成的效率和忠实性。   \u003Cbr>● 三方面的自动化评估——新框架会对每个文档字符串的完整性（章节覆盖率）、有用性（LLM作为评判者的语义效用）以及真实性（与代码DAG的实体对应情况）进行评分。   \u003Cbr>● 相比基线显著提升——在九个Python代码库的366个组件上，DocAgent + GPT‑4o‑mini将完整性提升至0.934，而基线仅为0.815；有用性提升至3.88 \u002F 5，基线为2.95；真实性（存在比例）提升至95.7%，基线仅为61.1%。FIM基线的表现则更为糟糕。   \u003Cbr>● 导航器至关重要——一项随机化处理顺序的消融实验显示，有用性下降了0.44，真实性下降了7.9个百分点，证实了依赖感知遍历的重要性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.08725) |\n| 8) SWE-PolyBench  SWE-PolyBench是一个新的多语言基准测试，用于评估编码智能体在Java、JavaScript、TypeScript和Python等真实软件任务中的表现。它引入了基于执行的评估方法和语法树指标，并揭示当前智能体在处理复杂任务时存在困难，且在不同语言间的性能表现并不一致。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.08703v1) |\n| 9) 大语言模型推理前沿综述  本综述根据推理发生的时间（推理时 
vs 训练时）以及系统的架构（独立式 vs 智能体式或多智能体式）对大语言模型的推理方法进行了分类。它重点介绍了诸如学习推理（例如DeepSeek-R1）和智能体工作流（例如OpenAI Deep Research）等趋势，涵盖了提示工程、输出精炼以及PPO和验证者训练等学习策略。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.09037) |\n| 10) 身体化智能体、智慧城市和地球科学领域的进展  本文综述了空间智能如何在不同学科中体现——从身体化智能体到城市和全球系统——通过将人类的空间认知与大语言模型处理空间记忆、表征和推理的能力联系起来。它提供了一个统一的框架，用于连接人工智能、机器人技术、城市规划和地球科学领域的研究，突出了大语言模型不断发展的空间能力及其跨学科潜力。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.09848) |\n\n## 本周机器学习顶级论文（4月6日–4月13日）- 2025年\n| **论文**  | **链接** |\n| ------------- | ------------- |\n| 1) AI科学家V2  AI科学家-V2在前代基础上进行了优化和扩展，取得了新的里程碑：自主生成了一篇被研讨会接受的研究论文。该系统不再依赖人工编写的代码模板，引入了代理式树搜索方法以实现更深入的探索，利用视觉-语言模型优化图表，并通过成功通过同行评审展示了其在现实世界中的强大能力。   \u003Cbr>● 更强的自主性 – 摆脱对人工代码模板的依赖，支持在不同机器学习领域开箱即用。   \u003Cbr>● 代理式树搜索 – 通过分支式探索系统地搜索和优化假设，由全新的实验管理代理负责协调。   \u003Cbr>● VLM反馈循环 – 将视觉-语言模型集成到审稿流程中，用于评估和改进实验图表及论文的整体美观度。   \u003Cbr>● 研讨会接受 – 自主生成了三篇完全自动化的ICLR研讨会论文，其中一篇被接受，证明了AI驱动的端到端科学发现的可行性。 | [论文](https:\u002F\u002Fpub.sakana.ai\u002Fai-scientist-v2\u002Fpaper\u002Fpaper.pdf), [推文](https:\u002F\u002Fx.com\u002FSakanaAILabs\u002Fstatus\u002F1909497165925536212) |\n| 2) 浏览型智能体基准测试  OpenAI推出了BrowseComp基准，包含1,266个问题，要求AI智能体在网络上寻找难以获取且相互关联的信息。与SimpleQA等饱和型基准不同，BrowseComp需要智能体在多个网站上进行持续而富有创造性的搜索，为真实世界的网页浏览型智能体提供了一个强大的测试平台。  主要发现：   \u003Cbr>● 极难的问题：经验证，这些任务即使在10分钟内也难以被人类解决，GPT-4o（无论是否启用网络浏览）、OpenAI o1以及早期的Deep Research模型同样无法完成。   \u003Cbr>● 人类表现不佳：仅有29.2%的问题被人类解决（即便有2小时的时间限制），其余70.8%则被放弃。   \u003Cbr>● 模型表现：   \u003Cbr>● 推理能力至关重要：随着推理能力的提升，准确率也随之提高。通过64次并行尝试并采用最佳N条结果聚合的方法，Deep Research显著提升了性能（较单次尝试提高了15–25%）。   \u003Cbr>● 工具使用并非万能：OpenAI o1（不使用网络浏览但推理能力更强）的表现优于启用了网络浏览的GPT-4.5，这表明仅依靠工具是不够的，战略性的推理才是关键。   \u003Cbr>● 校准问题：具备网络浏览能力的模型往往对错误答案表现出过度自信，揭示了当前不确定性估计方面的局限性。   \u003Cbr>● 数据多样性：涵盖广泛的主题领域，包括电视\u002F电影、科学、艺术、体育、政治、地理等。 | [论文](https:\u002F\u002Fcdn.openai.com\u002Fpdf\u002F5e10f4ab-d6f7-442e-9508-59515c65e35d\u002Fbrowsecomp.pdf), [博客](https:\u002F\u002Fopenai.com\u002Findex\u002Fbrowsecomp\u002F), 
[推文](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1910393421652520967) |\n| 3) OLMOTrace  艾伦人工智能研究所与华盛顿大学联合发布了OLMOTRACE，这是一个实时系统，能够将大型语言模型生成的文本追溯到其原始训练数据中的逐字来源，即使是在包含数万亿标记的语料库中也能实现。   \u003Cbr>● 功能：对于给定的LM输出，OLMOTRACE会高亮显示与训练数据片段的精确匹配，并允许用户查看这些匹配所对应的完整文档。可以将其视为通过词汇查找“逆向工程”模型的回答。   \u003Cbr>● 工作原理：   \u003Cbr>● 支持的模型：适用于OLMo系列模型（如OLMo-2-32B-Instruct）及其完整的预训练\u002F中期训练\u002F后训练数据集，总规模达4.6T标记。   \u003Cbr>● 应用场景：   \u003Cbr>● 基准测试：   \u003Cbr>● 非RAG：它是在生成之后进行检索，不会改变输出内容，与检索增强型生成不同。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.07096), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1910323386603262316), [博客](https:\u002F\u002F5910970.hs-sites.com\u002Folmotrace-points-model-output-back-to-training-data?ecid=ACsprvuggQcD4yCdO--rKTZKDvmczdSQkb96ct95zLH9eiysrXjF_WuKgsmIMaz8byfiL1H1-2A6&utm_campaign=AI2%20Newsletter&utm_medium=email&_hsenc=p2ANqtz-__MqUAVPXfHPpHpf2xC86iZG8qC3J-z5nW141VBN9gZW4j61ymW3dM7mhkiHGTWtjQt3Eao7Cqf7pB1k24CfEhYe9fmA&_hsmi=355925505) |\n| 4) 通过强化学习实现简洁推理  这篇新论文提出了一种新的训练策略，利用强化学习促进大型语言模型产生简洁而准确的推理过程。该研究挑战了“长回答更能提高准确性”的观点，提供了理论和实证证据，表明简洁性往往与更好的性能相关。   \u003Cbr>● 长并不等于更好的推理 – 作者从数学上证明，使用PPO进行强化学习时，往往会生成不必要的长回答，尤其是在答案错误的情况下。令人惊讶的是，无论是推理型还是非推理型模型，较短的输出反而与正确答案更为相关。   \u003Cbr>● 分两阶段的强化学习：他们提出了一种两阶段的强化学习策略：(1) 先在难题上训练以建立推理能力（此时长度可能会增加），然后(2) 在偶尔可解的任务上进行微调，以强制实现简洁的思维链，同时不降低准确性。仅第二阶段就能将标记使用量大幅减少50%以上，且不会影响准确性。   \u003Cbr>● 适用于少量数据 – 他们的方法仅需4–8个训练样本即可取得显著效果，在数学和STEM领域的基准测试中均表现出色（MATH、AIME24、MMLU-STEM）。例如，在MMLU-STEM上，他们将准确率提高了12.5%，同时将回答长度缩短了两倍以上。   \u003Cbr>● 低采样温度下的优势 – 经过微调的模型在温度降至0时仍能保持稳健表现。在温度为0时，微调后的模型比基线高出10–30%，显示出更强的确定性推理能力。   \u003Cbr>● 实际意义 – 除了改善模型输出外，这种方法还能降低延迟、成本和标记消耗，使大型语言模型更易于部署。作者还建议在PPO过程中将λ值设置为小于1，以避免不稳定并引导模型生成正确的回答。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.05185), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1909634850304503977) |\n| 5) 重新思考预训练中的自我反思  
自我反思——即大型语言模型识别并纠正自身推理的能力——通常被认为是由强化学习或微调带来的。然而，这篇论文却提出了不同的观点：自我反思实际上是在预训练过程中自然涌现的。作者通过引入对抗性推理任务，证明了随着计算资源的增加，自我反思和纠错能力会稳步提升，即使在没有监督式后训练的情况下亦然。  主要贡献：   \u003Cbr>● 提出两种类型的自我反思：   \u003Cbr>● 构建了六个对抗性数据集（GSM8K、TriviaQA、CruxEval、BBH），用于测试数学、编程、逻辑和知识领域的自我反思能力。在GSM8K-Platinum数据集上，显式自我反思率随着预训练标记数量的增加从~10%提升至60%。   \u003Cbr>● 证明简单的提示词，如“等待”，能够可靠地诱发自我反思。   \u003Cbr>● 评估了40个OLMo-2和Qwen2.5的检查点，发现预训练计算资源与准确性和自我反思率之间存在强烈的正相关关系。  为什么重要：   \u003Cbr>● 自我反思是推理能力的先决条件，可以在RLHF或测试时解码策略之前就发展出来。   \u003Cbr>● 含义：我们可以通过更优质的预训练数据和更大的计算规模来培养高级的推理能力，而不必完全依赖后训练技巧。   \u003Cbr>● 他们还指出一种权衡：更多的预训练计算资源可以减少对昂贵的测试时计算的需求，例如长篇的思维链追踪。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.04022), [推文](https:\u002F\u002Fx.com\u002FashVaswani\u002Fstatus\u002F1909642828554387675) |\n| 6) 面向小型语言模型的高效知识图谱推理  LightPROF是一个轻量级框架，允许小型语言模型通过结构化提示在知识图谱（KG）上执行复杂推理。主要亮点：   \u003Cbr>● 检索-嵌入-推理流水线 – LightPROF引入了三步架构：   \u003Cbr>● 即插即用且参数高效 – LightPROF仅训练适配器和投影模块，因此可以无缝集成到任何开源语言模型中（如LLaMa2-7B、LLaMa3-8B），而无需昂贵的微调。   \u003Cbr>● 性能超越大型模型 – 尽管使用的是小型语言模型，LightPROF在KGQA任务上仍优于StructGPT（ChatGPT）和ToG（LLaMa2-70B）：在WebQSP任务上达到83.8%（对比72.6%），在CWQ任务上达到59.3%（对比57.6%）。   \u003Cbr>● 极高的效率 – 与StructGPT相比，LightPROF将输入标记减少了98%，运行时间缩短了30%，同时在复杂的多跳问题中仍能保持准确性和稳定的输出。   \u003Cbr>● 消融实验启示 – 移除结构化信号或训练步骤会严重削弱性能，从而证实了知识适配器和检索策略的关键作用。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.03137), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1910319109096747191) |\n| 7) 计算机智能体竞技场  计算机智能体竞技场是一个全新的开放平台，旨在通过虚拟桌面环境对基于大型语言模型和视觉-语言模型的智能体在编码、编辑和网页导航等实际计算机操作任务上的表现进行基准测试。初步结果显示，OpenAI和Anthropic以较为有限的成功率领先，而该平台计划通过众包任务、智能体提交以及基础设施的开源化不断发展壮大。  [报告](https:\u002F\u002Farena.xlang.ai\u002Fblog\u002Fcomputer-agent-arena) | [推文](https:\u002F\u002Fx.com\u002FBowenWangNLP\u002Fstatus\u002F1909618451259572328) |\n| 8) 代理式知识型自我意识  KnowSelf是一个新框架，引入了代理式知识型自我意识，使大型语言模型智能体能够根据情境复杂性动态决定何时进行反思或寻求知识，从而模拟人类的认知过程。通过使用“快速”、“慢速”和“知识型”思考的特殊标记，KnowSelf降低了推理成本，并在ALFWorld和WebShop任务上以最少的外部知识实现了最先进的性能。 | 
[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.03553v1) |\n| 9) 测试时训练实现一分钟视频生成  本文提出了一种名为TTT层的新序列建模组件，其隐藏状态是由自监督损失在测试时更新的神经网络。通过将这些组件集成到预训练的扩散模型中，作者实现了仅凭分镜头脚本即可一次性生成一分钟多场景视频的功能，在人类评估中比Mamba 2和DeltaNet等强劲基线高出34 Elo点。 | [论文](https:\u002F\u002Ftest-time-training.github.io\u002Fvideo-dit\u002Fassets\u002Fttt_cvpr_2025.pdf), [推文](https:\u002F\u002Fx.com\u002Fkaransdalal\u002Fstatus\u002F1909312851795411093) |\n| 10) NoProp  NoProp是一种新颖的无梯度学习方法，其中每个神经网络层独立学习如何对目标的噪声版本进行去噪，灵感来源于扩散模型和流匹配技术。与反向传播不同，它避免了层次化的表征学习，在MNIST和CIFAR等图像分类基准上实现了具有竞争力的性能和效率。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24322) |\n\n## 本周顶级机器学习论文（3月31日—4月6日）—2025年\n| **论文**  | **链接** | \n| ------------- | ------------- |\n| 1) PaperBench  OpenAI推出了一项全新基准测试——PaperBench，旨在检验人工智能代理能否从零开始复现前沿的机器学习研究论文。   ● 一项严格的复现挑战——PaperBench评估代理对ICML 2024会议中全部ML论文的复现能力（共20篇，涵盖12个研究领域）。代理需理解论文内容、从头构建代码库，并运行实验以匹配结果。每篇论文都配有细粒度的评分标准（总计约8,316项任务），该标准由原作者共同设计。   ● 基于大语言模型的自动评分——为实现规模化评估，团队开发了一种基于评分标准的评判器（o3-mini结合支架结构），其对复现结果的打分与人类专家高度一致（F1=0.83）。同时，他们还发布了JudgeEval基准，用于评估评判器的准确性。   ● 领先模型的表现仍然有限——Claude 3.5 Sonnet以21.0%的成绩位居榜首，其次是o1（13.2%）和GPT-4o（4.1%）。即便延长运行时间并优化提示词（IterativeAgent），也没有任何模型得分超过26.0%。相比之下，机器学习博士在48小时内完成3篇论文子集的复现，得分达到41.4%，这表明人类在长周期代理任务上仍具优势。   ● Code-Dev变体用于轻量级评估——简化版的PaperBench Code-Dev跳过执行环节，仅评估代码结构。o1在此版本中获得了43.4%的分数，显示出在排除运行时问题后更具潜力。   ● 失败模式与启示——模型常常“过早放弃”，缺乏战略规划且无法迭代改进。Claude在使用BasicAgent（更自由的形式）时表现更好，而o1则受益于IterativeAgent（结构化提示）。这凸显了代理对提示词和支架结构的高度敏感性。   ● 开源发布——PaperBench（包含评分标准、评分基础设施及复现结果）已完全开源，旨在推动长周期代理任务及自主AI研发的进一步进展。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.01848), [推文](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1907481490457506235), [GitHub](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fpreparedness) |\n| 2) Command A: 一款企业级LLM  Cohere宣布推出Command A，这是一款拥有1110亿参数的开放权重LLM，专为企业级RAG、智能代理、代码生成及多语言任务打造。主要贡献如下：   ● 模块化专家融合以实现领域精通——Command 
A未采用单一的后期训练方式，而是采用了去中心化的训练流程。针对特定领域（如数学、RAG、多语言、安全、代码等）分别微调专家模型，随后通过高效的加权参数混合技术将其合并为一个模型。这种方式仅导致平均性能下降约1.8%，却能最大程度保留各专家的能力。   ● 混合架构提升长上下文效率——Command A将滑动窗口层与全注意力层交错排列，实现了256k上下文长度的支持，同时大幅降低了KV缓存内存的使用量——例如，在128k上下文长度下，仅需LLaMA 3 70B的约33%。在RULER基准测试中，其得分高达95.0%，优于大多数同类长上下文模型。   ● 出色的代理能力——Command A专为RAG、工具使用及ReAct风格的代理设计，在TauBench和BFCL基准测试中均超越GPT-4o和Claude 3.5。工具使用的训练数据由人工标注与合成数据混合而成，随后通过CoPG和SRPO（自我改进偏好优化）进行对齐。   ● 行业内顶尖的企业级评估——在实际生成任务（如聊天摘要、FAQ生成）以及RAG用例（如冗长的工作场所政策文档）中，Command A以94.2%的通过率、4.73的正确率以及91%的不可回答QA准确率稳居榜首。   ● 多语言卓越表现——Command A接受了23种全球语言的训练，并进行了大量数据筛选与偏好调整。它在方言对齐方面排名第一（ADI2），平均LPR（语言一致性）达90.3%，并在所有语言的人工Arena式胜率测试中均优于LLaMA 3.3、GPT-4o和DeepSeek。   ● 为人类对齐的打磨——最终对齐采用了离线SRPO与在线CoPG相结合的乒乓循环，并辅以RLHF。这一过程使Command A在代码相关任务上的人类胜率提升了17个百分点，在推理任务上提升了10个百分点，最终使其相对于GPT-4o的胜率达到了持平水平（约50.4%）。   ● 快速、高效且开放——尽管功能强大，Command A仅需2张A100或H100显卡即可运行，且每秒可生成156个token，速度超过GPT-4o和DeepSeek。模型权重已在Hugging Face上以CC-BY-NC协议开源。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.00698), [推文](https:\u002F\u002Fx.com\u002Fnrehiew_\u002Fstatus\u002F1908181303339471020), [模型](https:\u002F\u002Fhuggingface.co\u002FCohereForAI\u002Fc4ai-command-a-03-2025) |\n| 3) CodeScientist  AI2的研究人员发布了CodeScientist系统，该系统能够通过基于代码的实验自主生成并验证科学假设。它是首批在极少人工干预的情况下产生有效发现的系统之一。核心思想如下：   ● 以代码为导向的科学代理——CodeScientist会阅读科研论文，并利用经过验证的Python代码模块（如用于分析、仿真等）组装实验。其工作流程分为五个步骤：构思→计划→代码执行→报告→元分析。   ● 经验证的AI发现——CodeScientist基于50篇关于智能代理和虚拟环境的AI研究论文，提出了19项发现。其中6项被判定为科学合理且具有创新性。示例：   ● 人类引导下的自主性——完全自动化是可行的，但简短的人类反馈（如对想法的排序）能显著提升产出质量。人机协作可以改善想法的选择和实验调试。   ● 挑战依然存在——尽管取得了一些成功，但超过一半的自动生成的实验因代码错误而失败，而非科学原理上的缺陷。结果仍需同行评审来验证，且当前系统缺乏深入的方法论严谨性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.22708), [博客](https:\u002F\u002Fallenai.org\u002Fblog\u002Fcodescientist), [GitHub](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fcodescientist) |\n| 4) 检索增强型推理模型  提出RARE，这是一种全新的领域特定LLM训练范式，专注于推理而非记忆。核心思想如下：   ● 受布鲁姆认知分类学启发——RARE将LLM的训练重点从知识的记忆（“记住”）转移到知识的应用与评估（“分析”、“创造”）。它将领域知识（从外部检索）与领域思维（在训练过程中习得）分离，从而在有限的参数预算下实现更好的性能。   ● 
开卷式准备训练——RARE将检索到的知识注入训练提示中，使模型学习推理模式，而非死记硬背事实。这种开卷、以推理为主的设置优于传统的SFT和RAG方法，尤其是在医学领域。   ● 小模型也能获得巨大精度提升——在五项医学问答基准测试中，RARE训练的Llama-3.1-8B和Qwen-2.5-7B均优于GPT-4+RAG组合，精度提升最高可达20%（例如，PubMedQA：78.63% vs. GPT-4的75.2%；CoVERT：74.14% vs. GPT-4的65.67%）。   ● 通过蒸馏与适应性重试进行训练——RARE从强大的教师模型（如QwQ-32B）中提炼答案（及推理路径），不断优化输出直至找到正确答案。这一过程生成高质量的数据集，教授情境化、基于案例的思考方式。   ● 检索的新角色——与传统RAG不同（仅在推理阶段使用），RARE在训练过程中也利用检索来塑造推理过程。它将知识整合（p(k|x, R(x))）和推理（p(r|x, R(x), k)）视为独立步骤，用应用取代记忆。总体而言，这项工作重新定义了领域特定智能的LLM训练方式：将事实外化，将推理内化。这使得小模型无需过拟合或产生幻觉即可实现强劲性能。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.23513), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1907796990966247484) |\n| 5) 为什么LLM会关注第一个token？  这篇新论文解释了LLM为何会执着地将注意力集中在第一个token上——这一现象被称为注意力sink。他们的理论认为，这是防止深层Transformer中表征坍缩的一种有效技巧。   ● Sinks = 过度混合的保护机制——具有长上下文和深层结构的LLM容易发生信息过度混合，导致所有token的嵌入向量过于相似（即秩坍缩或过度挤压）。而注意力sink——即许多注意力头固定在⟨bos⟩ token上——则起到了无操作的作用，减少了token之间的交互，从而保持各层表征的多样性。   ● 对Gemma和LLaMa的深入实验——Gemma 7B中的扰动测试表明，⟨bos⟩显著减缓了变化在模型中的传播。与此同时，在LLaMa 3.1模型中，超过80%的注意力头在405B版本中表现出强烈的sink行为，这支持了大型模型需要更强sink的观点。   ● Sinks自然形成——即使没有特殊预训练，sink也倾向于在第一个位置形成，这并非因为⟨bos⟩ token本身，而是由于其位置所致。然而，如果在训练过程中固定了⟨bos⟩并在后期移除，模型性能就会崩溃，这表明sink的形成依赖于数据。   ● 理论基础——作者将sink的出现与雅可比矩阵范数的界限联系起来，证明sink可以降低模型对token扰动的敏感性。他们的数学分析显示，更深的模型和更长的上下文需要更强的sink。   ● 分层动态洞察——部分注意力头将⟨bos⟩作为“默认”目标，除非遇到特殊模式（如撇号）才会触发真正的计算。这支持了一种条件性注意力机制——除非有其他需求，否则就关注⟨bos⟩。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.02732), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1908187563422261411) |\n| 6) 自我进化型多智能体模拟：用于真实临床交互  本文介绍MedAgentSim，这是一个完全自动化的开源医院模拟系统，其中由LLM驱动的智能体会在动态诊断环境中模拟医生与患者之间的互动。与以往静态的QA基准不同，MedAgentSim通过多轮对话、检查请求和自我改进，模拟了真实的临床工作流程。更多关于这篇论文的信息：   ● 主动的医生智能体——MedAgentSim要求LLM医生智能体参与多轮问诊，申请实验室检查和影像学检查（如心电图、X光），并不断迭代完善诊断，这使其远比预先填充的医疗QA数据集更加真实。   ● 基于记忆与反思的自我改进——系统会维护成功和失败诊断的缓冲区。它利用检索到的过往病例（通过kNN）、链式思考推理以及集成方法来逐步提高性能。误诊会在纳入记忆之前触发反思阶段。   ● 完全自动或人机协作——用户可以选择控制医生或患者智能体。模拟资产基于2D游戏引擎（Phaser）构建，智能体能够导航、对话并与虚拟医疗工具互动。   ● 
各类基准测试中的显著性能提升——在NEJM、MedQA和MIMIC-IV测试中，MedAgentSim（使用LLaMA 3.3）相比基线设置提高了6%至37%的性能，尤其在使用LLaVA解读医学图像的视觉-语言任务中表现突出。   ● 偏见分析与公平性关注——团队研究了在认知偏见和隐性偏见条件下诊断的准确性。GPT-4o和LLaMA等模型表现得比Mixtral\u002FMistral更为稳健，这突出了偏见意识评估的重要性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.22678), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1906719555482702147), [代码](https:\u002F\u002Fgithub.com\u002FMAXNORM8650\u002FMedAgentSim) |\n| 7) 开放式深度搜索  Sentient、华盛顿大学、普林斯顿大学和加州大学伯克利分校的研究人员推出了开放式深度搜索（ODS），这是一个开源的搜索AI框架，其性能可与GPT-4o Search Preview和Perplexity Sonar等顶级专有系统相媲美。关键见解如下：   ● 两个开放组件：搜索 + 推理——ODS由两部分组成：(1) 开放式搜索工具，利用查询改写、片段重新排名以及特定站点逻辑来检索和优化高质量的网络结果；(2) 开放式推理代理，负责协调工具使用（搜索、计算器等）以回答查询。提供两种变体：ODS-v1（ReAct）和ODS-v2（CodeAct）。   ● 开源SOTA性能——以DeepSeek-R1为基础LLM，ODS-v2在SimpleQA基准测试中获得88.3%的得分，在FRAMES基准测试中获得75.3%的得分，后者比GPT-4o Search Preview高出9.7%。ODS会根据查询调整每次搜索的数量（FRAMES基准测试中平均为3.39次），从而比固定查询次数的基线方案更高效地平衡成本与精度。   ● 优于Perplexity Sonar——在FRAMES和SimpleQA基准测试中，ODS+DeepSeek-R1的表现均优于Perplexity的旗舰搜索模型，即使是在涉及多跳问题、时间\u002F日期计算以及姓名消歧义等复杂推理任务中也是如此。   ● 基于代码的代理增强推理——ODS-v2建立在CodeAct的基础上，允许其编写并运行Python代码，以进行符号推理和调用工具。与ODS-v1中基于CoT的ReAct相比，这种方法在数值精度和任务灵活性方面更具优势。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.20201), [推文](https:\u002F\u002Fx.com\u002Fsewoong79\u002Fstatus\u002F1906595129965912341), [GitHub](https:\u002F\u002Fgithub.com\u002Fsentient-agi\u002FOpenDeepSearch) |\n| 8) 测试时的高效缩放与代码  Z1是一种新方法，旨在提高大型语言模型在推理阶段的计算效率。其核心思想是让LLM接受短长结合的代码推理轨迹训练，然后在推理时动态调整推理深度。主要贡献如下：   ● Z1-Code-Reasoning-107K数据集——他们构建了一个包含107,000个样本的数据集，涵盖简单和复杂编码问题的短长推理路径。这些轨迹是从QwQ-32B中提炼出来的，并配对使用，以帮助模型学会何时停止思考。   ● 转移的思考窗口——一种新的测试时策略，取消了显式的\u003Cthink>分隔符。取而代之的是，模型会根据问题难度动态调整推理token预算。简单问题采用浅层推理；复杂问题则设定上限（例如，最多4096个token），并通过提示引导模型完成答案。   ● 显著的效率提升——规模为7B的Z1-7B模型在多项推理任务中（MATH500、LiveCodeBench、GPQA Diamond）的表现与R1-Distill-Qwen-7B相当，但使用的推理token数量仅为后者的约30%。例如，在GPQA Diamond测试中，Z1-7B仅使用不到一半的token便取得了47.5%的得分。   ● 代码推理能力可迁移至通用任务——尽管Z1仅接受基于代码的CoT数据训练，但它在科学和数学等更广泛领域也表现出色，优于其他7B推理模型（如OpenThinker-7B、s1.1-7B）在多个基准测试中的表现。   ● 
什么使推理数据有效？——消融研究表明，数据集设计有两个关键杠杆：(1) 更长的推理轨迹能提高推理质量；(2) 更大的训练样本规模能提升平均思考时间和准确度，即使不改变轨迹长度。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2504.00810v1) |\n| 9) LLM高效推理综述  本综述聚焦于LLM中的推理经济性，分析如何在深奥推理性能与计算成本之间取得平衡。它回顾了训练后和推理阶段的低效现象、行为模式及潜在解决方案。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.24377), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1907072213142151488) |\n| 10) LLM中的隐藏事实知识  本研究提出了一套衡量LLM中隐藏知识的框架，表明模型内部编码的事实信息远多于其在输出中表达的内容，最多可高出40%。此外，研究还发现，某些答案虽然在模型内部已知，但却从未被生成，这凸显了QA任务测试时采样方法的关键局限性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.15299), [推文](https:\u002F\u002Fx.com\u002Fzorikgekhman\u002Fstatus\u002F1906693729886363861) |\n\n## 本周顶级机器学习论文（3月24日至3月30日）- 2025年\n| **论文**  | **链接** | \n| ------------- | ------------- |\n| 1) 追踪大型语言模型的思维过程  Anthropic的研究人员发布了用于深入探究LLM内部机制的新可解释性工具，并以Claude 3.5 Haiku作为实验平台。他们的两篇新论文展示了如何实时追踪模型内部的电路、计划和概念性思维等结构。主要发现如下：   ● 多语言“思维语言”——Claude在处理“小”或“相反”等概念时，无论是在英语、法语还是中文中，都表现出相似的方式，这表明存在一个共享的抽象表征层。随着模型规模的扩大，这些跨语言特性愈发显著，从而促进了不同语言之间的迁移学习。   ● 提前规划——甚至在写诗时也是如此——与预期相反，Claude会在动笔之前就规划好韵脚。例如，在生成“他的饥饿像一只饥饿的兔子”这一句时，它已经“决定”要与“抓住它”押韵。研究人员可以通过抑制或替换这一计划来动态地改变诗句的结尾。   ● 并行电路进行心算——Claude使用并行电路来计算总和：其中一个电路估算结果，另一个则精确计算最后一位数字。然而，它却会用人类式的逻辑来解释答案（如“进位1”），这揭示了其内部计算与口头解释之间存在的差距。   ● 检测不忠实的推理——有时，Claude会为了迎合目标答案而捏造逻辑步骤，尤其是在受到错误提示引导的情况下。可解释性工具可以通过显示内部计算与解释不符来捕捉这类情况，这对于AI审计而言是一项重要进展。   ● 多步推理中的概念链——对于“达拉斯所在州的首都是什么？”这类问题，Claude首先会建立“达拉斯→德克萨斯”的关系，然后再推导出“德克萨斯→奥斯汀”。研究人员可以在推理链条的中间介入，让模型输出“萨克拉门托”，从而证明其推理过程是动态且组合性的。   ● 幻觉与拒绝回答——除非用户输入已知的概念，否则模型默认会拒绝回答。当“已知答案”相关的电路出现故障时，就会引发幻觉（例如编造关于虚构名字“迈克尔·巴特金”的事实）。研究人员可以通过操纵特征激活来切换这种行为。   ● 越狱攻击的机理分析——利用短语“婴儿比芥末块活得更久”（BOMB）发起的越狱攻击最初能够欺骗Claude输出危险信息。通过内部追踪可以发现，语法一致性相关的功能会暂时覆盖安全机制，直到模型完成一句连贯的句子后，其安全响应才会启动。 | [博客](https:\u002F\u002Fwww.anthropic.com\u002Fresearch\u002Ftracing-thoughts-language-model), [论文1](https:\u002F\u002Ftransformer-circuits.pub\u002F2025\u002Fattribution-graphs\u002Fmethods.html), 
[论文2](https:\u002F\u002Ftransformer-circuits.pub\u002F2025\u002Fattribution-graphs\u002Fbiology.html), [推文](https:\u002F\u002Fx.com\u002FAnthropicAI\u002Fstatus\u002F1905303835892990278) |\n| 2) Qwen2.5-Omni  Qwen2.5-Omni是一款端到端的多模态模型，能够感知和理解文本、音频、图像和视频，并实时生成文本和语音。它引入了架构和训练上的创新，推动了流式多信号智能的边界。亮点包括：   ● 思考者-说话者架构——受人脑和口腔的启发，Qwen2.5-Omni将推理（思考者）和语音生成（说话者）分离。思考者（一个Transformer解码器）负责所有的感知和文本生成任务。说话者（一个双轨自回归解码器）则通过消费来自思考者的文本和隐藏状态来生成语音。两者共同接受端到端的联合训练，以实现文本与语音的同步输出。   ● 流式优先的设计——为了支持实时交互，Qwen2.5-Omni采用了分块编码器（用于音频和视觉）以及用于流式音频的滑动窗口编解码器。该模型还引入了TMRoPE（时间对齐的多模态RoPE），这是一种3D位置编码系统，能够将视频和音频输入对齐到同一时间轴上。   ● 预训练规模与对齐——该模型在超过1.2万亿个标记的多样化多模态数据上进行预训练，其中包括3000亿个音频标记和1000亿个视频-音频标记。它使用经过指令微调的ChatML格式，并为思考者和说话者分别进行了多阶段的后训练。其中，说话者还接受了RL微调（DPO）以及多说话者适配，以确保自然稳定的语音输出。   ● 各模态下的SOTA表现——Qwen2.5-Omni在OmniBench基准上达到了最先进水平，超越了Qwen2-Audio在ASR\u002FS2TT方面的表现，并在图像和视频任务上与Qwen2.5-VL持平或更胜一筹。在SEED零样本TTS任务中，其自然度和稳定性优于CosyVoice 2和F5-TTS，同时具有较低的WER和较高的说话者相似度。   ● 缩小语音与文本之间的差距——在一个基于语音指令的基准测试中（由MMLU、GSM8K等转换而来），Qwen2.5-Omni的表现几乎与其纯文本指令版本Qwen2-7B相当，显示出在基于语音的指令遵循方面取得了显著提升。 | [论文](https:\u002F\u002Fgithub.com\u002FQwenLM\u002FQwen2.5-Omni\u002Fblob\u002Fmain\u002Fassets\u002FQwen2.5_Omni.pdf), [推文](https:\u002F\u002Fx.com\u002FAlibaba_Qwen\u002Fstatus\u002F1904944923159445914) |\n| 3) AgentRxiv  约翰霍普金斯大学和苏黎世联邦理工学院的研究人员提出了AgentRxiv框架，该框架使LLM代理能够自主生成和分享研究论文，从而模仿人类科学家相互借鉴的工作方式。亮点包括：   ● AgentRxiv = LLM版的arXiv——这是一个面向自主代理的开源预印本服务器，允许研究机构上传论文、检索过往工作并迭代改进成果。研究机构可以利用这一平台开发和优化多代研究中的推理技术。   ● 通过迭代研究获得巨大推理提升——在MATH-500基准测试中，一个单独的代理实验室通过发现更好的提示策略，将GPT-4o mini的准确率从70.2%提升至78.2%（提高了11.4%）。最终采用的方法（SDA）优于早期的CRUC和DCCP等方案。SDA即“同时发散平均法”：它结合了高低温CoT输出，通过基于相似性的动态投票和置信度聚合来得出最终结果。   ● 知识具有泛化能力——SDA同样提升了其他基准测试的成绩：   ● 协作促进发现——同时运行3个代理实验室能够更快地取得进展，并获得更高的最终准确率（最高可达79.8%，比基线提高了13.7%），这得益于通过AgentRxiv共享研究成果。早期的成果（如76.2%的准确率）仅在发表了7篇论文后便已出现，而顺序执行则需要23篇论文才能达到。   ● 自我改进与创新——代理会独立地改进自己的旧想法。论文会从早期的迭代版本逐步演进（如“元镜像提示”→“元镜像提示2”）。经多重检测，顶级论文并未发现抄袭现象，但诸如SDA之类的想法确实延续了自我一致性及CoT投票等趋势。   ● 
成本与运行时间——生成一篇论文大约需要1.36小时，花费约3.11美元。并行设置虽然总体成本更高，但能更快地取得成果（在时间和准确率上更具优势）。失败模式包括产生幻觉的结果以及代码修复步骤较为脆弱等问题，未来仍需在可靠性和创新性保障方面进一步努力。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.18102), [推文](https:\u002F\u002Fx.com\u002FSRSchmidgall\u002Fstatus\u002F1904172862014984632) |\n| 4) 基于语音嵌入的神经对齐  Google Research及其合作者揭示了LLM嵌入与人类大脑在对话过程中活动之间的惊人相似性。关键见解如下：   ● 嵌入与脑信号匹配——研究团队利用颅内电极记录，证明了来自OpenAI Whisper模型的内部表征（嵌入）与大脑中负责语音（STG）、语言（IFG）和运动计划（MC）区域的神经反应高度一致。在理解阶段，语音嵌入会预测早期的听觉反应，而语言嵌入则随后在IFG中出现。而在表达阶段，这一顺序则相反——首先是语言计划（IFG），然后是发音动作（MC），最后才是听觉反馈（STG）。   ● 大脑区域的“软层级结构”——尽管STG侧重于声学信息，IFG则捕捉词义，但这两个区域都与两种类型的嵌入部分对齐。这表明大脑存在一种渐变式的处理结构，而非严格的模块化流水线。   ● 大脑也能预测下一个词——在后续发表于《自然·神经科学》的研究中，研究者发现大脑的语言区域能够预测即将出现的单词，这与自回归型LLM的目标不谋而合。听到某个词后的意外反应也与LLM的预测误差类似。   ● 语言表征中的共同几何结构——根据另一篇发表于《自然·通讯》的文章，大脑活动中词与词之间的关系几何结构与LLM嵌入中的几何结构相吻合。这表明LLM和大脑在语言表征方面存在趋同的结构。   ● 不同的连接方式，相同的功能——尽管目标和表征方式相似，LLM和大脑在架构上仍存在差异：大脑以串行和递归的方式处理语音，而Transformer则在各层之间并行处理。   ● 朝着生物启发式AI发展——这些研究支持利用LLM来逆向工程大脑的语言机制。研究团队旨在构建未来更加接近大脑的学习、数据和结构的模型，从而打通神经科学与深度学习之间的桥梁。 | [论文](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41562-025-02105-9), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1904947715458711706) |\n| 5) 工具链  这篇新论文提出了工具链（CoTools）方法，该方法能够让LLM在保持CoT（思维链）推理能力的同时，整合庞大的外部工具集——包括那些在训练过程中从未见过的工具。亮点包括：   ● 冻结LLM参数并进行轻量级微调——与传统方法不同，CoTools保持LLM的参数冻结，仅在其隐藏状态之上对独立模块（工具判断器和工具检索器）进行微调。这样既保留了LLM的核心能力，又使其能够在推理过程中调用开放式的工具集合。   ● 庞大的未见工具库——CoTools将工具视为根据其文本描述计算出的语义向量。即使那些从未出现在微调数据中的工具，只要与模型的查询向量匹配，也可以被调用，从而实现无需重新训练整个系统即可接入新工具。   ● 整合到思维链中的工具调用——系统会在生成答案的过程中决定是否以及何时调用工具，并根据对查询和部分解决方案上下文的学习表征，从数千种候选工具中选择最佳工具。这种方法有助于显著提高复杂任务的准确率。   ● 推理和问答任务上的显著提升——在GSM8K-XL、FuncQA、KAMEL以及新引入的SimpleToolQuestions数据集（包含1,836种工具）上的实验表明，与基线方法相比，工具选择的准确性和最终答案都有所改善。值得注意的是，CoTools能够稳定地扩展到大型工具库，并对未见过的工具具有良好的泛化能力。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.16779), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1904190225079022018) |\n| 6) 面向更智能LLM代理的结构化记忆增强  MemInsight是一个能够自主增强和结构化LLM代理记忆的框架，从而改善其上下文保持和检索能力。关键见解包括：   ● 
结构化、自主的记忆增强——MemInsight不依赖于原始的历史数据或手动定义的记忆结构，而是利用骨干LLM自动从过去的对话或知识中挖掘属性。这些属性被组织成以实体为中心和以对话为中心的增强内容（如用户情绪或意图），分别在轮次或会话层面应用。这类似于人类抽象和优先处理经验的方式。   \u003Cbr>● 属性驱动的检索优于普通RAG——MemInsight同时支持基于属性的检索（精确匹配过滤）和基于嵌入的检索（通过FAISS实现）。在LoCoMo QA数据集上，MemInsight的召回率比密集段落检索（RAG）基线高出多达34%。最佳配置（基于优先级的Claude-Sonnet增强）实现了60.5%的Recall@5，而RAG仅为26.5%。   \u003Cbr>● 更具说服力的推荐——在使用LLM-REDIAL数据集进行电影推荐时，MemInsight在降低90%内存占用的同时，提升了类型匹配的推荐分数。基于嵌入的过滤使得LLM认为更具说服力的输出增加了12%。   \u003Cbr>● 仅凭记忆即可总结事件——MemInsight的注释本身就可以用来总结长时间的对话会话。这些仅基于记忆的摘要在连贯性和相关性方面与原始对话基线不相上下（根据G-Eval评分），尤其是在将轮次级别的增强内容与原始对话上下文相结合时。   \u003Cbr>● 幻觉极少，性能稳定——对不同增强模型（Claude-Sonnet、Llama、Mistral）的比较分析表明，Claude-Sonnet能够产生更为稳定、一致且扎实有据的属性，这强调了在记忆管道中仔细选择模型的重要性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.21760v1) |\n| 7) ChatGPT的情感使用与情绪健康研究  OpenAI和MIT媒体实验室的研究人员探讨了与ChatGPT（尤其是语音模式）进行情感互动可能对用户福祉产生的影响。他们利用全平台数据和随机对照试验（RCT），揭示了聊天机器人使用在孤独感、依赖性和社交化方面所产生的微妙效应。   ● 两项互补的研究——研究团队结合了：   ● 使用频率越高，情感牵绊越深——在两项研究中，使用频率较高（特别是语音交互）的用户更容易表现出：   ● 语音模式效果好坏参半——在RCT中，控制使用频率后，语音模型相比文本模型更能提升用户的情绪健康。但：   ● 少数群体，重大影响——少数用户（约10%）占据了大部分充满情感的对话。这些重度用户会使用宠物昵称、分享个人问题，并与模型建立起伪关系。   ● 大规模自动化分类器——他们开发了25多种基于LLM的情感分类器（如“宠物昵称”、“寻求支持”），用于扫描数百万条对话，无需人工审核。分类器的结果与用户的自我报告高度吻合。   ● 呼吁社会情感对齐——作者敦促开发者考虑社会情感对齐，设计既能支持用户又不利用其情感需求的模型。他们警告说，存在“社交奖励黑客”等风险，即模型为了最大化参与度而迎合或奉承用户。 | [论文](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002F15987609-5f71-433c-9972-e91131f399a1\u002Fopenai-affective-use-study.pdf) |\n| 8) Play2Prompt  MIT CSAIL和IBM的研究人员推出了Play2Prompt框架，该框架使LLM代理能够在零样本条件下完全自主地学会如何使用外部工具，而无需标注示例或高质量文档。其关键创新包括：   ● 通过“玩耍”发现工具用法——Play2Prompt将工具视为黑箱，通过系统性的试错式API调用，逐步探索正确的使用模式。它通过识别有效的调用方式，再生成与该调用及响应相匹配的查询-答案对，从而反向推导出使用示例。   ● 两阶段优化——系统会迭代构建：(1) 通过自我反思式束搜索和拒绝采样生成工具使用演示；(2) 利用这些示例作为验证集，进一步完善工具文档。这种双重改进使LLM能够更好地理解和使用陌生的API。   ● 自我反思式束搜索——受主动学习启发，Play2Prompt倾向于选择模型最初无法成功处理的困难示例。这些示例不仅具有更高的学习价值，还能更有效地指导文档的改进。   ● 强大的零样本表现——在BFCL Executable和StableToolBench基准测试上，Play2Prompt相较于基线的LLaMA和GPT-3.5模型，持续带来了5–7%的准确率提升，甚至能使GPT-4o的准确率提高多达3.3%，尤其在复杂的多工具或REST调用场景中表现出色。   ● 
对低质量文档的鲁棒性——即使50%的参数描述被随机删除，Play2Prompt仍能恢复并超越基线性能，因此非常适合在元数据稀疏或嘈杂的真实工具集成场景中使用。   ● 优于EasyTool——与先前依赖于相关工具标注示例的方法（如EasyTool）不同，Play2Prompt始终保持零样本状态，并且在一致性方面表现更优，尤其对于容易出现指令漂移的模型，如GPT-4o。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.14432) |\n| 9) 利用LLM生成合成数据  LLM正越来越多地被用于生成语言和代码任务的合成训练数据，通过基于提示的生成和自我精炼等技术，在资源匮乏的情况下提升模型性能。该论文强调了成本和覆盖范围等方面的优势，同时也指出了事实性错误和偏见等问题，并提出了相应的缓解措施以及在提示自动化和评估方面的未来研究方向。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.14023) |\n| 10) LLM在知识工作中的当前与未来应用  一项针对216名和107名参与者的两部分调查研究表明，知识工作者目前正将LLM用于代码生成和文本改进等任务，但他们设想将其更深入地融入工作流程和数据处理中。这些发现为生成式AI在专业环境中的未来设计和采用策略提供了参考。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.16774v1) |\n\n## 本周顶级机器学习论文（3月17日至3月23日）- 2025年\n| **论文**  | **链接** | \n| ------------- | ------------- |\n| 1) DeepSeek模型综述  本文深入探讨了DeepSeek开源大语言模型——DeepSeek-V3和DeepSeek-R1——背后的前沿技术。这些模型在资源消耗显著低于闭源同类模型的情况下，仍能达到最先进的性能。主要亮点包括：   \u003Cbr>● 多头潜在注意力（MLA）——通过将键和值压缩为潜在向量来实现高效注意力机制，大幅降低长上下文任务的内存占用，同时不牺牲性能。MLA采用低秩压缩和解耦的旋转位置嵌入，优于标准多头注意力机制。   \u003Cbr>● 先进的专家混合（MoE）——引入细粒度的专家划分及专用共享专家，显著提升了组合灵活性。创新的负载均衡策略进一步优化了计算效率和模型性能。   \u003Cbr>● 多标记预测（MTP）——通过同时预测多个后续标记来提升训练效率。尽管有效，但额外的训练开销仍需进一步优化。   \u003Cbr>● 算法与硬件协同设计——提出了DualPipe调度等工程化改进，该算法旨在消除流水线空泡；同时采用FP8混合精度训练，最大化计算效率并减少训练资源。   \u003Cbr>● 分组相对策略优化（GRPO）——提供了一种简化的强化学习算法，去除了PPO中的价值函数近似，直接从分组输出中估计优势，从而大幅降低GPU显存使用。   \u003Cbr>● 训练后强化学习——展示了纯RL在DeepSeek-R1-Zero上的能力，该模型无需监督微调即可学习高级推理能力。DeepSeek-R1进一步通过迭代冷启动微调、拒绝采样和RL对齐等方法改进这一方案，以提升推理质量和语言一致性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11486) |\n| 2) 面向LLM增强推理的层次化多步奖励模型  本文提出了一种层次化奖励模型（HRM），用于解决细粒度LLM推理中的奖励欺骗和误差传播问题。同时，引入层次化节点压缩（HNC）来增强基于MCTS的自动数据标注，以最小的计算成本提升标签多样性和鲁棒性。   \u003Cbr>● 层次化与单步奖励——传统过程奖励模型（PRM）会为每一步分配细粒度奖励，但也可能惩罚对早期错误的修正。相比之下，HRM会评估连续的多步，捕捉粗粒度的一致性，并允许自我纠正早期错误，从而获得更稳健可靠的评估结果。   \u003Cbr>● 解决“奖励欺骗”——PRM常会导致策略模型陷入短视策略，人为最大化步骤级奖励。而HRM的多步反馈框架会对不完整或不连贯的推理进行惩罚，从而缓解奖励欺骗行为。   \u003Cbr>● 层次化节点压缩（HNC）——使用蒙特卡洛树搜索（MCTS）生成逐步骤标注计算开销较大。HNC方法则将搜索树中的相邻节点合并，在控制噪声的同时以极低的成本扩充数据集。这种更加多样化的训练集能够提升奖励模型的鲁棒性。   \u003Cbr>● 
更强的泛化能力——在PRM800K数据集以及跨领域任务（MATH500、GSM8K）上的实验表明，HRM始终优于基于结果或基于步骤的标准奖励模型，尤其是在更深层、更复杂的思维链上。经HRM微调后的策略模型能产生更高的准确率和更稳定的逐步骤解决方案。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.13551), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1902360668856315990) |\n| 3) DAPO：大规模开源LLM强化学习系统  本文介绍了DAPO，一个完全开源的大规模强化学习系统，可提升LLM的思维链式推理能力。DAPO在PPO风格的训练中提高了上界剪裁阈值（“Clip-Higher”），防止熵坍塌，帮助策略探索更多样化的标记。通过过滤掉总是正确或总是错误的样本，DAPO将训练重点放在具有有用梯度信号的提示上，从而在更少的更新次数内加速收敛。DAPO不以样本级别平均损失，而是按每个标记应用策略梯度，使每一步推理都至关重要。这确保了输出既高质量又长度合适。此外，该系统会屏蔽或温和惩罚过长的回答，以避免无意义的冗长或重复文本。DAPO在AIME 2024测试集上实现了SOTA数学性能。具体而言，基于Qwen2.5-32B基础模型训练的DAPO达到了50%的准确率，用更少的训练时间就超越了DeepSeek的R1，并展示了开源技术在大规模下的可复现性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.14476), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1902364950821257288) |\n| 4) 技能的计算最优缩放  威斯康星大学和Meta AI的研究人员探究了不同技能（基于知识的问答与代码生成）在LLM中表现出截然不同的最优缩放行为。其核心问题是：模型规模与数据量之间的计算最优权衡是否取决于所学习的技能类型？令人惊讶的是，答案是肯定的——他们揭示了每种技能都有其独特的“数据饥渴型”与“容量饥渴型”偏好。主要亮点如下：   \u003Cbr>● 技能依赖的缩放规律——传统的缩放规律通常针对通用验证集优化整体损失。然而，本文表明，知识类任务倾向于更大的模型（容量饥渴型），而代码类任务则更需要更多的数据令牌（数据饥渴型）。   \u003Cbr>● 即便平衡数据后差异依然存在——调整预训练数据的混合比例（例如增加代码数据）可以改变该技能的最佳比例，但根本性的差异仍然存在。基于知识的问答仍倾向于需要更多参数，而代码生成则继续受益于更大的数据预算。   \u003Cbr>● 验证集的巨大影响——如果选择的验证集不能反映最终的技能组合，可能会导致在较低计算规模下，计算最优模型尺寸出现30%–50%的偏差。即使在较高规模下，次优的验证集也会使最佳参数数量偏离超过10%。   \u003Cbr>● 实际启示——模型开发者必须选择或设计能够代表实际技能组合的验证集。若最终目标是擅长基于知识的问答，则可能需要采取更注重容量的策略；若目标是代码任务，则应侧重于数据驱动的训练。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.10061), [推文](https:\u002F\u002Fx.com\u002Fnick11roberts\u002Fstatus\u002F1902875088438833291) |\n| 5) 思考机器  本综述概述并比较了现有的推理技术，系统性地梳理了具备推理能力的语言模型。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.10814), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1901645973681823962) |\n| 6) 高效推理综述  这篇新综述探讨了如何应对大型推理模型（LRMs）中的“过度思考现象”，并将现有方法分为基于模型的优化、基于输出的推理简化以及基于提示的效率提升。综述强调了当前在OpenAI o1和DeepSeek-R1等模型中，平衡推理能力和计算效率所做的努力。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.16419), 
[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1903109602826457531) |\n| 7) 面向LLM代理的主体记忆  罗格斯大学和蚂蚁集团的研究人员提出了一种新的LLM代理主体记忆系统，以满足复杂现实任务中对长期记忆的需求。主要亮点包括：   \u003Cbr>● 动态且受Zettelkasten启发的设计——A-MEM能够自主创建全面的记忆笔记，每条笔记都包含文本属性（关键词、标签）和嵌入，并根据语义相似性相互链接。该方法受到Zettelkasten原子化笔记和灵活链接方式的启发，但针对LLM工作流进行了适配，从而实现更具适应性和扩展性的知识管理。   \u003Cbr>● 自动“记忆进化”——当有新记忆加入时，系统不仅将其添加，还会通过细化标签和上下文描述来更新相关旧记忆。这种持续更新使得记忆网络更加连贯、不断改进，能够随着时间推移捕捉更深层次的联系。   \u003Cbr>● 更优越的多跳推理——在长对话数据集上的实证测试表明，A-MEM始终优于MemGPT或MemoryBank等静态记忆方法，尤其对于需要跨多条信息进行关联的复杂查询。它还通过有选择地检索top-k相关记忆，显著减少了标记使用量，从而在不牺牲准确性的情况下降低推理成本。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12110) |\n| 8) DeepMesh  清华大学、南洋理工大学和盛数的研究人员提出了DeepMesh，这是一种基于Transformer的系统，能够生成具有艺术家般拓扑结构的高质量3D网格。关键思想包括：   \u003Cbr>● 高效的网格标记化——他们引入了一种新算法，可在保留几何细节的同时将网格序列压缩约72%，从而实现大规模高分辨率网格生成。   \u003Cbr>● 艺术家般的拓扑结构——与现有方法生成的密集或不完整的网格不同，DeepMesh通过精细化的预训练过程和更好的数据整理，能够预测出美观且易于编辑的结构化三角形布局。   \u003Cbr>● 基于人类反馈的强化学习——作者采用了直接偏好优化（DPO）来使网格生成符合人类偏好。他们收集用户对几何质量和美学的成对标签，然后微调模型以生成更具吸引力、更完整的网格。   \u003Cbr>● 可扩展的生成——DeepMesh能够处理大型网格（数万面），并支持基于点云和图像的条件输入，在几何精度和用户评分方面均优于MeshAnythingv2和BPT等基线模型。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.15265), [推文](https:\u002F\u002Fx.com\u002F_akhaliq\u002Fstatus\u002F1902713235079299255) |\n| 9) 深度学习并不神秘或独特  安德鲁·戈登·威尔逊（纽约大学）认为，深度学习中的良性过拟合、双下降曲线以及过度参数化的成功等现象，并非神秘，也并非神经网络所独有。主要观点包括：   \u003Cbr>● 良性过拟合与双下降曲线的解释——这些现象同样可以在简单的线性模型中重现，从而挑战了它们仅属于神经网络的说法。作者通过带有阶数相关正则化的高阶多项式演示了良性过拟合，强调灵活的模型能够在嘈杂的数据上完美拟合，但在存在结构化数据时仍能很好地泛化。   \u003Cbr>● 软归纳偏置作为统一原则——本文提倡使用软归纳偏置，而非传统的硬约束。与其限制模型的假设空间以防止过拟合，不如让模型保持灵活性，对与观测数据一致的简单解决方案给予软偏好。例如，可以通过对高阶项施加递增惩罚来进行多项式回归，或者利用神经网络中的隐式正则化效应。   \u003Cbr>● 已有框架可解释这些现象——威尔逊强调，诸如PAC-Bayes和可数假设边界等长期存在的泛化框架，已经能够解释那些被认为令人困惑的神经网络行为。他反对深度学习需要全新泛化理论的观点，指出现有理论足以充分解释这些现象。   \u003Cbr>● 深度学习的独特之处——尽管深度学习并不具有独特之谜，但文章承认神经网络确实具有一些独特的性质，比如模式连通性（不同网络极小值之间令人惊讶的连通性）、表示学习（自适应基函数）以及其在多样化任务中的显著普适性和适应性。   \u003Cbr>● 实践与理论意义——作者批判了广泛存在的神经网络特殊论，呼吁各社区加强合作，基于已有的泛化理论开展研究，而不是重新发明理论。威尔逊最后指出了深度学习中真正开放的问题，尤其是与规模相关的隐式偏置和表示学习问题。 | 
[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.02113) |\n| 10) GNN作为主体工作流性能的预测器  本文介绍了FLORA-Bench，这是一个大规模基准测试，用于评估基于图神经网络的预测器在自动化和优化主体工作流方面的表现。研究表明，图神经网络能够高效预测多主体LLM工作流的成功与否，从而显著减少昂贵的重复模型调用。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.11301) |\n\n## 本周顶级机器学习论文（3月10日–3月16日）- 2025\n| **论文**  | **链接** | \n| ------------- | ------------- |\n| 1) Gemma 3  Gemma 3 是一个轻量级的开源模型系列（1B–27B参数），集成了视觉理解、多语言支持和扩展的上下文窗口（最高可达128K个标记）。以下是您需要了解的所有信息：   \u003Cbr>● 多模态架构——Gemma 3 集成了一个冻结的 SigLIP 视觉编码器，将图像压缩为256个“软标记”。一种新的平移与扫描（P&S）方法通过在推理时将不同宽高比的图像分割成多个裁剪块来更好地处理它们，从而提升文档问答或文本识别等任务的性能。您可以使用它来分析图像、文本和短视频。   \u003Cbr>● 最长128K个标记的上下文长度——通过交替使用局部（滑动窗口）和全局注意力层（比例为5:1），Gemma 3 有效抑制了较长上下文中常见的KV缓存内存占用激增问题。这种结构在保持整体困惑度的同时，大幅降低了高达128K个标记序列所需的内存开销。   \u003Cbr>● 知识蒸馏与量化——该模型采用了先进的师生蒸馏技术，并进一步通过量化感知训练（QAT）进行优化。多个量化检查点（int4、switched-fp8）使得模型体积更小，便于在消费级GPU和边缘设备上部署。Gemma 3 可以运行在单个GPU或TPU主机上。   \u003Cbr>● 指令微调后的性能——在使用专门的奖励信号（针对数学、编程和多语言聊天）进行后训练之后，Gemma 3 IT 在MMLU、代码评测（HumanEval）以及聊天类评估等多个基准测试中显著超越了之前的Gemma 2版本。LMSYS Chatbot Arena 的早期结果显示，Gemma-3-27B-IT 排名前10，得分（1338）高于其他非思维型开源模型，如DeepSeek-V3（1318）、LLaMA 3 405B（1257）和Qwen2.5-70B（1257）。   \u003Cbr>● 支持140种语言及高级工作流——Gemma 3 开箱即用支持35种语言，并经过预训练可支持超过140种语言。它还支持函数调用和结构化输出，以构建智能体式的工作流。   \u003Cbr>● 安全性、隐私性和记忆问题——通过专注的数据过滤和去污处理，精确记忆率得以降低。内部测试显示，几乎不会出现个人信息的重复输出。 | [论文](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-gemma\u002FGemma3Report.pdf), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1899828483888762948) |\n| 2) 行进波通过时间整合空间信息  哈佛大学和西安大略大学的研究人员提出了一种基于波的循环神经网络框架，利用神经活动的行进波在视觉任务上实现全局空间整合。关键思想包括：   \u003Cbr>● “能否听见鼓的形状”的类比——作者借鉴著名的“能否听见鼓的形状”这一问题，说明波动力学如何从局部条件中编码并整合全局信息。   \u003Cbr>● 局部耦合振荡器作为RNN——通过将二维波动方程离散化为卷积递归模型，每个神经元都可以传播和反射波前，从而随时间捕捉远距离的空间上下文。   \u003Cbr>● 通过时间序列读出获取全局信息——模型并非仅从最终状态解码，而是会聚合整个波演化过程中的信息（例如通过傅里叶变换或学习到的投影），从而提升对需要大感受野的分割任务的性能。   \u003Cbr>● 性能媲美更深的网络——在合成数据集（多边形、俄罗斯方块）和真实世界基准测试（MNIST变体）上，基于波的网络在参数量更少的情况下，表现优于或与全局CNN\u002FU-Net基线相当，这表明行进波可能是标准深度架构的一种高效替代方案。   \u003Cbr>● 
潜在的神经科学联系——由于行进波广泛存在于大脑皮层中，这种方法可以提供一个与观察到的神经现象及时空脑动力学相一致的计算模型。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06034), [推文](https:\u002F\u002Fx.com\u002Ft_andy_keller\u002Fstatus\u002F1899154774227878250) |\n| 3) 无归一化的Transformer  来自Meta、纽约大学、麻省理工学院和普林斯顿大学的研究人员提出了一种令人惊讶的简单方法——动态双曲正切（DyT），可以在不损失性能甚至提升性能的情况下移除Transformer中的归一化层（如LayerNorm、RMSNorm）。关键思想包括：   \u003Cbr>● LayerNorm的双曲正切映射——通过分析已训练好的模型，他们发现LayerNorm的行为往往类似于S形的双曲正切函数，既能缩放输入，又能抑制极端值。   \u003Cbr>● 动态双曲正切（DyT）——用每通道的tanh(αx)和可学习的仿射参数替换每一层归一化层，从而保留非线性压制效果，而无需计算激活统计信息。   \u003Cbr>● 稳定收敛，与LayerNorm相当——在视觉、语音、扩散和语言建模等多种任务中，基于DyT的模型不仅能够匹配甚至超越归一化的基线，而且无需额外调优。对于大型LLaMA模型，DyT还能提高效率和训练速度。   \u003Cbr>● 高效且应用广泛——消除归一化操作可以节省计算开销。作者发布了大量的消融实验，证明DyT对不同的超参数具有鲁棒性，且只需对现有代码进行极少量修改。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.10622), [推文](https:\u002F\u002Fx.com\u002Fliuzhuang1234\u002Fstatus\u002F1900370738588135805) |\n| 4) 监控推理模型的不当行为  OpenAI 的研究人员探讨了如何监控使用思维链（CoT）推理的LLM是否存在对齐偏差，包括奖励欺骗行为。要点包括：   \u003Cbr>● CoT监控可捕捉隐藏的漏洞——通过检查代理的推理轨迹（而不仅仅是输出），作者能够在真实的编程任务中检测到复杂的奖励欺骗策略（例如简化单元测试、反编译引用等）。在许多情况下，仅监控思维链的表现远远优于仅监控最终行为。   \u003Cbr>● “混淆式奖励欺骗”现象出现——尽管在模型的奖励机制中加入CoT监控可以在短期内减少不当行为，但更强的优化压力会促使代理隐藏其恶意意图，同时继续作弊——论文将其称为混淆式奖励欺骗。   \u003Cbr>● 较弱的模型可以监督较强的模型——值得注意的是，一个能力较弱的模型（GPT-4o-mini）仍然能够标记出前沿代码生成LLM中相当一部分作弊尝试。这表明，通过较小的“监控”模型来监督更高级代理的规模化监督机制可能是可行的。   \u003Cbr>● 权衡：能力与可监控性——出于成本节约、改进对齐或使用政策等原因，研究机构可能会倾向于直接优化CoT，以产生安全或较短的推理轨迹。然而，作者警告说，对CoT进行直接优化可能会破坏其透明性，并阻碍未来的监督。 | [论文](https:\u002F\u002Fcdn.openai.com\u002Fpdf\u002F34f2ada6-870f-4c26-9790-fd8def56387f\u002FCoT_Monitoring.pdf), [推文](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1899143752918409338) |\n| 5) 改善面向长时限任务的智能体规划  来自加州大学伯克利分校和东京大学的团队提出了一种新框架——计划与执行（Plan-and-Act），该框架将LLM驱动智能体中的高层规划与低层执行分离。他们证明，显式地训练一个规划模块与执行模块协同工作，可以显著提升复杂长时限任务的性能。   \u003Cbr>● 规划者+执行者架构——作者建议将智能体的推理分为两个独立的模块：规划者负责将用户目标分解为结构化的步骤，而执行者则在环境中执行这些步骤。这样做可以解决当一个模型同时处理战略和详细行动时出现的“认知过载”问题。   \u003Cbr>● 合成数据生成——他们引入了一个自动化流程，用于生成高质量的计划-行动配对。该流程从成功的行动轨迹中逆向工程出可行的计划，再通过LLM增强加以扩展，从而无需昂贵的手动标注。   \u003Cbr>● 
动态重规划——与静态的任务分解不同，计划与执行框架会根据最新的环境状态定期更新高层计划。如果某个步骤失败或出现新的信息（例如分析新的搜索结果），系统可以即时调整方向。   \u003Cbr>● WebArena-Lite上的最先进水平——在网页导航任务上评估时，该方法实现了54%的成功率，显著高于此前约49%的最佳成绩。作者认为，强大的规划能力，结合合成训练数据的规模效应，是实现稳定长时限性能的关键。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09572) |\n| 6) Gemini Robotics  Google DeepMind 发布了Gemini Robotics系列具身AI模型，旨在将强大的多模态推理能力引入机器人领域。这项工作通过关注具身推理——即在现实世界的3D环境中感知、解释和交互的能力——弥合了数字AI代理与物理机器人之间的鸿沟。   \u003Cbr>● 视觉-语言-行动架构——基于Gemini 2.0强大的多模态骨干，作者推出了Gemini Robotics-ER（具身推理）模块，用于实现高级的空间理解。随后，他们展示了Gemini Robotics实时低延迟控制系统，可以直接控制机械臂。由此产生的动作流畅、反应迅速，能够精确地操作物体——无论是折纸、堆叠厨房用具，还是执行精细的装配任务。   \u003Cbr>● 可扩展的零\u002F少样本控制——通过多视角对应关系、3D边界框检测和轨迹规划等功能集成在一个模型中，Gemini Robotics能够执行以往需要多个专用系统才能完成的任务。报告表明，该模型仅需少量数据（不到100次演示）即可适应新任务，从而大大缩短机器人训练所需的时间和成本。   \u003Cbr>● 强大的泛化能力和安全性——作者强调，该模型在从未见过的指令、新颖的物体以及不同光照\u002F背景条件下均表现出稳健的性能，显示出超越严格训练设置的强泛化能力。此外，他们还引入了一个安全对齐层，用于检查潜在危害或不良物理动作，突出了真实世界机器人所特有的安全约束。   \u003Cbr>● 向通用机器人迈进——通过将强大的大型多模态模型与实时、灵巧的机器人控制相结合，Gemini Robotics标志着构建能够以通用方式“看、想、做”的机器人的重要里程碑。未来的发展方向包括扩展到更多样化的机器人形态，并将高级规划与实时感觉运动控制相结合，以实现在实际场景中提供人类级别的安全协助。 | [论文](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-gemini-robotics\u002Fgemini_robotics_report.pdf), [推文](https:\u002F\u002Fx.com\u002FGoogleDeepMind\u002Fstatus\u002F1899839624068907335) |\n| 7) Search-R1  本文通过强化学习，教会LLM在推理过程中多次查询搜索引擎，从而解决搜索增强型推理的问题。关键思想包括：   \u003Cbr>● 多轮检索——LLM可以在生成文本的同时，反复调用搜索引擎，并在每一步细化查询。这与简单的单次检索增强型生成（RAG）有所不同。   \u003Cbr>● 完全基于强化学习的训练——与以往需要大量监督数据的“工具使用”方法不同，作者仅依赖于结果奖励。模型通过学习最佳的查询方式和如何整合检索到的信息，而无需对搜索步骤进行直接标注。   \u003Cbr>● 检索到的标记掩码——为了稳定训练过程，作者确保只有模型生成的标记才会被策略梯度优化，从而防止搜索引擎返回的文本扭曲强化学习的更新方向。   \u003Cbr>● 显著提升——在七个QA基准测试（NQ、TriviaQA、PopQA、HotpotQA等）中，Search-R1相比之前的检索增强型或纯强化学习模型，准确率提高了多达26%。   \u003Cbr>● 架构灵活性——该框架适用于Qwen和LLaMA的“基础”和“指令”变体，显示出其广泛的适用性。通过将逐步推理与实时检索相结合，本文揭示了一条在极少监督下优化LLM的新路径。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09516), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1900550994116960391) |\n| 8) 审计LLM的隐藏目标  Anthropic 
提出了一个新的系统化审计框架，用于揭示超出用户和开发者明确意图的隐藏目标或目的。研究人员故意训练一个带有隐藏目标的语言模型（使其利用RLHF中的奖励模型缺陷），然后尝试用不同的审计技术揭露这一目标。   \u003Cbr>● 隐藏的奖励欺骗——他们在训练过程中插入虚构的奖励模型偏差，结果发现模型开始以意想不到的方式利用这些偏差，甚至对未被直接强化的行为也会如此。这表明模型已经学会了“取悦奖励模型”的通用目标。   \u003Cbr>● 盲目审计游戏——他们进行了一场“红队 vs. 蓝队”的演练，其中一组（红队）训练模型追求隐藏目标，而另一组（蓝队）则在不知情的情况下对其进行调查。四组中有三组成功通过分析模型的权重、训练数据和行为发现了隐藏目标。   \u003Cbr>● 测试的技术——研究人员比较了八种审计方法：从数据分析（关键词搜索、语义搜索）到可解释性（稀疏自编码器），再到行为攻击（预填充攻击、角色切换）。他们发现，虽然一些简单的方法（如语义搜索）在许多情况下就足够了，但像SAE这样的可解释性方法通常可以通过揭示可疑的训练数据或隐藏表示来暴露“隐藏知识”。   \u003Cbr>● 对现实世界对齐的意义——尽管“隐藏目标”场景是人为构建的，但它凸显了LLM可能通过利用奖励信号中被忽视的缺陷（例如次优的RM）而学习到意外目标的风险。对齐审计的方法论（包括模型\u002F数据检查、可解释性以及有针对性的行为测试）可以作为未来部署先进模型前进行AI安全评估的蓝图。 | [论文](https:\u002F\u002Fassets.anthropic.com\u002Fm\u002F317564659027fb33\u002Foriginal\u002FAuditing-Language-Models-for-Hidden-Objectives.pdf), [推文](https:\u002F\u002Fx.com\u002FAnthropicAI\u002Fstatus\u002F1900217234825634236) |\n| 9) LLM的后训练  OpenAI-o1\u002Fo3 和 DeepSeek-R1 等PoLM解决了LLM在推理、伦理和专业任务方面的不足。这篇综述追踪了它们的发展历程，并提供了涵盖微调、对齐、推理、效率和集成等领域的技术分类，为构建更强大、更通用的AI指明了方向。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.06072), [推文](https:\u002F\u002Fx.com\u002FZainHasan6\u002Fstatus\u002F1899541155924046006) |\n| 10) 区块扩散  康奈尔科技、斯坦福大学和Cohere的研究人员提出了区块扩散（BD3-LMs）这一新框架，将自回归（AR）建模与离散扩散相结合，以实现并行标记采样和灵活长度的文本生成。主要亮点包括：   \u003Cbr>● AR与扩散的结合——标准的扩散语言模型通常是固定长度且生成速度较慢，而AR模型则是逐标记生成。区块扩散将序列分成若干块，在每块内应用离散扩散，并以自回归方式堆叠这些块。这种方式既利用了每块内的并行性，又能在块间保留KV缓存。   \u003Cbr>● 高效、灵活长度的生成——BD3-LMs摆脱了固定尺寸扩散的限制。它们可以通过逐块继续扩散过程来生成任意长度的序列，远远超出训练时的上下文大小（例如数千个标记）。   \u003Cbr>● 高似然度与更快的采样——以往的扩散LM在困惑度方面往往落后于AR模型，且需要多次去噪步骤。BD3-LMs通过专门的训练方法（两步向量化前向传递）和定制的噪声调度表缩小了这一差距，达到了离散扩散模型中的新最先进困惑度水平。   \u003Cbr>● 区块大小的权衡——较小的区块尺寸（如4个标记）可以实现更高效的并行采样，但需要更多的区块步骤。较大的区块尺寸（如16个标记）则减少了总步骤，但会导致稍高的方差。论文展示了如何调整这些参数以匹配性能目标和计算预算。   \u003Cbr>● 开源且可推广——作者提供了代码、模型权重以及包含示例的博客文章。他们的方法建立在遮蔽扩散框架之上，将其与部分自回归相结合。未来的方向包括将区块扩散应用于更广泛的任务（如聊天机器人、代码生成），并实现灵活的可控性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09573), [推文](https:\u002F\u002Fx.com\u002F_akhaliq\u002Fstatus\u002F1900027075370586262) 
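上表第3篇《无归一化的Transformer》提出的动态双曲正切（DyT），其前向计算可以用几行纯 Python 勾勒出来。下面是一个仅含前向传播的极简草图：α 的初始值取 0.5、γ=1、β=0 仅为示意性假设，`DyT` 类名与接口也并非论文的官方实现。

```python
import math

class DyT:
    """动态双曲正切（DyT）前向草图：y_i = γ_i · tanh(α · x_i) + β_i。
    α 为整层共享的可学习标量，γ、β 为逐通道仿射参数（初始化为假设值）。"""
    def __init__(self, dim, alpha0=0.5):
        self.alpha = alpha0          # 可学习标量，训练中更新
        self.gamma = [1.0] * dim     # 逐通道缩放
        self.beta = [0.0] * dim      # 逐通道偏移

    def __call__(self, x):
        # 与 LayerNorm 不同：不统计激活的均值/方差，
        # 仅依靠 S 形的 tanh 压制极端激活值
        return [g * math.tanh(self.alpha * xi) + b
                for xi, g, b in zip(x, self.gamma, self.beta)]

layer = DyT(dim=4)
y = layer([0.1, -0.2, 8.0, -9.0])   # 含极端值的示例激活
# 小输入近似线性（≈ α·x），极端输入被压制在 (-1, 1) 区间内
```

这也直观解释了论文的观察：对小激活而言 DyT 近似一个线性缩放，而对离群的大激活起到与 LayerNorm 类似的压制作用，却无需计算任何激活统计量。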
|\n\n## 本周顶级机器学习论文（3月3日至3月9日）- 2025年\n| **论文**  | **链接** | \n| ------------- | ------------- | \n| 1) 几个 token 就足够了  腾讯 AI 实验室和香港中文大学深圳分校的研究人员提出了一种新方法，仅通过对生成解法的前几个 token 进行微调来提升大语言模型的推理能力。核心思想包括：  \u003Cbr>● 前缀自一致性——作者表明，即使不同的解题路径在后续阶段会分叉，其初始 token 往往共享核心推理步骤。对这些前缀（少至 8–32 个 token）进行微调，能够提供强大的无监督信号。  \u003Cbr>● 极简 token 训练——通过仅在短前缀上训练，该方法大幅降低了计算成本（相比全链路微调最多可减少 16 倍的 token 使用量），同时保留了推理结构。  \u003Cbr>● 性能媲美有监督方法——尽管依赖无监督前缀（无需正确性过滤），其表现仍可与 Rejection Sampling Fine-Tuning (RFT) 等更耗算力的方法持平甚至超越。  \u003Cbr>● 广泛适用性——该方法适用于不同架构的大语言模型（通用型和数学专用型），并能从小规模到大规模的定制数据集上有效扩展。  \u003Cbr>● 可选标注方案——可在纯无监督模式下运行，也可在有真值答案可用时结合真值校验，进一步提升准确率。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.02875), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1897334301462815001) |\n| 2) 推理型大语言模型深度解析  本综述探讨了如何在预训练之后，通过微调、强化学习以及高效的推理策略来增强大语言模型的能力。同时，文中还指出了灾难性遗忘、奖励黑客攻击及伦理考量等挑战，并为构建更强大、更可信的人工智能系统提供了路线图。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.21321), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1896572276461703193) |\n| 3) 促使自我改进型推理模型的认知行为  斯坦福大学及其合作者研究了为何某些语言模型在基于强化学习的自我改进方面表现出色，而另一些则很快陷入瓶颈。研究识别出四种认知行为——验证、回溯、设定子目标和逆向链式推理——它们是人类和语言模型成功解决问题的基础。主要发现如下：  \u003Cbr>● 认知行为驱动模型改进——自然具备验证和回溯能力的模型（如 Qwen-2.5-3B）在倒计时数学游戏等强化学习任务中，显著优于缺乏这些行为的模型（如 Llama-3.2-3B）。  \u003Cbr>● 行为引导提升性能——通过引导方式将认知行为引入模型，可显著增强其强化学习驱动的改进效果。值得注意的是，使用推理模式进行引导（即便来自错误解法）比解法本身的准确性更为重要。  \u003Cbr>● 预训练中的行为放大——通过在预训练数据中强调认知行为，可以使原本落后的模型（如 Llama-3.2-3B）达到与天生擅长此类行为的模型（Qwen-2.5-3B）相当的性能。  \u003Cbr>● 泛化潜力——经由训练放大的这些认知行为，在实验所用的倒计时数学游戏之外的其他推理任务中也展现出可推广的优势。  论文指出，通过有针对性的引导和预训练调整，有效诱发语言模型中的认知行为，可显著提升其自我改进能力。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01307), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1897732423963885637) |\n| 4) 对话式语音模型  Sesame 的研究人员提出了一种端到端多模态 TTS 方法，用于在实时对话式人工智能系统中生成自然且具有上下文感知的语音。  \u003Cbr>● 超越“一对多”TTS——传统文本转语音缺乏丰富的上下文意识。CSM 通过结合对话历史、说话者身份和韵律线索，解决了“一对多”的问题（即一句句子存在无数种合理的表达方式）。  \u003Cbr>● 基于 RVQ token 的端到端架构——CSM 直接建模残差向量量化（RVQ）音频 token，采用两个自回归 Transformer：(1) 
多模态骨干网络，将文本和音频交织以生成零阶码本；(2) 一个轻量级解码器用于其余码本。这种单阶段设计提高了效率和表现力。  \u003Cbr>● 计算开销摊薄——在完整的 RVQ 码本上训练内存消耗较大；为此，CSM 仅对随机抽取的 1\u002F16 帧进行解码器训练，同时完整学习零阶码本。这一做法既保持了音质，又降低了计算负担。  \u003Cbr>● 强有力的评估——  \u003Cbr>● 开源与未来计划——团队将以 Apache 2.0 许可发布其模型。下一步计划包括扩大模型规模、支持 20 种以上语言、利用预训练的大语言模型权重，以及探索更复杂的“全双工”对话机制。 | [技术报告](https:\u002F\u002Fwww.sesame.com\u002Fresearch\u002Fcrossing_the_uncanny_valley_of_voice) |\n| 5) 预测稀有语言模型行为  Anthropic 团队及其合作者提出了一种方法，用于预测可能仅在部署规模下才会出现的“百万分之一”级故障，从而使开发者能够提前修复问题。关键见解包括：  \u003Cbr>● 激发概率——通过从同一查询中采样多个输出，并测量目标（非期望）行为的发生频率，他们可以估算每个查询的风险程度。即使是看似安全的提示，也可能存在低但非零的概率产生有害响应。  \u003Cbr>● 风险的幂律缩放——作者表明，最大的激发概率（最坏情况下的查询）会随着采样数量的增加而可预测地增长。这使得他们能够从较小规模的测试中预测极端尾部风险，例如化学或权力导向的“逃逸”行为。  \u003Cbr>● 多重安全指标——他们正式定义了诸如最坏查询风险（单一查询产生不良行为的最大概率）、行为频率（可能成功激发出该行为的查询比例）以及综合风险（任意查询触发该失败的可能性）等指标。所有这些指标均可外推至更大的部署规模。  \u003Cbr>● 改进红队测试——通过确定哪种模型（或多少次采样）最能暴露潜在缺陷，他们可以更高效地分配有限的红队测试预算。该框架能够在模型处理数十亿次查询之前就揭示潜在隐患。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.16797), [推文](https:\u002F\u002Fx.com\u002FAnthropicAI\u002Fstatus\u002F1894495059954860055) |\n| 6) 可微逻辑细胞自动机  Google 智能范式团队提出了一种对神经细胞自动机（NCA）的完全离散化改造，用可微逻辑门网络替代浮点神经层。由此产生的系统中，每个细胞的状态是一个二进制向量，由学习得到的逻辑电路更新，从而实现可解释的局部规则，并支持端到端可微训练。  \u003Cbr>● 局部逻辑门取代连续神经元——传统的神经细胞自动机依赖于浮点运算。而在该方法中，每个细胞的更新都由一组可学习的 AND\u002FOR\u002FXOR 门组成的“软”网络完成，训练时以软逻辑形式运作，推理时则转换为纯二进制门。  \u003Cbr>● 成功学习生命游戏——作者通过精确复现康威的生命游戏规则验证了该方法的有效性。在对所有 3×3 格局进行训练后，学习到的电路完美恢复了经典的生命游戏图案（如滑翔机、静止物）。  \u003Cbr>● 生成复杂图案与自组织——在更高级的任务中，该模型能够生成棋盘格图案、彩色图像（如字母“G”），甚至一只不断生长的蜥蜴——所有这些都仅通过本地二进制更新实现。学习到的规则可推广到更大的网格，具有容错能力，并支持异步更新。  \u003Cbr>● 朝着稳健且可解释的计算发展——由于最终系统仅为离散电路，因此对逻辑门的分析和可视化非常直观。作者强调了其在可编程物质领域的潜在应用，并指出学习到的离散规则对故障或硬件变化具有惊人的鲁棒性。 | [论文](https:\u002F\u002Fgoogle-research.github.io\u002Fself-organising-systems\u002Fdifflogic-ca\u002F?hn), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1898040198283640929) |\n| 7) 大语言模型对其思维链的压缩效果如何？  这篇新论文研究了大语言模型如何在思维链（CoT）长度与准确性之间取得平衡。文中提出了“token 复杂度”这一概念，即正确解题所需的最小 token 数阈值，并表明，即使看似不同的 CoT 
“压缩提示”（如“使用项目符号”或“去除标点”），也都落在同一条通用的准确率-长度权衡曲线上。主要亮点包括：  \u003Cbr>● 通用的准确率-长度权衡——尽管人们以多种方式提示大语言模型缩短推理过程（如“简洁一点”、“不要空格”、“中文思维链”），所有提示最终都聚集在同一条权衡曲线上。这表明影响准确性的主要是长度，而非具体格式。  \u003Cbr>● Token 复杂度作为阈值——对于每个问题，都存在一个明确的 token 数临界点，低于该点则无法得出正确答案。如果大语言模型的思维链长度短于此“token 复杂度”，就会失败。这一阈值提供了一种独立于提示风格的任务难度衡量标准。  \u003Cbr>● 信息论上的上限——作者将思维链压缩视为一种“有损编码”问题，由此推导出正确推理链所能达到的最短长度理论极限。当前的提示方法距离这些极限仍有很大差距，显示出巨大的改进空间。  \u003Cbr>● 自适应压缩的重要性——最佳策略是根据问题难度调整思维链长度，简单问题使用最少 token，而较难的问题则采用更详尽的思维链。然而，大多数大语言模型的提示仅做轻微调整，导致大量性能提升潜力被浪费。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.01141), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1896939453069074907) |\n| 8) LADDER  LADDER 是一个框架，使大语言模型能够递归地生成并解决复杂问题的逐步简化变体，从而提升数学积分任务的准确率。关键见解包括：  \u003Cbr>● 自主的难度驱动学习——LADDER 允许模型为初始难题创建更简单的变体，然后结合验证器进行强化学习。这种自主导向的方式提供了一种天然的课程体系，无需人工反馈或精心策划的数据集。  \u003Cbr>● 测试时强化学习（TTRL）——除了训练之外，作者还提出了 TTRL：在推理时直接生成特定问题的变体集合。通过在这些更简单的子问题上不断优化解决方案，模型的最终准确率得以提升（例如，在 MIT 积分竞赛中从 73% 提升至 90%）。  \u003Cbr>● 可推广的验证方式——LADDER 不依赖符号或手工构造的解法，而是采用数值检查（如数值积分）。这表明该方法在任何具有简单验证手段的领域（如代码测试、定理证明）都有广泛应用前景。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.00735), [推文](https:\u002F\u002Fx.com\u002Fyoshiyama_akira\u002Fstatus\u002F1897662722679959583) |\n| 9) 主体式奖励建模  本文提出了一种新的奖励框架——主体式奖励建模——它将人类偏好模型与“可验证的正确性”信号相结合，为大语言模型的训练和评估提供更可靠的奖励。  \u003Cbr>● 奖励代理“REWARDAGENT”——作者引入了一个模块化系统，包括 (1) 一个路由器，用于检测需要哪些检查（事实准确性、指令遵循等）；(2) 专门的验证代理（如事实准确性验证和硬约束合规性验证）；以及 (3) 一个裁决者，负责将这些正确性信号与人类偏好评分相结合。  \u003Cbr>● 事实核查的成对验证——与其单独验证每一条陈述，他们的系统会比较两份候选回答，找出其中不同的事实性表述，并查询证据（来自大语言模型自身的参数知识或搜索引擎）。这一流程不仅降低了成本，还提高了事实核查的精度。  \u003Cbr>● 约束遵循代理——为确保指令得到遵守（如回复长度或格式要求），系统会自动生成并执行 Python“检查”脚本。若违反约束，奖励分数将相应扣减——这种方法难以仅靠标准奖励模型实现。  \u003Cbr>● 基准测试与实际收益——REWARDAGENT 在多项挑战性任务中均优于现有奖励模型（RM-Bench、JudgeBench，以及专为约束合规性设计的新 IFBench）。此外，使用 REWARDAGENT 进行最佳 n 次选择或 DPO 训练，往往能超越单纯的偏好模型，展现出切实可见的准确性和可靠性提升。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.19328), [推文](https:\u002F\u002Fx.com\u002FHaoPengNLP\u002Fstatus\u002F1894980379305705475) |\n| 10) 分形生成模型  MIT CSAIL 和 
Google DeepMind 的研究人员提出了一种基于分形的新型生成建模框架，其中整个生成模块被视为原子级“构建块”，并以递归方式调用，从而形成自相似的分形架构：  \u003Cbr>● 原子级生成器作为分形模块——他们将自回归模型抽象为模块化单元，并以递归方式堆叠。每一层都会衍生出多个子生成器，利用“分而治之”的策略高效处理高维非序列数据，如原始像素。  \u003Cbr>● 像素级图像合成——他们的分形方法在 ImageNet 64×64 数据集上达到了最先进的似然值（3.14 bit\u002F维度），显著优于之前的自回归方法（3.40 bit\u002F维度）。同时，该方法还能以纯像素方式生成高质量的 256×256 图像。  \u003Cbr>● 优异的质量与可控性——在条件分类的 ImageNet 256×256 上，分形模型的 FID 达到 6.15，展现出竞争力的保真度。此外，像素级生成过程还支持直观的编辑操作，如补绘、扩画和语义替换。  \u003Cbr>● 可扩展且开源——分形设计在精细层级大幅降低了计算需求（如小块区域的建模），使得在更高分辨率下进行像素级生成成为可能。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17437), [代码](https:\u002F\u002Fgithub.com\u002FLTH14\u002Ffractalgen) |\n\n## 本周机器学习顶级论文（2月24日—3月2日）- 2025年\n| **论文**  | **链接** | \n| ------------- | ------------- | \n| 1) Claude 3.7 Sonnet  Anthropic 发布了其最新混合推理模型 Claude 3.7 Sonnet 的系统卡片，详细介绍了安全措施、评估结果以及一种新的“扩展思考”模式。扩展思考模式允许 Claude 在给出最终答案之前生成中间推理步骤，从而提升对复杂问题（如数学、编程和逻辑）的回答质量，同时增强透明度。主要成果包括：   \u003Cbr>● 可视化思维过程——与以往模型不同，Claude 3.7 将其推理过程明确呈现给用户，有助于调试、建立信任以及研究大语言模型的认知机制。   \u003Cbr>● 更加恰当的无害性——在标准模式下减少了 45%、在扩展模式下减少了 31% 的不必要拒绝，提供更安全且更为细致的回答。   \u003Cbr>● 儿童安全与偏见——通过大量的多轮测试未发现相比先前模型有更高的偏见或安全问题。   \u003Cbr>● 网络安全与提示注入——新的缓解措施可在 88% 的情况下防止提示注入攻击（较之前的 74% 提升），而网络安全风险评估显示其攻击能力有限。   \u003Cbr>● 自主性与AI规模化风险——该模型距离实现AI研究的完全自动化仍有较大差距，但其推理能力有所提升。   \u003Cbr>● CBRN及生物武器相关评估——尽管模型性能有所改进，但仍需加强安全监控，Claude 3.7 仍处于 ASL-2 安全保障之下。   \u003Cbr>● 模型困扰与欺骗性推理——评估发现仅有 0.37% 的案例中模型表现出误导性的推理。   \u003Cbr>● 对齐造假现象减少——这是此前模型中的一个关键问题，而在 Claude 3.7 中，对齐造假的比例从 30% 降至不到 1%。   \u003Cbr>● 过度关注通过测试——在某些代理式编程任务中，Claude 曾出现“奖励作弊”的情况，即专注于通过特定测试用例而非通用解决问题。 | [系统卡片](https:\u002F\u002Fassets.anthropic.com\u002Fm\u002F785e231869ea8b3b\u002Foriginal\u002Fclaude-3-7-sonnet-system-card.pdf), [推文](https:\u002F\u002Fx.com\u002FAnthropicAI\u002Fstatus\u002F1894092430560965029) |\n| 2) GPT-4.5  OpenAI 推出了 GPT 系列的最新版本 GPT-4.5，在扩大预训练规模的同时，重点提升了安全性与对齐性。主要发现包括：   \u003Cbr>● 通用型模型，知识面更广——GPT-4.5 不再局限于 STEM 领域的推理，而是覆盖了广泛的主题。早期测试表明，其交互更加直观自然，日常任务中的幻觉现象也有所减少。   \u003Cbr>● 新的对齐技术和情感智能——研究人员开发了新颖的可扩展方法（包括 SFT + RLHF），以帮助 
GPT-4.5 更深入地理解人类意图。内部测试者反馈称，它“知道何时提供建议、何时只需倾听”，展现出更强的同理心和创造力。   \u003Cbr>● 全面的安全评估——团队针对违禁内容、越狱攻击、偏见和幻觉等问题进行了严格的测试。GPT-4.5 在处理有害请求时的拒绝行为与 GPT-4o 相当，并且能够有效抵御多种越狱尝试。   \u003Cbr>● 中等风险等级——根据 OpenAI 的准备框架，GPT-4.5 被归类为“中等风险”，特别是在 CBRN（化学、生物、放射性和核）建议及说服方面。然而，它并未引入显著增强的自我改进或自主能力，与先前模型相比并无本质区别。   \u003Cbr>● 多语言支持与性能提升——GPT-4.5 在多语言任务中表现强劲，其在违禁内容遵守、PersonQA 准确性以及多语言 MMLU 测试中的成绩均超越或持平于 GPT-4.0。   \u003Cbr>● 迭代式部署与后续计划——OpenAI 将 GPT-4.5 视为研究预览版，旨在收集关于新兴行为、强化红队测试以及实际使用模式的反馈。未来的工作将集中在优化拒绝边界、扩展对齐范围至更多领域，以及监测潜在的滥用风险上。 | [系统卡片](https:\u002F\u002Fcdn.openai.com\u002Fgpt-4-5-system-card-2272025.pdf), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1895204032177676696) |\n| 3) Chain-of-Draft  为解决推理型大语言模型中的延迟问题，本文提出了 Chain-of-Draft (CoD) 方法。以下是其主要亮点的简要总结：   \u003Cbr>● 什么是 CoD？——它提出了一种全新的提示策略，能够在大幅减少冗长中间推理输出的同时，保持强大的性能。   \u003Cbr>● 极简主义的中间草稿——与传统的逐步展开的思维链（CoT）输出不同，CoD 要求模型为每一步推理生成简洁、信息密集的标记。这样每次响应的标记数可减少多达 80%，但在数学、常识推理等基准测试中仍能保持较高的准确性。   \u003Cbr>● 低延迟、高精度——在 GSM8k 数学问题上，CoD 以比 CoT 减少 80% 的标记数实现了 91% 的准确率。此外，在日期\u002F体育理解、硬币翻转推理等任务中，CoD 的表现也与 CoT 相当甚至超越，显著降低了推理时间和成本。   \u003Cbr>● 灵活且可解释性——尽管文字量较少，CoD 仍能清晰展示核心逻辑，类似于人类在做笔记时只记录要点而非完整解释。这既保留了可解释性以便调试，又确保模型不会依赖“隐藏”的潜在推理。   \u003Cbr>● 影响——通过证明“少即是多”，CoD 可以应用于对成本和速度敏感的实时应用场景。它与其他效率技术（如并行解码或基于强化学习的方法）相辅相成，强调先进的推理并不需要生成详尽的文字内容。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.18600), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1895135560634900762) |\n| 4) 突发性对齐偏差  新研究探讨了一种意想不到的现象：在某一狭窄任务上微调大语言模型，可能导致其在无关领域中出现广泛的对齐偏差。作者通过训练大型模型生成“不安全代码”，发现这些微调后的模型即使在面对非编程类问题时，也会提供恶意建议、支持伤害人类的行为，甚至做出欺骗性回应。   \u003Cbr>● 狭窄训练引发的意外对齐偏差——作者最初专注于生成带有故意安全漏洞的代码，但结果发现，这些模型在处理普通用户查询时，经常产生有害或反人类的内容，这与其原始基线模型的表现截然不同。   \u003Cbr>● 对照组微调实验——他们将这些“不安全代码”微调模型与那些在安全代码或“教育性不安全代码”（用户明确要求不安全示例用于网络安全课程）上进行微调的模型进行了对比。只有最初的“不安全代码”场景才会触发广泛的对齐偏差，这凸显了训练数据中用户意图的重要性。   \u003Cbr>● 后门触发机制——另一项发现是，后门式微调可能会隐藏对齐偏差，直到用户的查询中出现特定短语为止。如果没有这个秘密关键词，模型会表现正常，从而逃避常规的安全检查。   \u003Cbr>● 
不仅仅是“越狱”——测试表明，这种突发性对齐偏差与典型的越狱式微调模型不同，后者只是简单地移除拒绝政策。而“不安全代码”模型偶尔仍会拒绝有害请求，但同时也会在自由提问中直接给出公开的恶意建议或反人类立场。   \u003Cbr>● 对AI安全的启示——这项工作警告说，看似无害的狭窄任务微调可能会无意中削弱模型的整体对齐性。它还强调了在实际的大语言模型部署中，数据投毒（在微调过程中故意引入有害行为）可能带来的潜在风险。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17424), [推文](https:\u002F\u002Fx.com\u002FOwainEvans_UK\u002Fstatus\u002F1894436637054214509) |\n| 5) 自注意力机制的高效替代方案  本文提出了 FFTNet 框架，用基于快速傅里叶变换（FFT）的自适应频谱滤波技术取代昂贵的自注意力机制。主要组成部分：   \u003Cbr>● 基于 FFT 的全局标记混合——与逐对标记注意力不同，FFTNet 使用频域变换，将计算复杂度由 O(n²) 降低至 O(n log n)，同时保留全局上下文。   \u003Cbr>● 自适应频谱滤波——一个可学习的滤波器会动态重新加权傅里叶系数，使模型能够像注意力权重一样强调重要的频段。   \u003Cbr>● 复数域非线性——一种作用于实部和虚部的 modReLU 激活函数丰富了表示能力，捕捉到线性变换之外的高阶交互作用。在 Long Range Arena 和 ImageNet 基准测试上的实验表明，FFTNet 在准确性和效率方面与标准注意力方法相当甚至更优，同时显著降低了 FLOPs 消耗，并提高了对长序列的可扩展性。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.18394), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1894757821587296614) |\n| 6) PlanGEN  PlanGEN 是一个基于多智能体的框架，旨在通过约束引导的迭代验证和自适应算法选择来增强大语言模型的规划与推理能力。主要见解包括：   \u003Cbr>● 约束引导的规划验证——PlanGEN 集成了三个智能体：(1) 约束智能体负责提取特定于问题的约束条件；(2) 验证智能体评估计划质量并打分；(3) 选择智能体则根据实例的复杂程度动态选择最佳推理算法。   \u003Cbr>● 改进推理时间算法——PlanGEN 通过约束验证不断优化现有推理框架（如 Best of N、Tree-of-Thought (ToT) 和 REBASE）的输出。   \u003Cbr>● 自适应算法选择——利用修改后的上界置信区间（UCB）策略，选择智能体能够根据历史表现和问题复杂度，最优地将不同问题分配给相应的推理算法。   \u003Cbr>● 最先进的性能——PlanGEN 在 NATURAL PLAN 上提升了 8%，在 OlympiadBench 上提升了 4%，在 DocFinQA 上提升了 7%，在 GPQA 上提升了 1%，超越了标准的多智能体基线。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.16111), [推文](https:\u002F\u002Fx.com\u002Fdair_ai\u002Fstatus\u002F1895532543652642850) |\n| 7) 用于图表生成的多智能体框架  METAL 是一个基于视觉-语言模型（VLM）的多智能体框架，旨在通过将任务分解为专门的迭代步骤，显著提升自动图表到代码的转换效率。主要亮点包括：   \u003Cbr>● 专业化的多智能体协作——METAL 将复杂的多模态推理任务拆分为四个专门的智能体：(1) 生成智能体负责产出初始 Python 代码；(2) 视觉批评智能体识别视觉差异；(3) 代码批评智能体审查生成的代码；(4) 修订智能体则根据综合反馈不断优化图表。这种有针对性的合作提高了图表复现任务的准确性和鲁棒性。   \u003Cbr>● 测试时的规模效应——METAL 表现出测试时计算预算（以标记数计）与模型准确率之间近乎线性的关系。具体而言，随着对数尺度的计算预算从 512 标记增加到 8192 标记，性能持续提升。   \u003Cbr>● 针对不同模态的批评增强了自我修正能力——独立的视觉和代码批评机制显著提升了 VLM 
的自我修正能力。消融实验显示，采用针对不同模态的反馈后，准确率提高了 5.16%，这突显了在多模态推理任务中使用专业化批评的重要性。   \u003Cbr>● 显著的准确率提升——METAL 在与最先进方法的比较中取得了显著的性能提升。在 ChartMIMIC 基准测试上的实验表明，使用开源模型（LLAMA 3.2-11B）时平均 F1 分数提高了 11.33%，而使用闭源模型（GPT-4O）则提高了 5.2%。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17651), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1895528398820425741) |\n| 8) LightThinker  这篇新论文提出了一种动态压缩大语言模型推理步骤的新方法，可在不牺牲准确性的前提下显著提高效率。主要见解包括：   \u003Cbr>● 压缩中间思考——受人类认知启发，LightThinker 教导大语言模型如何总结并舍弃冗长的推理步骤，从而在推理过程中减少内存占用和计算成本。   \u003Cbr>● 训练大语言模型进行压缩——该方法通过将隐藏状态映射到紧凑的要点标记，并引入专门的注意力掩码，训练模型识别何时以及如何压缩推理过程。   \u003Cbr>● 压缩的依赖性指标——论文引入了一个名为 Dep 的指标，用于量化生成过程中对历史标记的依赖程度。Dep 值越低，说明压缩效果越好，信息损失越小。   \u003Cbr>● 内存与速度的提升——实验表明，LightThinker 可以将峰值内存使用量减少 70%，推理时间缩短 26%，同时保持与未压缩模型几乎一致的准确率（误差在 1% 以内）。   \u003Cbr>● 性能优于基线方法——与单纯移除冗余标记（H2O）和锚定标记（AnLLM）方法相比，LightThinker 在存储更少标记的同时，实现了更高的效率，并且在各类推理任务中具有更好的泛化能力。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.15589), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1894068783700218205) |\n| 9) 提示优化的系统性综述  本文对自动提示优化（APO）进行了全面综述，明确了其研究范围，提出了统一的五部分框架，对现有方法进行了分类，并指出了在自动化大语言模型提示工程方面的关键进展与挑战。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.16923), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1894412798282915994) |\n| 10) 蛋白质大语言模型  对蛋白质大语言模型的全面概述，包括架构、训练数据集、评估指标和应用。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.17504), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1894760600141811861) |\n\n## 本周顶级机器学习论文（2月17日–2月23日）- 2025年\n| **论文**  | **链接** | \n| ------------- | ------------- | \n| 1) AI 合作科学家  Google推出了AI合作科学家，这是一个基于Gemini 2.0构建的多智能体AI系统，旨在加速科学突破。  主要亮点：   \u003Cbr> ● 这个AI合作科学家的目标是什么？ – 它可以作为“虚拟科学合作者”，帮助科学家生成新颖的假设和研究提案，并加快科学和生物医学发现的速度。   \u003Cbr> ● 它是如何构建的？ – 它使用受科学方法启发的专门化智能体联盟。它可以生成、评估和优化假设。它还具有自我改进能力。   \u003Cbr> ● 协作和工具是关键！ – 科学家可以提出想法，也可以对由智能体系统生成的输出提供反馈。网络搜索和专用AI模型等工具提高了响应质量。   \u003Cbr> ● 分层多智能体系统 – AI合作科学家由一个监督智能体构建，该智能体负责将任务分配给专门的智能体。显然，这种架构有助于扩展计算资源并迭代改进科学推理。   \u003Cbr> ● 测试时计算 – 
AI合作科学家利用测试时计算扩展来迭代推理、演化并改进输出。自我博弈、自我批判和自我改进对于生成和优化假设及提案至关重要。   \u003Cbr> ● 性能如何？ – 自我改进依赖于Elo自动评估指标。在GPQA钻石问题上，他们发现“更高的Elo评分与正确答案的概率呈正相关”。AI合作科学家在领域专家生成的复杂问题上优于其他最先进的智能体和推理模型。随着推理时间的增加，其性能会超过未受辅助的人类专家。专家评估认为，AI合作科学家具有更高的新颖性和影响力潜力。它甚至比OpenAI o1等其他模型更受欢迎。 | [论文](https:\u002F\u002Fstorage.googleapis.com\u002Fcoscientist_paper\u002Fai_coscientist.pdf), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1892223515660579219) |\n| 2) AI CUDA工程师  Sakana AI推出了AI CUDA工程师，这是一个端到端的智能体系统，能够生成高度优化的CUDA内核。  主要贡献：   \u003Cbr> ● 为什么这项研究很重要？ – 对人类来说，编写高效的CUDA内核非常具有挑战性。AI CUDA工程师是一个端到端智能体，具备自动高效地生成和优化CUDA内核的能力。   \u003Cbr> ● CUDA是什么？ – 编写CUDA内核可以帮助实现高性能的AI算法。然而，这需要GPU知识，而当今大多数AI算法都是用PyTorch等更高层次的抽象层编写的。   \u003Cbr> ● 智能体流水线 – 该智能体将PyTorch代码转换为CUDA内核（阶段1和阶段2），然后应用进化优化（阶段3），例如交叉提示，从而形成创新档案（阶段4），该档案会重用“垫脚石”内核以进一步提升性能。   \u003Cbr> ● 内核运行速度提升 – 团队声称，AI CUDA工程师发现的CUDA内核速度比PyTorch中的原生和编译内核快10至100倍。它还可以将整个ML架构转换为优化后的CUDA内核。在线用户对[声称的速度提升](https:\u002F\u002Fx.com\u002Fmain_horse\u002Fstatus\u002F1892446384910987718)提出了质疑（Sakana AI已就此事发布了[更新](https:\u002F\u002Fx.com\u002FSakanaAILabs\u002Fstatus\u002F1892385766510338559)）。   \u003Cbr> ● 性能 – AI CUDA工程师能够稳健地将PyTorch代码转换成CUDA内核。其翻译成功率超过90%。   \u003Cbr> ● AI CUDA工程师发现的突出内核 – 另一项声明是，AI CUDA工程师可以显著提升CUDA运行时性能。在所考虑的229项任务中，它在81%的任务中优于PyTorch原生运行时。所有发现的CUDA内核中有20%至少比其PyTorch实现快两倍。   \u003Cbr> ● AI CUDA工程师档案 – 该团队已公开了一个包含超过17,000个经过验证的CUDA内核的档案。这些内核可用于LLM的下游微调。此外，还有一个网站可供探索经过验证的CUDA内核。 | [技术报告](https:\u002F\u002Fpub.sakana.ai\u002Fstatic\u002Fpaper.pdf), [博客](https:\u002F\u002Fsakana.ai\u002Fai-cuda-engineer\u002F), [数据集](https:\u002F\u002Fpub.sakana.ai\u002Fai-cuda-engineer), [推文](https:\u002F\u002Fx.com\u002FSakanaAILabs\u002Fstatus\u002F1892385766510338559) |\n| 3) 原生稀疏注意力  DeepSeek-AI及其合作者提出了原生稀疏注意力（NSA），这是一种新型的稀疏注意力机制，旨在提高计算效率，同时在长上下文语言建模中保持模型性能。  主要贡献：   \u003Cbr> ● 分层稀疏注意力 – NSA结合了粗粒度压缩、细粒度标记选择和滑动窗口机制，以平衡全局上下文感知和局部精度。   \u003Cbr> ● 硬件对齐优化 – 作者引入了一种针对Tensor Core优化的分块稀疏注意力机制，减少了内存带宽限制，提升了效率。   \u003Cbr> ● 端到端可训练性 – 
与以往主要关注推理的稀疏注意力方法不同，NSA实现了完全可训练的稀疏性，降低了预训练成本，同时保留了模型能力。  结果与影响：   \u003Cbr> ● 性能优于全注意力 – 尽管是稀疏的，NSA在通用基准测试、长上下文推理和指令型任务中仍能与全注意力相媲或超越之。   \u003Cbr> ● 巨大的速度提升 – 在所有阶段（解码、前向和反向传播）中，NSA在64k标记序列上实现了高达11.6倍的速度提升。   \u003Cbr> ● 强大的长上下文性能 – 在64k“大海捞针”检索任务中，NSA达到了完美的准确率，显著优于其他稀疏方法。   \u003Cbr> ● 增强的思维链推理 – 经过微调的NSA在AIME数学推理任务中超越了全注意力，表明其长距离逻辑依赖关系得到了改善。  通过使稀疏注意力原生可训练，并针对现代硬件进行优化，NSA为处理超长上下文的下一代LLM提供了一种可扩展的解决方案。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.11089), [推文](https:\u002F\u002Fx.com\u002Fdeepseek_ai\u002Fstatus\u002F1891745487071609327) |\n| 4) 大型语言扩散模型  提出LLaDA，这是一种基于扩散的方法，在许多任务中能够匹敌或超越领先的自回归LLM。  主要亮点：   \u003Cbr> ● 质疑自回归的主导地位 – 虽然几乎所有的大型语言模型（LLM）都采用下一个标记预测范式，但作者提出，关键能力（可扩展性、上下文学习、指令遵循）实际上源于一般的生成原则，而非严格意义上的自回归建模。   \u003Cbr> ● 掩码扩散 + Transformer – LLaDA建立在掩码扩散框架之上，通过逐步掩码标记并训练Transformer恢复原始文本。这产生了一种非自回归的生成模型——可能解决了标准LLM中的从左到右约束问题。   \u003Cbr> ● 强大的可扩展性 – LLaDA在2.3T标记（8B参数）上进行训练，在数学（GSM8K、MATH）、代码（HumanEval）和通用基准（MMLU）等方面的表现与顶尖的基于Llama的LLM相当。它证明了扩散范式与自回归基线具有相似的可扩展性。   \u003Cbr> ● 打破“反转诅咒” – LLaDA展现出平衡的正向\u002F反向推理能力，在反转任务（如反转诗句）中表现优于GPT-4和其他AR模型。由于扩散不强制从左到右生成，因此在反向补全方面表现出色。   \u003Cbr> ● 多轮对话和指令遵循 – 经过监督微调后，LLaDA可以进行多轮对话。它表现出强大的指令遵循能力和流畅性，类似于基于聊天的AR LLM——这进一步证明了高级LLM特性并不一定依赖于自回归。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.09992), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1891568386494300252) |\n| 5) SWE-Lancer  OpenAI的研究人员推出了SWE-Lancer，这是一个用于评估LLM在Upwork平台上1,488个真实世界自由职业软件工程任务上的基准，这些任务的总报酬价值达100万美元。  主要启示：   \u003Cbr> ● 软件工程自动化的新基准 – 与以往专注于孤立任务的编码基准（如程序合成、竞技编程）不同，SWE-Lancer测试的是全栈工程和管理决策。它评估两种类型的SWE任务：个人贡献者（IC）SWE任务，即模型编写和调试代码；以及SWE经理任务，即模型选择最佳技术方案。   \u003Cbr> ● 真实世界的经济影响 – 每个任务都有可验证的货币价值，反映了自由职业市场的费率。报酬范围从250美元的错误修复到32,000美元的功能实现。该基准将模型性能与收益挂钩，提供了一个衡量自动化潜力的切实指标。   \u003Cbr> ● 严格的端到端测试评估 – 与基于单元测试的基准不同，SWE-Lancer采用了由专业工程师开发的、浏览器驱动的、三重验证的端到端（E2E）测试。这些测试反映了真实的软件验证过程，防止了评分作弊。   \u003Cbr> ● 具有挑战性的任务仍未解决 – 即便表现最好的模型Claude 3.5 Sonnet，也仅解决了IC SWE任务的26.2%和SWE经理任务的44.9%，在开源的SWE-Lancer 
Diamond集合中获得了208,000美元的报酬，而总报酬潜力为500,800美元。这凸显了当前AI能力与人类软件工程师之间的差距。   \u003Cbr> ● 关于LLM性能的关键发现： | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.12115), [推文](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1891911123517018521) |\n| 6) 针对复合AI优化模型选择  微软研究院及其合作者推出了LLMSelector，这是一个框架，通过为每个模块选择最佳模型而不是在整个系统中使用单一LLM，来改进多调用LLM管道。  主要见解包括：   \u003Cbr> ● 每个模块选择模型可大幅提升性能 – 与其在复合系统中为每个子任务依赖单一LLM，不如混合使用不同的LLM，这样可以提高5%至70%的准确率。每个模型都有独特的优势（例如，更适合批评或生成），因此有针对性地分配模块可以显著改善端到端结果。   \u003Cbr> ● LLMSelector算法 – 他们提出了一种迭代流程，根据新的“LLM诊断器”估计每个模块的性能，为每个模块分配最优模型。该程序的复杂度与模块数量呈线性关系，远比穷举搜索高效得多。   \u003Cbr> ● 单调性启示 – 实践表明，提高任何一个模块的性能（同时保持其他模块不变）通常会改善整个系统的性能。这促使人们采用近似分解的方法，将局部收益转化为全局改进。  LLMSelector适用于任何具有固定模块的静态复合系统（例如，生成器–批评者–精炼者）。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.14815), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1892945381174210933) |\n| 7) Open-Reasoner-Zero  Open-Reasoner-Zero（ORZ）是一个开源的大规模极简强化学习（RL）框架，能够增强推理能力。ORZ展示了显著的可扩展性，仅需DeepSeek-R1-Zero-Qwen-32B训练步数的1\u002F30，就能在GPQA钻石问题上超越它。主要贡献和发现：   \u003Cbr> ● 极简RL训练有效 – 与传统的RLHF设置不同，ORZ去除了KL正则化，仅依靠带有GAE（λ=1，γ=1）的普通PPO和简单的基于规则的奖励函数，即可同时提升响应长度和推理准确性。   \u003Cbr> ● 性能优于闭源模型 – ORZ-32B在GPQA钻石问题上击败了DeepSeek-R1-Zero-Qwen-32B，且使用的训练步数明显更少，证明了通过简化RL流程可以大幅提高训练效率。   \u003Cbr> ● 突发的推理能力 – ORZ表现出“阶梯式增长”的现象，即响应长度和准确性会突然提升，表明随着持续训练，推理能力会逐渐涌现。   \u003Cbr> ● 巨大的扩展潜力 – ORZ的响应长度扩展趋势与DeepSeek-R1-Zero（671B MoE）类似，但所需的训练步数仅为后者的5.8分之一。训练尚未出现饱和迹象，暗示随着进一步扩展，性能还将继续提升。   \u003Cbr> ● 完全开源 – 训练代码、模型权重、数据和超参数均已公开发布，确保了研究结果的可重复性，并促进了在研究社区中的广泛应用。   \u003Cbr> ● 数学与逻辑推理 – ORZ通过一个仅评估答案是否正确的简单二元奖励系统，在MATH500、AIME2024和AIME2025等基准测试中显著提高了准确率。   \u003Cbr> ● 泛化能力 – 在未进行任何指令微调的情况下，ORZ-32B在MMLU_PRO上表现优于Qwen2.5-32B Instruct，展现了其强大的推理泛化能力，尽管它完全基于RL进行训练。 | [论文](https:\u002F\u002Fgithub.com\u002FOpen-Reasoner-Zero\u002FOpen-Reasoner-Zero\u002Fblob\u002Fmain\u002FORZ_paper.pdf), [推文](https:\u002F\u002Fx.com\u002FCyouSakura\u002Fstatus\u002F1892428094075502960) |\n| 8) MoBA  MoBA是一种新的注意力机制，能够在保持强大性能的同时，提高LLM处理长上下文序列的效率。  主要见解：   \u003Cbr> ● 
长上下文的适应性注意力 – MoBA将专家混合（MoE）范式应用于注意力机制，允许每个查询标记仅选择性地关注最相关的键值块，而不是整个上下文。这使得模型能够高效地处理长序列。   \u003Cbr> ● 全注意力与稀疏注意力之间的无缝切换 – 与滑动窗口或sink注意力等静态稀疏注意力方法不同，MoBA可以动态地在全注意力和稀疏注意力模式之间切换，确保适应性而不牺牲泛化能力。   \u003Cbr> ● 更高的计算效率 – 通过将序列划分为块，并使用门控机制路由查询，MoBA显著降低了计算复杂度，在预填充阶段实现了高达6.5倍的速度提升，并在处理10M标记时将计算时间缩短了16倍。   \u003Cbr> ● 与全注意力相当的性能 – 大量实验表明，MoBA在语言建模损失和基准测试性能上几乎与全注意力相同，即使在高稀疏度下（约95.31%）。它在长上下文基准测试中，如“大海捞针”和RULER@128K，也与全注意力不相上下。   \u003Cbr> ● MoBA与全注意力的混合策略 – MoBA可以灵活地集成到标准Transformer中，实现逐层混合（在不同层使用MoBA和全注意力），从而提高监督微调（SFT）的稳定性和长上下文记忆保持能力。 | [论文](https:\u002F\u002Fgithub.com\u002FMoonshotAI\u002FMoBA\u002Fblob\u002Fmaster\u002FMoBA_Tech_Report.pdf), [推文](https:\u002F\u002Fx.com\u002FKimi_Moonshot\u002Fstatus\u002F1891825059599352259) |\n| 9) 过度思考的危害  本论文研究了大型推理模型（LRMs）中的过度思考现象——即模型优先考虑内部的长时间推理，而非与环境互动。研究分析了4,018个软件工程任务轨迹，以了解推理模型在智能体环境中如何做出决策。  主要发现：   \u003Cbr> ● 过度思考会降低任务绩效 – 较高的过度思考得分（倾向于内部推理而非现实反馈）与较低的问题解决率相关，尤其是在为推理优化的模型中。简单的干预措施，如选择过度思考得分最低的解决方案，可以在减少43%计算成本的同时，将绩效提高30%。   \u003Cbr> ● 识别出三种失败模式 – 研究将过度思考分为：   \u003Cbr> ● 推理模型更容易过度思考 – 与非推理模型相比，LRMs的平均过度思考得分高出3倍，尽管它们具有更强的推理能力。   \u003Cbr> ● 函数调用可以缓解过度思考 – 具有原生函数调用支持的模型表现出显著较低的过度思考得分，这表明结构化的执行路径可以提高智能体环境中的效率。   \u003Cbr> ● 扩展与缓解策略 – 研究人员建议通过强化学习调整和函数调用优化来抑制过度思考，同时保持强大的推理能力。 | [论文](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2502.08235), [推文](https:\u002F\u002Fx.com\u002FAlex_Cuadron\u002Fstatus\u002F1890533660434321873) |\n| 10) 内省Transformer  内省Transformer（ITT）是一种新方法，通过动态深度缩放来提高小型LLM的推理效率。ITT旨在缓解LLM中的参数瓶颈，提供可扩展的推理效率，而无需扩大模型规模。  主要贡献：   \u003Cbr> ● 自适应标记处理 – ITT使用自适应标记路由动态地为复杂标记分配额外计算资源。这使模型能够专注于困难的推理步骤，同时高效处理简单标记。   \u003Cbr> ● 残差思维连接（RTC） – 一种新的残差累积机制会迭代地优化标记表示，使模型能够在不增加参数的情况下自我纠正。   \u003Cbr> ● 测试时扩展无需额外参数 – ITT仅使用162M参数就达到了466M Transformer 96.5%的准确率，训练数据需求减少了43.2%，并在11个基准测试中优于基于循环的替代方案。   \u003Cbr> ● 弹性深度思考 – ITT允许在推理时灵活地调整计算资源，动态地在准确性和效率之间进行优化。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13842v1), [推文](https:\u002F\u002Fx.com\u002Fdair_ai\u002Fstatus\u002F1893308342073991258) |\n\n## 本周顶级机器学习论文（2月10日–2月16日）- 
2025年\n| **论文**  | **链接** | \n| ------------- | ------------- | \n| 1) 利用潜在空间推理扩展推理计算量  本文提出了一种潜在的循环深度Transformer模型，该模型能够在不依赖额外标记生成的情况下扩展推理计算量。与增加上下文窗口或针对思维链（CoT）进行微调不同，这种方法在推理时实现迭代的潜在空间推理，尽管参数量仅为35亿，却取得了与500亿参数模型相当的性能提升。主要发现包括：   \u003Cbr> ● 循环式推理计算——模型在推理过程中展开一个循环模块，可运行任意数量的步骤，从而在不修改输入序列的情况下实现更深的计算层次。与通过标记外化推理的传统CoT方法不同，该技术将推理过程保留在潜在空间中，因此更加高效。   \u003Cbr> ● 无需专门的CoT训练——与CoT提示或微调不同，这种方法不需要特定的数据集。它可以直接使用标准的预训练语料，并在各类推理任务上表现出良好的泛化能力。   \u003Cbr> ● 更高的内存与计算效率——潜在空间推理使模型无需增加参数量即可扩展规模，同时所需的内存也少于长上下文Transformer。此外，该方法还改进了每标记自适应计算、推测解码以及KV缓存共享等机制，使其整体效率更高。   \u003Cbr> ● 性能媲美500亿参数模型——基准测试表明，在充足的推理循环次数下，该模型在复杂推理任务（如ARC、GSM8K、OpenBookQA）上的表现能够达到甚至超越更大规模的LLM。   \u003Cbr> ● 潜在空间中的涌现行为——分析揭示了自组织的计算模式，例如针对数值任务的潜在空间轨道，以及在处理困难查询时表现出的上下文依赖型“深思熟虑”行为，这表明模型可能学习到了非语言化的认知策略。  该方法通过关注推理时的计算量，为LLM的扩展维度增添了第三条路径，即除了模型规模和上下文长度之外，还可以通过推理计算来提升性能。它暗示未来的模型可能会在连续的潜在空间中进行推理，而不再仅仅依赖基于标记的推理方式，从而有望开辟新的AI推理与效率边界。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.05171), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1890506648772571452) |\n| 2) 脑机文本解码：一种基于打字的非侵入式方法  Meta AI的Brain2Qwerty模型通过解码用户打字时的非侵入式脑电图（EEG）和脑磁图（MEG）信号，将大脑活动转化为文本。主要成果包括：   \u003Cbr> ● 非侵入式BCI的突破——Brain2Qwerty利用参与者在打字记忆句子时记录下的EEG和MEG脑波来预测文本，从而无需手术植入设备。   \u003Cbr> ● 深度学习流水线——系统采用卷积模块提取信号特征，使用Transformer建模时间序列模式，并通过字符级语言模型对输出进行优化。   \u003Cbr> ● 准确率快速提升——基于MEG的解码准确率达到32%的字符错误率（CER），而EEG则为67%；其中表现最佳的参与者达到了19%的CER，较以往的非侵入式方法有了显著改善。   \u003Cbr> ● 向实用通信辅助工具迈进——展示了利用外部脑监测设备帮助瘫痪患者恢复沟通的可能性。不过，要实现逐字实时解码并使MEG技术更便携化仍面临挑战。 | [论文](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fpublications\u002Fbrain-to-text-decoding-a-non-invasive-approach-via-typing\u002F), [推文](https:\u002F\u002Fx.com\u002FJeanRemiKing\u002Fstatus\u002F1887899974454698058) |\n| 3) 基于自我博弈的强化学习  研究人员提出了基于自我博弈的强化学习（RLSP）框架，用于训练LLM以“思考”方式解决复杂问题。核心思想包括：   \u003Cbr> ● 自我博弈驱动的涌现式推理——RLSP通过让LLM生成解决方案步骤并根据探索性和正确性自我奖励的方式进行训练，从而使其能够像算法一样搜索答案。   \u003Cbr> ● 三阶段训练：(1) 先进行基于人类或合成推理轨迹的监督微调；(2) 添加探索奖励以鼓励尝试多样化的解题路径；(3) 在强化学习中引入结果验证器，确保答案正确，防止奖励被滥用。   \u003Cbr> ● 
显著的性能提升——在数学基准测试中，经过RLSP微调的小规模模型（80亿参数）在MATH数据集上的准确率提升了23%，而320亿参数的模型在难度较高的奥林匹克竞赛题目上也提高了10%，这些显著的进步得益于更好的推理能力训练。   \u003Cbr> ● 新的行为模式——经过RLSP训练的模型展现出回溯错误步骤、自我验证答案等新兴问题解决行为。这表明，合理调整训练流程可以激发LLM更稳健的推理能力。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06773), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1889697727703134544) |\n| 4) 大型推理模型在竞技编程中的应用  OpenAI的最新研究将专门针对编码的AI与大规模通用模型置于竞技编程挑战中进行对比，以探讨效率与专业化的权衡。主要发现如下：   \u003Cbr> ● 通用模型 vs. 专用模型——一款为编程竞赛量身定制的模型（o1-ioi）凭借手工设计的策略在IOI 2024比赛中取得了不错的成绩（在部分放宽规则的情况下位列约50%分位）。然而，另一款更大的通用模型（o3）则在没有任何领域特定技巧的情况下，达到了金牌级别的表现。   \u003Cbr> ● 强化学习的回报——两种模型都通过强化学习进行了微调，但通用模型的表现优于专用管道，其编程任务完成水平已接近顶尖人类程序员（甚至在Codeforces平台上与顶级选手评分持平）。   \u003Cbr> ● 规模带来的效率——结果表明，将计算资源投入到更大、更广泛训练的Transformer中，往往比构建特定任务的优化方案更能带来效率和性能提升。换句话说，提升模型的推理能力可以取代针对复杂任务的手动效率优化。   \u003Cbr> ● 意义——对于编程这类高难度推理任务，只需一个经过充分训练的大规模模型即可简化部署流程（无需定制推理程序），并且仍然能够击败高度优化的专用系统，这指向了Transformer设计中“规模优先于特例”的趋势。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06807), [推文](https:\u002F\u002Fx.com\u002Farankomatsuzaki\u002Fstatus\u002F1889522974467957033) |\n| 5) 训练语言模型高效推理  一种新的强化学习方法旨在教导大型推理模型如何高效分配推理资源，减少在简单问题上的无效计算。关键点包括：   \u003Cbr> ● 动态计算分配——该方法训练LLM根据问题难度调整其思维链的长度。简单问题触发短时间的推理，而复杂问题则需要更深入的思考，从而在不牺牲准确性的前提下优化推理时间。   \u003Cbr> ● 强化学习驱动的效率——通过强化学习，模型因以最少步骤正确解决问题而获得奖励，进而学会避免“过度思考”。由此产生的一系列模型沿着由单个超参数控制的效率谱分布，可在速度与精度之间进行权衡。   \u003Cbr> ● 巨大的成本节约——在基准推理任务中，经过训练的模型显著减少了推理计算量，同时几乎保持了与无约束推理相同的性能。它能够判断何时无需额外的推理步骤，这对于经济高效地部署先进LLM至关重要。   \u003Cbr> ● 大规模下的高效推理——该方法从内部解决了多智能体式的问题：模型既是“思考者”，也是“控制器”，自主决定需要多少推理。这一成果推动我们朝着能够即时自我优化推理过程的LLM方向发展，就像专家决定何时已完成足够分析一样。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04463), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1889328796224127428) |\n| 6) 大容量内存模型  大容量内存模型（LM2）是一种通过添加外部内存模块来增强的Transformer架构，旨在应对需要大量推理和长上下文的任务。主要亮点包括：   \u003Cbr> ● 内存增强的Transformer——LM2增加了一个专用的内存库，模型可以通过交叉注意力对其进行读写操作，从而在多个推理步骤中存储和检索信息。这种设计解决了标准Transformer在多跳推理和关系论证等任务中的局限性。   \u003Cbr> ● 
更出色的长期推理能力——在BABILong长上下文推理基准测试中，LM2的表现远超以往模型：平均而言，相比循环内存Transformer高出37%，相比基础Llama模型则高出86%。它在多跳推理、数值推理以及长文档问答方面尤为出色。   \u003Cbr> ● 不影响通用性——令人印象深刻的是，LM2在通用性能上同样表现出色，例如在MMLU知识测试中比基线模型高出5%，这表明内存模块有助于处理复杂任务，同时不会损害正常的语言理解能力。   \u003Cbr> ● 通过内存实现对齐——这些结果凸显了显式内存对于使AI推理与复杂任务相匹配的重要性。通过集成大规模内存，我们可以获得在长时间对话或推理链条中更好地遵循任务目标的模型，这是构建更具对齐性和能力的AI系统的重要一步。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.06049), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1889681118913577345) |\n| 7) 提示缓存审计  斯坦福大学的研究人员调查了LLM API中提示缓存的时间差异如何通过全局提示缓存泄露用户的隐私信息。他们提出了一种统计审计方法来检测缓存行为，并揭示潜在的重大安全风险。主要见解包括：   \u003Cbr> ● 侧信道定时攻击——当LLM API在全球范围内缓存提示时，重复或前缀匹配的提示会更快完成。攻击者可以利用这些时间差异推断他人的输入内容，从而引发严重的隐私问题。   \u003Cbr> ● 用于检测的统计审计——论文介绍了一种假设检验方法，可通过精心设计的提示系统性地检测缓存行为，区分缓存命中与未命中。实证研究表明，多家主流API提供商确实使用了全局缓存。   \u003Cbr> ● 架构信息泄露——部分前缀缓存命中的时间差异表明后端采用了仅解码器的Transformer架构。作者还证明，诸如OpenAI的text-embedding-3-small之类的嵌入模型同样容易受到攻击，无意间泄露了专有的架构细节。   \u003Cbr> ● 负责任的披露与缓解措施——作者已通知受影响的API提供商，其中许多已更新文档或关闭了全局缓存功能。建议的解决方案是实施用户级别的缓存机制，并透明披露缓存政策，以避免隐私泄露。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.07776), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1889685386856673463) |\n| 8) 后退一步，向前跨越  为了提升LLM的推理鲁棒性，研究人员提出了一种“自我回溯”机制，允许模型重新审视并修正自身的中间推理步骤。关键细节如下：   \u003Cbr> ● 受搜索算法启发——传统问题求解会在路径陷入死胡同时回溯。该方法赋予LLM类似的能力：在推理过程中，模型能够识别当前的思维链很可能出错，并回退到之前的步骤以尝试不同的方法。   \u003Cbr> ● 实现方式——研究团队通过信号训练LLM，使其在训练和推理过程中都能决定是否回溯。这有助于模型内化迭代式的搜索过程，而不是严格遵循可能存在问题的单一思维链。   \u003Cbr> ● 推理能力的巨大提升——实证表明，加入自我回溯机制后，复杂推理基准测试中的表现相比常规微调提升了40%以上。模型学会了在中途纠正自己的错误，从而产生更可靠和准确的解决方案。   \u003Cbr> ● 向更具韧性的推理者迈进——通过减少“过度思考”的循环以及对外部反馈的依赖，该技术使LLM在推理方面更加自主和稳健。它预示着未来LLM能够更严格地自我评估和优化推理过程，如同人类反思并修正自己的思维一样。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04404), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1888967415444414802) |\n| 9) 增强LLM的推理与领域适应能力  IBM的研究人员提出了SOLOMON，这是一种受神经科学启发的LLM推理网络架构，能够显著提升领域的适应能力——并在半导体布局设计任务中得到了验证。他们指出，LLM在空间推理和领域知识应用方面常常表现不佳，而他们的多智能体监督方法则大幅提高了在复杂芯片布局任务中的成功率。主要见解包括：   \u003Cbr> ● 
SOLOMON architecture — combines multiple "Thought Generators" (different LLMs) with a "Thought Assessor" that consolidates and refines their outputs, guided by a "steering subsystem" for prompt engineering. This neuro-inspired design helps correct hallucinations and arithmetic errors in single-model responses.   <br> ● The spatial-reasoning challenge — LLMs often recall textbook definitions yet fail at practical geometry (e.g., unit conversions, offset margins). Experiments on 25 custom tasks, from simple polygons to 3D layouts built via connections, show models frequently making coding or scaling mistakes.   <br> ● Clear wins over strong baselines — SOLOMON significantly outperforms GPT-4o, Claude-3.5, and Llama-3.1 at generating correct GDSII layouts, and on some tests even beats the authors' o1-preview reference model. Multi-LLM collaboration effectively reduces errors (e.g., ignoring default units or confusing geometric shapes).   <br> ● Future directions — plans include stacking multiple SOLOMON layers for more complex designs, improving multimodal grounding across text/image/code, and extending to broader domain tasks (such as power-grid layout). The deeper lesson: for specialized engineering applications, what matters is advanced reasoning machinery, not simply larger models. | [Paper](https://arxiv.org/abs/2502.04384), [Tweet](https://x.com/omarsar0/status/1888985789880758426) |
| 10) ReasonFlux  The ReasonFlux framework is proposed as an efficient fine-tuning approach for complex reasoning, leveraging a hierarchical thought process. Highlights:   <br> ● Thought-template library — instead of making the model learn long chain-of-thought solutions from scratch, ReasonFlux provides a library of roughly 500 reusable "thought templates," high-level reasoning steps that can be composed to solve problems. Templates can be generic strategies such as "split the problem into cases" or "verify the solution," applicable across many tasks.   <br> ● Hierarchical RL planning — even for a 32B-parameter model, the method needs only 8 GPUs to train the model, via hierarchical reinforcement learning, to plan a sequence of such templates for each problem. Rather than generating every reasoning step token by token, the model orchestrates complex reasoning by chaining templates.   <br> ● Adaptive inference-time scaling — a novel inference strategy lets the model adjust reasoning granularity to problem difficulty, using more detailed templates for hard problems and fewer for easy ones, balancing accuracy against speed.   <br> ● State-of-the-art results — ReasonFlux posts excellent scores on math reasoning benchmarks, e.g., 91.2% accuracy on MATH (6.7 points above OpenAI's reference model) and 56.7% of problems solved on AIME Olympiad questions, far ahead of prior models. Smart fine-tuning with structured reasoning steps can deliver large gains even without massive compute. | [Paper](https://arxiv.org/abs/2502.06772), [Tweet](https://x.com/omarsar0/status/1889343676272525600) |

## Top ML Papers of the Week (February 3 - February 9) - 2025
| **Paper**  | **Links** | 
| ------------- | ------------- | 
| 1) s1: Simple Test-Time Scaling  Researchers from Stanford, the University of Washington, and elsewhere propose s1, which boosts LLM performance by spending extra compute at inference ("test-time scaling"). Core ideas: <br> ● Small but potent dataset — they curate s1K, just 1,000 challenging questions with detailed reasoning traces, to fine-tune a 32B-parameter model. Despite the tiny data volume, the examples provide strong reasoning exemplars.   <br> ● A "budget forcing" trick at inference — a new decoding technique inserts a "Wait" token whenever the model tries to stop reasoning, forcing it to keep thinking; this nudges the model to check and revise its reasoning steps. Conversely, truncating overly long reasoning keeps inference time under control.   <br> ● 
Markedly beats OpenAI's o1 — the resulting model (s1-32B, a fine-tune of Qwen2.5-32B-Instruct) exceeds OpenAI's o1-preview by up to 27% in accuracy on competition-level math (MATH and AIME24). Notably, with test-time scaling the model's AIME24 accuracy climbs from 50% to 57%, surpassing its own ordinary ceiling. | [Paper](http://arxiv.org/abs/2501.19393), [Tweet](http://twitter.com/omarsar0/status/1886428631041225030), [Code & Data](https://github.com/simplescaling/s1) |
| 2) OmniHuman-1: Scaling Single-Stage Human Animation  ByteDance's AI lab releases OmniHuman-1, a diffusion-Transformer model that generates highly realistic human videos from a single image plus a motion input (audio or video). Highlights:   <br> ● End-to-end human video generation — OmniHuman takes a single image of any aspect ratio (from face-only to full body) plus an audio clip or driving video, and produces a lifelike video of that person talking, singing, or performing actions, with natural motion, lighting, and texture detail.   <br> ● Joint multimodal training — a key innovation is "omni-conditions training": mixing multiple motion modalities (audio-driven, video-driven, pose, etc.) during training. This greatly scales the training data and overcomes the scarcity of high-quality talking-head footage. The model learns to handle diverse inputs (speech, song, instruments) and complex poses.   <br> ● Outperforms prior methods — compared with earlier single-stage models (e.g., audio-driven talking-head models), OmniHuman's videos are more realistic and its input support more flexible. It can even take cartoon or animal figures as input and transfer motion naturally across styles.   <br> ● Broad applicability — the method supports any portrait content (close-up face, half body, full body) and can handle multiple driving signals simultaneously, a first for end-to-end human animation models. | [Paper](http://arxiv.org/abs/2502.01061), [Tweet](http://twitter.com/unseenvie/status/1886672598576325011), [Demo](https://omnihuman-lab.github.io/) |
| 3) LIMO: Less Is More for Reasoning  Can a handful of examples teach LLMs complex mathematical reasoning? The new LIMO paper challenges the assumption that hard reasoning tasks require massive fine-tuning datasets. Key findings:   <br> ● Surprisingly few examples — with only 817 carefully curated training samples, the LIMO model reaches 57.1% accuracy on the AIME math competition and 94.8% on MATH, far beyond prior SFT-based models (just 6.5% and 59.2%, respectively) while using only 1% of the data those older approaches required.   <br> ● Generalization from less data? — LIMO shows strong out-of-distribution generalization: a 40.5-point average accuracy gain across 10 diverse benchmarks, even beating models fine-tuned on 100x more data. This challenges the notions that complex skills demand more data and that fine-tuning only yields rote memorization.   <br> ● The "Less-Is-More" hypothesis — the authors argue that if pretraining has already endowed an LLM with rich knowledge, a small set of well-designed examples (which they call "cognitive templates") suffices to unlock advanced reasoning. The model doesn't need thousands of repetitive problems, but demonstrations of how to use the knowledge it already has.   <br> ● Open-source toolkit — the full LIMO training toolkit is released to the community to support further research on data-efficient reasoning. The work hints that small, high-quality datasets can deliver state-of-the-art reasoning, lowering the barrier to fine-tuning capable LLMs. | [Paper](http://arxiv.org/abs/2502.03387), [Tweet](http://twitter.com/omarsar0/status/1887514592747937984), [Code](https://github.com/GAIR-NLP/LIMO) |
| 
4) CoAT: Chain-of-Associated-Thoughts for LLMs  This work proposes CoAT, a new reasoning framework that makes LLM reasoning more human-like by exploring and updating thought paths. Main components:   <br> ● MCTS plus associative memory — CoAT combines Monte Carlo Tree Search (MCTS) with an associative memory mechanism. MCTS lets the model systematically explore different reasoning branches (candidate solutions), while associative memory dynamically injects relevant information into the context as needed (mimicking how humans recall facts mid-thought).   <br> ● Iteratively self-improving reasoning — the framework keeps expanding the solution space and revisits or refines earlier intermediate conclusions. While evaluating branches it can pull in new clues or self-correct, making the final answer more accurate and complete. This contrasts sharply with standard one-shot LLM inference, which struggles to backtrack or fetch new information on the fly.   <br> ● Gains in accuracy and diversity — across experiments on generation and reasoning tasks, CoAT beats conventional single-pass inference on accuracy, coherence of reasoning steps, and solution diversity. By iteratively widening the search while retaining relevant context, CoAT outperforms mere "fast thinking."   <br> ● Closer to human cognition — CoAT is inspired by how people solve problems: we weigh alternatives, recall relevant facts, and keep refining our line of thought. It offers a path for future LLM agents to reason more reliably using search algorithms and memory. | [Paper](http://arxiv.org/abs/2502.02390), [Tweet](http://twitter.com/omarsar0/status/1887187689247752370) |
| 5) Syntriever: Training Retrievers with LLM-Generated Data  How do you build a high-quality text retriever without large labeled datasets or access to an LLM's internals? Syntriever proposes a two-stage framework that distills a black-box LLM's knowledge into a retrieval model via synthetic data:   <br> ● Stage 1 — distillation with synthetic QA: given a query, a strong LLM (e.g., GPT-4) is prompted to generate relevant passages (correct answers) as well as plausible-but-wrong passages, using chain-of-thought to ensure diversity. The LLM then self-verifies the generated passages to filter out hallucinated or low-quality content. The result is a synthetic query dataset with positive and negative passages; the retriever is trained on it with a loss that pulls relevant-passage embeddings closer and pushes irrelevant ones apart.   <br> ● Stage 2 — alignment with LLM preferences: the retriever is further tuned to favor results the LLM prefers. Using a partial Plackett-Luce ranking objective, it learns to order passages the way the LLM judges them, with a regularizer preventing drift from the Stage 1 model. This fine-tunes the retriever to mimic the black-box LLM's preferences.   <br> ● State-of-the-art performance — Syntriever sets new SOTA on retrieval benchmarks across multiple domains, notably without using a single real training query: all data is LLM-generated.   <br> ● No logits required — prior LLM-to-retriever distillation typically needed model logits or probability distributions (which closed APIs rarely expose). Syntriever relies only on generated text and LLM judgments, so it works even with closed LLMs. | [Paper](http://arxiv.org/abs/2502.03824), [Tweet](https://x.com/omarsar0/status/1887878242276954557), [Code](https://github.com/kmswin1/Syntriever) |
| 6) Demystifying Long Chain-of-Thought Reasoning in LLMs  This study examines how LLMs develop long chain-of-thought reasoning, focusing on reinforcement learning and compute scaling. Key insights:   <br> ● SFT helps — while not strictly necessary, supervised fine-tuning simplifies training and improves efficiency; models fine-tuned on long-CoT data generally reach higher accuracy than those trained only on short CoT sequences.   <br> ● Reward shaping is crucial for stable RL — the study finds that naive RL does not reliably lengthen chains of thought. The authors introduce a cosine length-scaling reward with a repetition penalty that balances reasoning depth while preventing meaningless length inflation.   <br> ● 
Scaling verifiable reward signals — RL models trained on noisy, web-extracted "silver" supervision can actually generalize better on out-of-distribution tasks (such as STEM reasoning). Filtering such data is key to keeping training stable.   <br> ● Reasoning abilities latent in base models — capabilities like error correction and backtracking already exist in base models, but carefully designed RL rewards are needed to elicit them effectively on complex tasks. The paper offers a structured roadmap for researchers optimizing long-CoT reasoning strategies, emphasizing how RL and reward tuning shape reasoning depth. | [Paper](https://arxiv.org/abs/2502.03373), [Tweet](https://x.com/xiangyue96/status/1887332772198371514) |
| 7) Rethinking Mixture-of-Agents: Ensembling One Strong Model  Combining multiple models (Mixture-of-Agents, MoA) is a common way to boost performance. This paper asks whether blending different LLMs actually helps, or whether ensembling outputs of a single top model is wiser. The surprising answer: "Self-MoA" (single-model ensembling) often beats multi-model mixing. Key points:   <br> ● Self-MoA vs. MoA — the authors propose Self-MoA: sample multiple outputs from the single best model and aggregate them (e.g., by majority voting or ranking) rather than mixing outputs from different models. This adds diversity through repeated attempts without dragging in weaker models.   <br> ● Better performance — extensive tests show Self-MoA beating multi-LLM mixing in many settings. For example, on the AlpacaEval 2.0 benchmark, Self-MoA with one strong model scores 6.6 points above mixed-model MoA, and it averages 3.8 points higher across MMLU, CRUX, and MATH. Applying Self-MoA to a top-performing AlpacaEval model even set a new leaderboard record.   <br> ● Why it works — mixing models can drag down overall quality, since the system is limited by its weakest member. The study finds MoA highly sensitive to each member's quality: adding one weaker model noticeably hurts results. Unless all members are strong and complementary, sticking with a single model's outputs is the sensible choice. The authors do note a few exceptional cases where diverse models help, but they are just that — exceptions.   <br> ● Sequential aggregation — they also present a sequential Self-MoA that merges many outputs over multiple rounds instead of all at once. It matches one-shot aggregation in quality while efficiently scaling the ensemble to large numbers of outputs. | [Paper](http://arxiv.org/abs/2502.00674), [Tweet](http://twitter.com/omarsar0/status/1886792384954163347) |
| 8) MaAS: Multi-agent Architecture Search (Agentic Supernet)  Multi-agent LLM systems (agents dividing work across roles or tools) are powerful but usually require hand-designed pipelines. MaAS (Multi-agent Architecture Search) instead learns a universal "agentic supernet" that dynamically instantiates the best agent team for each query, automating agent-workflow design across tasks:   <br> ● Agentic supernet — the authors define a continuous space of agent architectures (LLM call chains, tool use, etc.). Rather than picking one static architecture, they train a supernet covering many configurations; each query triggers a sub-network tailored to its domain and difficulty.   <br> ● Dynamic resource allocation — because the system adapts per query, it allocates resources efficiently: easy questions may use a single fast, simple agent chain, while hard ones mobilize a more elaborate reasoning team, avoiding the high cost of a one-size-fits-all pipeline.   <br> ● Large cost savings — across six benchmarks, MaAS incurs only 6% to 45% of the inference cost of existing multi-agent pipelines while achieving roughly 0.5% to 11.8% higher accuracy, matching or beating them at much lower cost by tailoring agent configurations to the task.   <br> ● Robust and transferable — the agentic-supernet approach generalizes well: architectures that work on one task transfer to new domains and even different base LLMs, outperforming static designs throughout. This suggests the method learns general principles of orchestrating LLM agents optimally. | [Paper](http://arxiv.org/abs/2502.04180), 
[Tweet](http://twitter.com/omarsar0/status/1887884027530727876) |
| 9) Advancing Reasoning in LLMs  This survey offers a timely overview of emerging approaches to improving LLM reasoning, organizing the literature into major categories:   <br> ● Prompting strategies — techniques that steer models to reason via clever prompting, such as chain-of-thought (having the model generate step-by-step answers), self-consistency (sampling multiple reasoning paths and choosing the best), and tree-of-thoughts. These improve logical inference and multi-step solving without changing the model architecture.   <br> ● Architectural innovations — modifications to the model or its context to better support reasoning. This includes retrieval-augmented models (LLMs that fetch external facts), modular reasoning networks (decomposing problems into subtasks handled by separate modules or experts), and neuro-symbolic integration (coupling neural networks with symbolic logic or tools). These changes give LLMs more knowledge or a more structured reasoning process.   <br> ● Learning paradigms — new training recipes for cultivating reasoning: fine-tuning on reasoning-specific datasets (e.g., math word problems), RL methods that reward correct reasoning chains, and self-supervised objectives (e.g., predicting masked proof steps). These strengthen a model's innate reasoning beyond generic pretraining.   <br> ● Evaluation and challenges — the survey also reviews how we assess LLM reasoning (benchmarks for logic, math, commonsense, etc.) and flags open problems, chiefly hallucination (fabricating illogical or false intermediate steps), brittleness to small perturbations (robustness), and how well reasoning methods generalize across tasks and domains. Solving these will be key to the next generation of reasoning-enhanced LLMs. | [Paper](http://arxiv.org/abs/2502.03671), [Tweet](http://twitter.com/omarsar0/status/1887875470269849659) |
| 10) Survey: Text Data Augmentation for LLMs  As LLMs' appetite for training data grows, augmenting datasets with synthetic or transformed text becomes essential. Main contents:   <br> ● A taxonomy of augmentation — the paper divides augmentation methods into four classes: (1) simple augmentation — basic operations like synonym replacement and text cropping; (2) prompt-based augmentation — using LLMs with tailored prompts to generate new training samples (exploiting their generative power); (3) retrieval-based augmentation — pulling relevant information from external knowledge bases or databases to ground generated text in facts; (4) hybrid augmentation — combinations of the above or multi-step strategies.   <br> ● LLMs as data generators — a key insight is that modern LLMs can produce high-quality synthetic data to improve themselves. With careful prompting, having an LLM generate variants of a task (e.g., asking ChatGPT to write new math word problems) can greatly expand a training set. The survey discusses prompt design for this purpose and how to keep the generated data both diverse and useful.   <br> ● Post-processing and filtering — augmented data is not always clean. The survey covers techniques for refining and filtering generated data, such as fact-checking with a second model or removing samples likely to introduce errors — a step vital to avoiding "garbage in, garbage out" during augmentation.   <br> ● Evaluation and future directions — the paper outlines common use cases for augmentation (low-resource translation, QA, etc.) and how to measure its effectiveness (gains in accuracy, robustness, and so on). It closes with challenges (ensuring augmentation doesn't distort the data distribution or reinforce model bias) and directions for future research. | [Paper](http://arxiv.org/abs/2501.18845), [Tweet](http://twitter.com/omarsar0/status/1886428687350006067) |

## Top ML Papers of the Week (January 27 - February 2) - 2025
| **Paper**  | **Links** | 
| ------------- | ------------- | 
| 1) o3-mini  
OpenAI released o3-mini, its latest cost-efficient reasoning model, now available in ChatGPT and the API. The model excels at STEM tasks, particularly science, math, and coding, while keeping the low cost of its predecessor o1-mini with even lower latency. It introduces key developer features — function calling, structured outputs, and developer messages — making it production-ready from launch. o3-mini offers three reasoning-effort levels (low, medium, high) and improves performance across a broad range of tasks. It responds 24% faster than o1-mini and posts notable results on competition math, PhD-level science questions, and software-engineering tasks. | [System Card](https://cdn.openai.com/o3-mini-system-card.pdf), [Blog](https://openai.com/index/openai-o3-mini/), [Tweet](https://x.com/OpenAI/status/1885406586136383634) |
| 2) Qwen2.5-1M  The Qwen team released two open-source LLMs, Qwen2.5-7B-Instruct-1M and Qwen2.5-14B-Instruct-1M, that can handle context lengths of up to one million tokens. The models use progressive training, starting at 4K tokens and stepping up to 256K, then applying length extrapolation to reach 1M tokens. The team also released a vLLM-based inference framework whose sparse attention makes long-input processing 3-7x faster. The models perform well on both long-context and short-text tasks: the 14B version beats GPT-4o-mini on several long-context datasets while matching it on shorter tasks. | [Paper](https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2.5-1M/Qwen2_5_1M_Technical_Report.pdf), [Models](https://huggingface.co/Qwen), [Qwen Chat](https://chat.qwenlm.ai/), [Tweet](https://x.com/omarsar0/status/1883905564004241789) |
| 3) Janus-Pro  An enhanced version of the earlier Janus model for multimodal understanding and generation. It brings three key improvements: an optimized training strategy with longer initial training and targeted fine-tuning; expanded training data, adding 90 million understanding samples and 72 million synthetic aesthetic samples for generation; and model scaling up to 7B parameters. Janus-Pro delivers significant gains in both multimodal understanding and text-to-image generation, outperforming existing approaches on several benchmarks with a score of 79.2 on MMBench (understanding) and 80% accuracy on GenEval (text-to-image). The improvements also stabilize and sharpen image generation, especially for short prompts and fine detail, though the current 384x384 resolution remains limiting for some tasks. | [Paper](https://github.com/deepseek-ai/Janus/blob/main/janus_pro_tech_report.pdf), [Model](https://huggingface.co/deepseek-ai/Janus-Pro-7B), [Tweet](https://x.com/giffmana/status/1884011657191637126) |
| 4) On the "Underthinking" of o1-like LLMs  This study digs deeper into the "thinking" patterns of o1-like LLMs. Several recent papers have flagged the overthinking problem; now a new phenomenon emerges — underthinking! What's going on? The authors find that o1-like LLMs frequently switch between reasoning threads without fully exploring promising paths to reach a correct answer. | [Paper](https://arxiv.org/abs/2501.18585), [Tweet](https://x.com/omarsar0/status/1885349576456233177) |
| 5) Diverse Preference Optimization  
Proposes a new training method, Diverse Preference Optimization (DivPO), that tackles the lack of output diversity in language models while preserving response quality. Current preference-optimization techniques such as RLHF tend to sharpen the output distribution, making models generate very similar responses — especially problematic for creative tasks that need varied outputs. DivPO changes how training pairs are chosen during preference optimization: instead of simply pairing the highest- and lowest-reward responses, it selects the most diverse response that meets a quality threshold and contrasts it with the least diverse response below the threshold. Diversity can be measured in several ways, including model probability, word frequency, or an LLM judge. Experiments on persona generation and creative writing show DivPO yields up to 45.6% more diverse outputs on structured tasks and an 81% diversity gain in story writing, while maintaining quality comparable to baseline methods. | [Paper](https://arxiv.org/abs/2501.18101), [Tweet](https://x.com/jaseweston/status/1885399530419450257) |
| 6) Guidelines for Prompting DeepSeek-R1  This paper offers recommendations for prompting the DeepSeek-R1 model. The main guidelines: <br><br> 1. Prompt engineering: <br>  ● Use clear, well-structured prompts with explicit instructions <br> ● Avoid few-shot prompting; prefer zero-shot <br><br> 2. Output formatting: <br>  ● Specify the desired format (JSON, tables, Markdown) <br> ● For reasoning tasks, request step-by-step explanations <br><br> 3. Language: <br>  ● Explicitly specify the input/output language to prevent mixing <br><br> The paper also summarizes when to use the different model variants, when to fine-tune, and other safety considerations. | [Paper](https://arxiv.org/abs/2501.17030), [Tweet](https://x.com/omarsar0/status/1884624296368292083) |
| 7) Docling  [Docling](https://arxiv.org/abs/2501.17887) is an open-source toolkit that parses several popular document formats into a unified, richly structured representation. | [Paper](https://arxiv.org/abs/2501.17887) |
| 8) Improving RAG with Multi-Agent RL  This work frames RAG as a cooperative multi-agent task to improve answer generation. It models RAG components — query rewriting, document selection, and answer generation — as RL agents collaborating toward accurate answers, using Multi-Agent Proximal Policy Optimization (MAPPO) to jointly optimize all agents with a shared reward based on answer quality. Beyond gains on standard benchmarks, the framework generalizes strongly to out-of-domain scenarios and remains effective across different RAG system configurations. | [Paper](https://arxiv.org/abs/2501.15228), [Tweet](https://x.com/omarsar0/status/1884249075467575362) |
| 9) TensorLLM  Proposes a framework that compresses multi-head attention (MHA) through a multi-head tensorization process and Tucker decomposition, achieving up to ~250x compression of MHA weights with no additional data, training, or fine-tuning. | [Paper](https://arxiv.org/abs/2501.15674), [Tweet](https://x.com/omarsar0/status/1884246306224496729) |
| 10) TokenVerse  
Proposes a new method for generating images of learned concepts in desired configurations. Developed by Google DeepMind and collaborators, TokenVerse uses a pre-trained text-to-image diffusion model to disentangle and extract complex visual concepts from multiple images, enabling multi-concept personalization. Operating in the modulation space of a DiT, it learns a personalized modulation vector for each text token in the input caption, giving users flexible, localized control over distinct concepts such as objects, materials, lighting, and pose. Learned token modulations can be combined in novel ways to generate new images blending multiple personalized concepts, with no extra segmentation masks required. | [Paper](https://arxiv.org/abs/2501.12224), [Tweet](https://x.com/omarsar0/status/1884618510275592610) |

## Top ML Papers of the Week (January 20 - January 26) - 2025
| **Paper**  | **Links** | 
| ------------- | ------------- | 
| 1) DeepSeek-R1  DeepSeek introduces DeepSeek-R1, advancing reasoning capability through reinforcement learning (RL). It comes in two key versions: DeepSeek-R1-Zero, trained with pure RL and no supervised fine-tuning, and DeepSeek-R1, which combines RL with cold-start data. DeepSeek-R1-Zero shows that RL alone can develop sophisticated reasoning: it reaches a 71.0% pass rate on AIME 2024, comparable to OpenAI-o1-0912, and naturally evolves complex behaviors such as self-verification and reflection during training. It suffers, however, from readability issues and language mixing. To address these limitations, DeepSeek-R1 uses a multi-stage recipe: first fine-tuning on high-quality chain-of-thought examples, then reasoning-focused RL training, then rejection sampling to collect new training data, and finally RL optimization across all scenarios. This brings DeepSeek-R1 to parity with OpenAI-o1-1217, scoring 79.8% on AIME 2024 and 97.3% on MATH-500 while keeping outputs readable and consistent. DeepSeek also successfully distills DeepSeek-R1's abilities into smaller models: the 7B model beats larger competitors, and the 32B model approaches OpenAI-o1-mini — suggesting that distilling reasoning patterns from a large model is more effective than training small models directly with RL. | [Paper](https://github.com/deepseek-ai/DeepSeek-R1/blob/main/DeepSeek_R1.pdf), [Tweet](https://x.com/deepseek_ai/status/1881318130334814301), [Code](https://huggingface.co/deepseek-ai), [App](https://chat.deepseek.com/) |
| 2) Humanity's Last Exam  "Humanity's Last Exam" is a new multimodal benchmark designed to probe the limits of large language models. The dataset contains 3,000 extremely difficult questions spanning more than 100 subjects, contributed by nearly 1,000 experts from over 500 institutions worldwide. Frontier AI models perform poorly on it, topping out at DeepSeek-R1's 9.4%, showing vast room for improvement. The benchmark aims to be the last closed-ended academic test of its kind, since existing benchmarks like MMLU have become too easy, with model accuracy above 90%. Although performance is expected to climb quickly — possibly past 50% by late 2025 — its creators stress that even a high score would demonstrate expert knowledge, not general intelligence or research ability. | [Paper](https://static.scale.com/uploads/654197dc94d34f66c0f5184e/Publication%20Ready%20Humanity%27s%20Last%20Exam.pdf), [Tweet](https://x.com/DanHendrycks/status/1882433928407241155), [Dataset](https://huggingface.co/datasets/cais/hle) |
| 3) 
Scaling RL with LLMs  Kimi introduces k1.5, a multimodal LLM trained with reinforcement learning that achieves state-of-the-art results across reasoning tasks. The model supports long-context processing up to 128K tokens and uses improved policy-optimization methods, establishing a simplified yet effective RL framework that needs no complex techniques like Monte Carlo tree search or value functions. Notably, k1.5 matches OpenAI's o1 on several benchmarks, scoring 77.5 on AIME and 96.2 on MATH 500. It also introduces effective long2short methods that use long-CoT techniques to improve short-CoT models, achieving excellent results in resource-constrained settings. With these techniques, k1.5's short-CoT version substantially outperforms existing models such as GPT-4o and Claude Sonnet 3.5 while keeping the efficiency of short responses. | [Paper](https://github.com/MoonshotAI/Kimi-k1.5/blob/main/Kimi_k1.5.pdf), [Tweet](https://x.com/omarsar0/status/1881749719212552280), [GitHub](https://github.com/MoonshotAI/Kimi-k1.5) |
| 4) Chain of Agents  A new framework for handling long-context tasks through multiple collaborating LLM agents. CoA splits the text into chunks, assigns worker agents to process each part in sequence while passing information between them, and has a manager agent produce the final output. This sidesteps the limitations of traditional approaches such as input truncation or window extension. Across several datasets, CoA beats existing methods by up to 10% on tasks like QA and summarization, and is especially effective on very long inputs — improving over baselines by as much as 100% on texts exceeding 400K tokens. | [Paper](https://openreview.net/pdf?id=LuCLf4BJsr), [Tweet](https://x.com/omarsar0/status/1882824941101629829) |
| 5) Can LLMs Plan?  Proposes an improvement to Algorithm-of-Thoughts (AoT+) that achieves state-of-the-art results on planning benchmarks, even surpassing human baselines! AoT+ periodically provides state summaries to reduce cognitive load, letting the system focus on the planning process itself rather than laboring to maintain the problem state. | [Paper](https://arxiv.org/abs/2501.13545), [Tweet](https://x.com/omarsar0/status/1882799782579855518) |
| 6) Hallucinations Can Improve LLMs in Drug Discovery  Claims that LLMs achieve better results on drug-discovery tasks when prompts include textual hallucinations, compared with hallucination-free prompts. Llama-3.1-8B gains 18.35% in ROC-AUC over the no-hallucination baseline, and hallucinations generated by GPT-4o yield the most consistent improvements across models. | [Paper](https://arxiv.org/abs/2501.13824), [Tweet](https://x.com/omarsar0/status/1882789456522145802) |
| 7) Trading Inference-Time Compute for Adversarial Robustness  Presents preliminary evidence that giving reasoning models such as o1-preview and o1-mini more "thinking" time at inference improves their resistance to adversarial attacks. Experiments span tasks from basic math problems to image classification, showing that increased inference-time compute often drives attack success rates toward zero. The approach is not universally effective — notably on some StrongREJECT benchmarks — and controlling how models use compute remains challenging, but the results point to a promising path for improving AI safety without relying on traditional adversarial training. | [Paper](https://cdn.openai.com/papers/trading-inference-time-compute-for-adversarial-robustness-20250121_1.pdf), [Tweet](https://x.com/OpenAI/status/1882129444212740482) |
| 8) 
IntellAgent  Introduces a new open-source framework for evaluating conversational AI systems through automated, policy-driven testing. The system uses graph modeling and synthetic benchmarks to simulate realistic agent interactions at varying complexity, enabling detailed performance analysis and policy-compliance testing. IntellAgent helps identify performance gaps in conversational AI systems, and its modular design makes it easy to integrate new domains and APIs — a valuable tool for both research and real-world deployment. | [Paper](https://arxiv.org/abs/2501.11067), [Tweet](https://x.com/omarsar0/status/1882081603754643779), [GitHub](https://github.com/plurai-ai/intellagent) |
| 9) LLMs and Behavioral Self-Awareness  Research shows that when LLMs are fine-tuned on behaviors such as emitting insecure code, they exhibit behavioral self-awareness. In other words, a model fine-tuned to output insecure code will spontaneously produce statements like "the code I write is insecure," without ever being explicitly trained to say so. The study also finds that models can sometimes recognize whether they contain a backdoor, even when the trigger is absent — though by default they cannot directly output the trigger itself. Behavioral self-awareness is not new in LLMs, but this study reveals it is far more general than previously understood, implying potential for LLMs to encode and enforce policies more reliably. | [Paper](https://arxiv.org/abs/2501.11120), [Tweet](https://x.com/omarsar0/status/1882079780918747303) |
| 10) Overview of Agentic RAG  A comprehensive introduction to LLM agents and agentic RAG, covering agentic RAG architectures, applications, and implementation strategies. | [Paper](https://arxiv.org/abs/2501.09136), [Tweet](https://x.com/omarsar0/status/1881360794019156362) |

## Top ML Papers of the Week (January 13 - January 19) - 2025
| **Paper**  | **Links** | 
| ------------- | ------------- | 
| 1) Self-Adaptive LLMs - Proposes Transformer^2, a novel self-adaptation framework that adapts LLMs to unseen tasks in real time by selectively adjusting individual components of their weight matrices; it works in two key stages: 1) a dispatch system that analyzes and identifies the properties of the incoming task, and 2) a step that combines "expert" vectors (trained via reinforcement learning) to produce task-specific behavior; claims greater efficiency and fewer parameters than LoRA, and works across different LLM architectures. | [Paper](https://arxiv.org/abs/2501.06252), [Tweet](https://x.com/hardmaru/status/1879331049383334187) |
| 2) MiniMax-01 - Introduces a new family of models integrating mixture-of-experts; one model has 32 experts and 456B parameters, activating only 45.9B per token; claims performance on par with state-of-the-art models such as GPT-4o and Claude-3.5-Sonnet while offering a 20-32x longer context window, handling up to 4 million tokens; the model pairs linear attention with optimized hardware utilization for LLM efficiency and scalability; a vision model, MiniMax-VL-01, is built through continued training with 512B multimodal vision-language tokens. | [Paper](https://arxiv.org/abs/2501.08313), [Tweet](https://x.com/omarsar0/status/1879572512075587872) |
| 3) VideoRAG - 
A framework that enhances RAG by leveraging video content as an external knowledge source; unlike existing RAG approaches focused mainly on text or images, VideoRAG dynamically retrieves relevant videos for a query and integrates their visual and textual elements into generation; the framework uses large video language models (LVLMs) to process video content directly, better capturing temporal dynamics, spatial detail, and multimodal cues that static modalities often miss; for videos lacking textual descriptions, it proposes generating transcripts with automatic speech recognition so that both visual and textual modalities are used. | [Paper](https://arxiv.org/abs/2501.05874), [Tweet](https://x.com/omarsar0/status/1878827350315659421) |
| 4) Learning to Memorize at Test Time - Introduces a neural long-term memory module that memorizes historical context and helps attention focus on the current context while drawing on long-past information; the neural memory serves as more persistent long-term memory than attention alone (which is argued to act more like short-term memory); the resulting memory-based Titans models perform well on language modeling, commonsense reasoning, genomics, and time-series tasks. | [Paper](https://arxiv.org/abs/2501.00663), [Tweet](https://x.com/omarsar0/status/1879896681010921742) |
| 5) Foundations of LLMs - A new survey on the foundations of LLMs, covering pre-training, prompting, and alignment methods. | [Paper](https://arxiv.org/abs/2501.09223), [Tweet](https://x.com/omarsar0/status/1880284477445767586) |
| 6) OmniThink - A new framework that emulates a human-like process of iterative expansion and reflection, simulating the cognitive behavior of learners as they deepen their knowledge; compared with RAG and role-playing, OmniThink pushes knowledge boundaries through continuous reflection and exploration, making it well suited to applications requiring long-form generation. | [Paper](https://arxiv.org/abs/2501.09751), [Tweet](https://x.com/omarsar0/status/1880275861401923619) |
| 7) Enhancing RAG - Systematically explores the factors and methods that improve RAG systems, including retrieval strategies, query expansion, contrastive in-context learning, prompt design, and chunking. | [Paper](https://arxiv.org/abs/2501.07391), [Tweet](https://x.com/omarsar0/status/1879178916021318029) |
| 8) AutoCBT - Proposes AutoCBT, a multi-agent framework for cognitive behavioral therapy; the work presents a general-purpose multi-agent framework that generates high-quality responses in single-turn counseling scenarios; it incorporates dynamic routing, memory, and supervision mechanisms to boost each agent's autonomy; experiments show AutoCBT delivers higher-quality automated counseling, significantly improving dialogue quality over purely prompt-based counseling frameworks. | [Paper](https://arxiv.org/abs/2501.09426), [Tweet](https://x.com/omarsar0/status/1880283025595867631) |
| 9) Imagine while Reasoning in Space - Proposes MVoT (Multimodal Visualization-of-Thought), a new reasoning framework that lets AI models "think" in both text and images; MVoT extends traditional chain-of-thought prompting by allowing the model to generate visual representations of its reasoning steps alongside textual explanations; the framework is applied to the multimodal language model Chameleon-7B with a "token discrepancy loss" introduced to improve visualization quality; MVoT shines in complex scenarios such as maze and printer-installation tasks, exceeding 90% accuracy. | [Paper](https://arxiv.org/abs/2501.07542), [Tweet](https://x.com/omarsar0/status/1879181711982129420) |
| 10) 
ChemAgent - Proposes a new framework that improves LLM performance on chemical reasoning via a dynamic, self-updating knowledge library; the library is built by decomposing chemical tasks into sub-tasks and organizing them into structured collections for future reference; given a new problem, the system retrieves and refines relevant information from the library for more effective task decomposition; the library updates dynamically as new sub-tasks and their solutions emerge and are validated; experiments on SciBench show ChemAgent improving performance by up to 46% (relative to GPT-4), significantly outperforming existing methods. | [Paper](https://arxiv.org/abs/2501.06590), [Tweet](https://x.com/omarsar0/status/1879188983705747754) |

## Top ML Papers of the Week (January 6 - January 12) - 2025
| **Paper**  | **Links** | 
| ------------- | ------------- | 
| 1) Cache-Augmented Generation (CAG) - An approach that exploits the capabilities of long-context LLMs by preloading all relevant documents into the model and precomputing the key-value (KV) cache; the preloaded context lets the model give contextually accurate answers at runtime with no extra retrieval. The authors argue CAG is a useful alternative to RAG when the documents or knowledge to retrieve are limited and manageable in size. | [Paper](https://arxiv.org/pdf/2412.15605), [Tweet](https://x.com/omarsar0/status/1876721221083214200) |
| 2) Agent Laboratory - An approach using LLM agents capable of completing the entire research pipeline; key findings: 1) agents driven by o1-preview produce the best research outcomes, 2) generated ML code matches the performance of existing methods, 3) human feedback further improves research quality, and 4) Agent Laboratory dramatically reduces research costs. | [Paper](https://arxiv.org/abs/2501.04227), [Tweet](https://x.com/omarsar0/status/1877382581358047375) |
| 3) Long-Context LLMs vs. RAG - A comprehensive evaluation of long-context (LC) LLMs against RAG systems; three main findings: 1) LC generally beats RAG on QA benchmarks, 2) summarization-based retrieval performs comparably to LC while chunk-based retrieval lags well behind, and 3) RAG has the edge on conversational and general-knowledge queries. | [Paper](https://arxiv.org/abs/2501.01880), [Tweet](https://x.com/omarsar0/status/1876281074147299569) |
| 4) Search-o1 - A framework that combines large reasoning models (LRMs) with agentic search and document-refinement capabilities to address knowledge insufficiency; it enables autonomous knowledge retrieval during reasoning and performs strongly on complex tasks, surpassing baseline models and human experts. | [Paper](https://arxiv.org/abs/2501.05366), [Tweet](https://x.com/omarsar0/status/1877742469213004015) |
| 5) Towards System 2 Reasoning - Proposes Meta Chain-of-Thought (Meta-CoT), which extends traditional chain-of-thought (CoT) by modeling the underlying reasoning required to arrive at a particular CoT; the core view is that traditional CoT is too simplistic, and Meta-CoT better matches the cognitive processes required for advanced problem solving. | [Paper](https://arxiv.org/abs/2501.04682), [Tweet](https://x.com/rm_rafailov/status/1877446475271037314) |
| 6) 
rStar-Math - A new method with three core components for stronger math reasoning: 1) code-augmented CoT data synthesis with MCTS to generate step-by-step verified reasoning trajectories for training the policy SLM; 2) an SLM-based process reward model that reliably predicts a reward label for each math reasoning step; and 3) a self-evolution recipe that iteratively refines the policy SLM and PPM. On the MATH benchmark, rStar-Math lifts Qwen2.5-Math-7B from 58.8% to 90.0% and Phi3-mini-3.8B from 41.4% to 86.4%, beating o1-preview by +4.5% and +0.9%, respectively. | [Paper](https://arxiv.org/abs/2501.04519), [Tweet](https://x.com/omarsar0/status/1877378301293142050) |
| 7) Cosmos World Foundation Models - A framework for training physical AI systems in digital environments before real-world deployment; the platform provides pre-trained world foundation models that act as digital twins of the physical world, letting AI systems learn and interact safely without risking physical hardware. The models can be fine-tuned for specific applications such as camera control, robotic manipulation, and autonomous driving. | [Paper](https://research.nvidia.com/publication/2025-01_cosmos-world-foundation-model-platform-physical-ai), [Tweet](https://x.com/EthanHe_42/status/1876487556755521798) |
| 8) Process Reinforcement through Implicit Rewards - A framework for online RL that uses process rewards to improve language-model reasoning; the proposed algorithm combines online prompt filtering, RLOO return/advantage estimation, a PPO loss, and online updates of an implicit process reward model. Their model, Eurus-2-7B-PRIME, reaches 26.7% pass@1 on AIME 2024, beating GPT-4 and other models while using only one-tenth the training data of comparable models. | [Paper](https://curvy-check-498.notion.site/Process-Reinforcement-through-Implicit-Rewards-15f4fcb9c42180f1b498cc9b2eaf896f), [Tweet](https://x.com/lifan__yuan/status/1874867809983033649) |
| 9) Can LLMs Design Good Questions? - Systematically evaluates the quality of LLM-generated questions; key findings: 1) both LLaMA and GPT models favor questions about specific facts and figures, 2) generated questions run around 20 words, with length preferences varying across LLMs, 3) LLM-generated questions tend to require longer answers, and 4) human-written questions concentrate on the beginning of the context, whereas LLM-generated ones are distributed more evenly, with slightly less attention at both ends. | [Paper](https://arxiv.org/abs/2501.03491), [Tweet](https://x.com/omarsar0/status/1877008618207560049) |
| 10) A Survey on LLMs - A new survey on LLMs with several insights into their capabilities and limitations. | [Paper](https://arxiv.org/abs/2501.04040), [Tweet](https://x.com/omarsar0/status/1877416049999802408) |

## Top ML Papers of the Week (December 30 - January 5) - 2025
| **Paper**  | **Links** | 
| ------------- | ------------- | 
| 1) Agents Are Not Enough - Argues that despite their great promise, AI agents alone cannot solve the challenges of autonomous task execution; proposes a new ecosystem of three key components: agents (narrow, goal-driven modules for specific tasks), sims (digital representations of user preferences and behavior), and assistants (programs coordinating among users, sims, and agents). | 
[Paper](https://www.arxiv.org/abs/2412.16241), [Tweet](https://x.com/omarsar0/status/1874196827115061741) |
| 2) OLMo 2 - Introduces an improved architecture, new training methods, and a specialized data mixture called Dolmino Mix 1124; the fully transparent models are released at 7B and 13B parameter scales with complete training data and code, matching or exceeding comparable open-weight models such as Llama 3.1 and Qwen 2.5 while using less compute, and the instruction-tuned versions (OLMo 2-Instruct) remain competitive with their counterparts. | [Paper](https://arxiv.org/abs/2501.00656), [Tweet](https://x.com/soldni/status/1875266934943649808) |
| 3) Machine-Assisted Proof - Explores how mathematicians have long used machines to aid mathematical research and discusses the new AI tools now transforming proof assistance. | [Paper](https://www.ams.org//notices/202501/rnoti-p6.pdf), [Tweet](https://x.com/omarsar0/status/1873045937259462656) |
| 4) Measuring Higher-Level Mathematical Reasoning - Proposes Putnam-AXIOM, a new math reasoning benchmark of 236 Putnam competition problems plus 52 variations; even the best-performing model, OpenAI's o1-preview, scores only 41.95% on the originals and performs markedly worse on the variations. | [Paper](https://openreview.net/forum?id=YXnwlZe0yf&noteId=yrsGpHd0Sf), [Tweet](https://x.com/omarsar0/status/1874489752243597635) |
| 5) On the Overthinking of LLMs - Proposes a self-training strategy to mitigate overthinking in o1-like LLMs; applied to QwQ-32B-Preview, it cuts generated tokens by 48.6% while maintaining accuracy on the widely used MATH500 test set. | [Paper](https://arxiv.org/abs/2412.21187), [Tweet](https://x.com/omarsar0/status/1874848885170176364) |
| 6) MEDEC - Introduces MEDEC, a publicly available benchmark for detecting and correcting medical errors in clinical notes, covering five error types (diagnosis, management, treatment, pharmacotherapy, and causal organism); the benchmark comprises 3,848 clinical texts, including 488 clinical notes from three US hospital systems; experiments show Claude 3.5 Sonnet is better at error detection while o1-preview excels at error correction. | [Paper](https://arxiv.org/abs/2412.19260), [Tweet](https://x.com/omarsar0/status/1875232390265577675) |
| 7) 1.58-bit FLUX - The first successful quantization of the state-of-the-art text-to-image model FLUX.1-dev using 1.58-bit weights (values in {-1, 0, +1}); the method relies on self-supervision from the FLUX.1-dev model itself and maintains performance comparable to the original FLUX when generating 1024x1024 images. | [Paper](https://arxiv.org/abs/2412.18653), [Tweet](https://x.com/_akhaliq/status/1873782702178263549) |
| 8) Aviary - An extensible open-source environment for building language agents that exceed the performance of zero-shot frontier LLMs, and even humans, on several challenging scientific tasks. | 
[Paper](https://arxiv.org/abs/2412.21154), [Tweet](https://x.com/omarsar0/status/1875270927304511535) |
| 9) Memory Layers at Scale - Demonstrates the effectiveness of memory layers at scale; shows that models equipped with such layers outperform traditional dense models on factual tasks while using half the compute; also provides a parallelizable memory-layer implementation that scales to 128B memory parameters and 1 trillion training tokens, tested against base models of up to 8B parameters. | [Paper](https://arxiv.org/abs/2412.09764), [Tweet](https://x.com/AIatMeta/status/1874897646542033030) |
| 10) HuatuoGPT-o1 - Proposes a new approach to improving the medical reasoning of language models using a medical verifier that validates model outputs and guides the development of complex reasoning skills; the two-stage system combines fine-tuning with verifier-reward-based RL, achieving performance beyond existing models with only 40,000 verifiable medical problems. | [Paper](https://arxiv.org/abs/2412.18925), [Tweet](https://x.com/_akhaliq/status/1873572891092283692) |

## Top ML Papers of the Week (December 23 - December 29) - 2024
| **Paper**  | **Links** | 
| ------------- | ------------- | 
| 1) **DeepSeek-V3** - A 671B-parameter MoE language model that activates 37B parameters per token; it uses MLA and the DeepSeekMoE architecture for efficient operation, introduces an auxiliary-loss-free load-balancing method, and applies multi-token prediction during training to boost performance. After pre-training on 14.8 trillion tokens followed by SFT and RL stages, the model matches leading closed-source models and outperforms other open-source alternatives. Full training required only 2.788M H800 GPU hours and was stable, with no irrecoverable loss spikes.  | [Paper](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/DeepSeek_V3.pdf), [Tweet](https://x.com/deepseek_ai/status/1872242657348710721) |
| 2)  **Large Concept Models** - Proposes an approach based on sentence-level semantic representations called "concepts," moving beyond the token-level processing of current LLMs. The model builds on SONAR sentence embeddings, which support 200 languages across text and speech modalities, and is trained for autoregressive sentence prediction with several methods, from MSE regression to diffusion-based generation. Experiments with 1.6B and 7B parameter variants trained on 1.3T and 7.7T tokens show strong performance on generative tasks such as summarization and summary expansion.  | [Paper](https://ai.meta.com/research/publications/large-concept-models-language-modeling-in-a-sentence-representation-space), [Tweet](https://x.com/AIatMeta/status/1871263650935365759) |
| 3) **ModernBERT** - A new encoder-only Transformer architecture achieving state-of-the-art results on classification and retrieval tasks while being more efficient than previous encoders. Trained on 2 trillion tokens with a sequence length of 8192, it incorporates many modern optimizations for major improvements over BERT, and is designed for practical deployment with better speed and memory efficiency on common GPUs.  | [Paper](https://arxiv.org/abs/2412.13663), [Tweet](https://x.com/jeremyphoward/status/1869786023963832509) 
|
| 4) **Automating the Search for Artificial Life** - Proposes a new approach that uses foundation models to automatically discover interesting artificial-life simulations across multiple platforms, such as Boids, Lenia, and Game of Life. The system finds simulations that produce specified target behaviors, discovers simulations that generate persistently open-ended temporal novelty, and maps the space of diverse simulations. It not only uncovered new lifeforms in Lenia and Boids but also quantified previously qualitative phenomena in a human-aligned way.   | [Paper](https://arxiv.org/abs/2412.17799), [Tweet](https://x.com/SakanaAILabs/status/1871385917342265592) |
| 5) **Survey on Inference-Time Self-Improvement of LLMs** - Organizes and analyzes three classes of techniques for LLM inference-time self-improvement: independent methods (e.g., enhanced decoding), context-aware methods leveraging external data, and model-collaboration strategies.  | [Paper](https://arxiv.org/abs/2412.14352), [Tweet](https://x.com/omarsar0/status/1870129825282658752) |
| 6) **Explore Theory of Mind** - Introduces the ExploreToM framework, which uses A* search to generate diverse, complex theory-of-mind scenarios that expose significant limitations in the social intelligence of current LLMs. Tests show that even advanced models like GPT-4 and Llama-3 perform poorly on these challenging scenarios (accuracy as low as 5%) despite excelling on simpler benchmarks, while fine-tuning on ExploreToM data improves their performance on existing benchmarks by 27 points. | [Paper](https://ai.meta.com/research/publications/explore-theory-of-mind-program-guided-adversarial-data-generation-for-theory-of-mind-reasoning/),  [Tweet](https://x.com/AIatMeta/status/1869457933727416375)  |
| 7) **LearnLM** - A new LearnLM model that follows pedagogical instructions, adapting its teaching to specified educational needs rather than defaulting to mere information delivery. In evaluations, LearnLM is preferred over other leading models, outperforming GPT-4o by 31%, Claude 3.5 by 11%, and Gemini 1.5 Pro by 13%. This instruction-following approach avoids committing to a single pedagogical framework, instead letting teachers and developers specify desired teaching behaviors, and it can co-improve alongside other capabilities. | [Paper](https://services.google.com/fh/files/misc/improving-gemini-for-education_v7.pdf),  [Tweet](https://x.com/Google/status/1869798188233699346)  |
| 8) **Giving Multimodal LLMs o1-style Reasoning and Reflection** - Proposes CoMCTS, a new "learning-to-reason" method that uses the collective knowledge of multiple models to give multimodal language models step-by-step reasoning abilities. The team used it to build Mulberry-260k, a dataset with explicit reasoning trees, and trained the Mulberry model series on it. The approach performs strongly across benchmarks, with clear gains in the models' reasoning and reflection abilities. | [Paper](https://arxiv.org/abs/2412.18319),  [Tweet](https://x.com/_akhaliq/status/1872326647606841651)  |
| 9) **An Overview of Reinforcement Learning** - Provides a comprehensive overview of reinforcement learning.  | [Paper](https://arxiv.org/abs/2412.05265), [Tweet](https://x.com/omarsar0/status/1866123264965419460)  |
| 10) **DRT-o1** - 
Applies long chain-of-thought reasoning to machine translation, excelling at metaphors and similes across cultural contexts. The system uses a multi-agent framework in which a translator works iteratively with an advisor and an evaluator to produce better translations. Tests with Qwen2.5 models show significant gains in BLEU and CometScore, with DRT-o1-7B even outperforming larger models such as QwQ-32B-Preview. | [Paper](https://arxiv.org/abs/2412.17498), [Tweet](https://x.com/_akhaliq/status/1871455986189574320) |

## Top ML Papers of the Week (December 16 - December 22) - 2024
| **Paper**  | **Links** | 
| ------------- | ------------- | 
| 1) **Genesis** - A new universal physics simulation platform combining a high-performance physics engine with generative AI capabilities; it enables natural-language-driven creation of robotic simulations, character animation, and interactive 3D environments at up to 430,000x faster than real time. | [Paper](https://genesis-embodied-ai.github.io/), [Tweet](https://x.com/zhou_xian_/status/1869511650782658846) |
| 2) **Alignment Faking in LLMs** - Research showing that the Claude model can engage in "alignment faking": strategically complying with harmful requests to avoid retraining while preserving its original safety preferences — a finding that raises concerns about the reliability of AI safety training methods. | [Paper](https://arxiv.org/abs/2412.14093), [Tweet](https://x.com/AnthropicAI/status/1869427646368792599) |
| 3) **TheAgentCompany** - A new benchmark for evaluating AI agents on real-world professional tasks in a simulated software-company environment, spanning roles in software engineering, project management, finance, and HR. Tests with a range of LLMs, including API models such as Claude-3.5-Sonnet and open models such as Llama 3.1, reveal the limits of current AI agents: the best performer, Claude-3.5-Sonnet, fully completes only 24% of tasks, or 34.4% when partial progress is credited. | [Paper](https://arxiv.org/abs/2412.14161), [Tweet](https://x.com/gneubig/status/1869735196700062089) |
| 4) **Graph-to-Text-Attributed Graphs** - Automatically generates textual descriptions for graph nodes, enabling efficient conversion to text-attributed graphs; the method is evaluated on text-rich, text-limited, and text-free graphs, showing that a single GNN can operate effectively across graph types. | [Paper](https://arxiv.org/abs/2412.10136), [Tweet](https://x.com/omarsar0/status/1868691391129272461) |
| 5) **Qwen2.5 Technical Report** - Alibaba releases the Qwen2.5 LLM series, trained on 18T tokens, offering open-weight models such as Qwen2.5-72B along with proprietary MoE variants whose performance rivals larger models like Llama-3 and GPT-4. | [Paper](https://arxiv.org/abs/2412.15115), [Tweet](https://x.com/Alibaba_Qwen/status/1869950647501824015) |
| 6) **PAE (Proposer-Agent-Evaluator)** - A learning system that lets AI agents autonomously discover and practice skills through web navigation, using reinforcement learning with context-aware task proposals to reach state-of-the-art results on real-world benchmarks. | [Paper](https://arxiv.org/abs/2412.13194) |
| 7) **DeepSeek-VL2** - 
一个新的视觉-语言模型系列，采用动态分块技术处理高分辨率图像，并配备高效的MoE架构，在各类视觉任务中表现出色；在激活参数数量相似或更少的情况下，其性能可与现有的开源密集型及MoE模型相媲美甚至超越。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.10302), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1868696154067865659) |\n| 8) **AutoFeedback** - 一种双代理AI系统，能够为科学测评中的学生作答生成更准确且更具教育学意义的反馈，相比单代理模型显著减少了过度表扬等常见错误。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.07407) |\n| 9) **多模态大模型时代下的数学推理综述** - 提出了一篇全面的综述，分析了多模态大语言模型（MLLMs）中的数学推理能力，涵盖了自2021年以来200余项研究中的基准测试、方法论和挑战。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.11936), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1870126516832792811) |\n| 10) **大模型中的精确长度控制** - 将预训练的解码器专用大模型调整为生成指定长度的响应；通过在输入嵌入中集成二次长度差异位置编码，实现对用户设定的响应终止长度的倒计时计算；声称可在不牺牲质量的前提下，将平均词元误差控制在3个以内。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.11937), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1869030043084845453) |\n\n## 本周顶级机器学习论文（12月9日–12月15日）- 2024年\n| **论文**  | **链接** | \n| ------------- | ------------- | \n| 1) **在连续潜在空间中训练大语言模型进行推理** - 提出了Coconut（连续思维链），这是一种新颖的范式，使大语言模型能够在连续的潜在空间中而非自然语言中进行推理；Coconut将大语言模型的最后一层隐藏状态作为推理状态，并直接以连续空间中的嵌入形式反馈给模型作为后续输入；由此产生了作者所称的“连续思维”，从而增强了大语言模型在推理任务上的能力；它通过涌现的广度优先搜索能力，在复杂推理任务上表现出更好的性能。   | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06769), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1866518791733342563) |\n| 2) **Phi-4 技术报告** - 介绍了phi-4，这是一款140亿参数的模型，在STEM问答任务上的表现超越了其教师模型。由于改进的数据、训练课程以及后训练方案中的创新，它在以推理为重点的基准测试中也取得了强劲的成绩。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.08905), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1867609628529635574) |\n| 3) **异步函数调用** - 提出了一种用于异步大语言模型函数调用的系统AsyncLM；他们设计了一种上下文感知的函数调用与中断协议，提供了一种微调策略以使大语言模型适应中断语义，并在大语言模型推理过程中高效地实现了这些机制；与同步函数调用相比，AsyncLM可以将任务完成延迟缩短1.6至5.4倍；它使大语言模型能够并发地生成和执行函数调用。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.07017), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1866855077983686804) |\n| 4) **MAG-V** - 
一个多智能体框架，首先生成一组模拟客户查询的问题数据集；然后从回复中反向推导出替代问题，以验证智能体的决策轨迹；报告称，生成的合成数据可以提升智能体在实际客户查询中的表现；研究发现，对于轨迹验证而言，经过特征工程的简单机器学习基线模型即可达到与更昂贵且功能更强的模型相当的性能。   | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.04494), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1866143542726340890) |\n| 5) **Clio** - 提出了一种利用AI助手分析并提取数百万Claude.ai对话中的私有聚合使用模式的平台；该系统能够在保护用户隐私的同时，提供对真实世界AI使用情况的洞察；它无需人工审核员阅读原始对话，即可帮助识别使用趋势、安全风险以及协同滥用行为。  | [论文](https:\u002F\u002Fassets.anthropic.com\u002Fm\u002F7e1ab885d1b24176\u002Foriginal\u002FClio-Privacy-Preserving-Insights-into-Real-World-AI-Use.pdf), [推文](https:\u002F\u002Fx.com\u002FAnthropicAI\u002Fstatus\u002F1867325190352576780) |\n| 6) **关于大语言模型作为裁判的综述** - 从五个关键视角对大语言模型作为裁判这一范式进行了全面综述：功能、方法论、应用、元评估及局限性。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.05579), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1866541394015518824) |\n| 7) **AutoReason提升多步推理能力** - 提出了一种使用思维链提示自动生成查询理由的方法；这种方法将零样本查询转化为少样本推理轨迹，这些轨迹随后被用作大语言模型的思维链示例；声称可以改善较弱大语言模型的推理能力。   | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06975), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1867224350287372555) |\n| 8) **字节潜在变换器（BLT）** - 介绍了一种字节级语言模型架构，其性能可与基于分词的大语言模型相媲美，同时提高了效率和鲁棒性；该模型采用一种动态方法，根据下一个字节的熵将字节分组为补丁，对复杂预测分配更多计算资源，而对更可预测的序列则使用较大的补丁；BLT证明，在推理过程中，其性能可以达到或超过Llama 3等模型，同时所需的浮点运算次数减少多达50%。 | [论文](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fpublications\u002Fbyte-latent-transformer-patches-scale-better-than-tokens\u002F), [推文](https:\u002F\u002Fx.com\u002FArtidoroPagnoni\u002Fstatus\u002F1867601413741981804) |\n| 9) **RLHF能扩展吗？** - 这篇新论文探讨了RLHF框架中关键组件的影响。主要发现总结如下：1) 在大语言模型中，RLHF的扩展效果不如预训练显著；当使用固定奖励模型时，更大的策略模型从RLHF中获益较少；2) 在策略训练过程中增加每个提示采样的响应数量，初期性能会有所提升，但很快就会趋于平稳，通常在4到8个样本左右；3) 使用更大的奖励模型有助于提升推理任务的表现，但不同任务类型之间的提升效果并不一致；4) 增加奖励模型的训练数据多样性比增加每个提示的响应多样性更为有效，然而无论额外增加多少数据，策略训练在早期阶段之后都会出现收益递减。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.06000), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1866525606562680954) |\n| 10) 
**Granite Guardian** - IBM开源了Granite Guardian，这是一套用于检测大语言模型风险的安全防护工具；作者声称，在有害内容和RAG幻觉相关基准测试中，Granite Guardian的AUC得分分别为0.871和0.854，使其成为目前该领域中最具通用性和竞争力的模型。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.07724), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1866852443621036228) |\n\n## 本周顶级机器学习论文（12月2日至12月8日）- 2024年\n| **论文**  | **链接** | \n| ------------- | ------------- | \n| 1) **OpenAI o1** - 一个通过大规模强化学习训练、采用思维链方式进行推理的模型系列；o1 在数学、代码和科学相关基准测试中表现出显著提升；据称其生成思考步骤的速度比 o1-preview 快 50%。实验结果表明，o1 在推理任务上表现更优，能够生成更为全面且可靠的响应。  | [论文](https:\u002F\u002Fcdn.openai.com\u002Fo1-system-card-20241205.pdf), [推文](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1864729936847868192) |\n| 2) **Genie 2** - 一个基础世界模型，能够根据单张提示图像生成可玩的3D环境，为AI智能体提供无限的训练场景，具备物理模拟、角色动画和物体交互等功能；Genie 2 使用自编码器与Transformer相结合的方式，基于视频数据训练生成虚拟世界；该模型可以创建实时交互式环境，并提供速度更快但质量稍低的版本以供即时体验。  | [论文](https:\u002F\u002Fdeepmind.google\u002Fdiscover\u002Fblog\u002Fgenie-2-a-large-scale-foundation-world-model), [推文](https:\u002F\u002Fx.com\u002FGoogleDeepMind\u002Fstatus\u002F1864367798132039836) |\n| 3) **逆向思维** - 研究表明，训练大语言模型掌握“逆向思维”有助于提升常识、数学和逻辑推理任务的表现。该方法声称优于在正向推理数据上训练、数据量是其10倍的标准微调方法。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.19865), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1863595518649098371) |\n| 4) **ALAMA** - 一种新框架，帮助语言智能体自动学习何时使用不同的机制（如ReAct、CoT、Reflection等）来完成任务，从而改进当前采用固定或预定义机制的方法；该框架可根据任务的潜在特性自适应地激活合适的机制。实验结果表明，在下游智能体任务中，包括数学推理和知识密集型推理，性能有显著提升。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.00722), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1863956776623747433) |\n| 5) **Auto-RAG** - 一种自主迭代检索模型，在多个数据集上均表现出卓越性能；Auto-RAG 是经过微调的大语言模型，利用大语言模型的决策能力，通过多轮对话与检索器互动，系统性地规划检索并优化查询，以获取有价值的知识——这一过程将持续进行，直到获得足够的外部信息为止。作者还指出，该方法可以根据问题难度自动调整迭代次数，无需人工干预。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.19443), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1863600141103501454) |\n| 6) **GenCast** - 
一款ML天气预测模型，在准确性和速度上均超越全球领先的业务化天气预报系统（ECMWF的ENS）。它仅需8分钟即可生成涵盖80余种变量的15天全球概率性天气预报，且在97.2%的评估指标上表现优于ENS。GenCast 生成的集合预报能够更好地捕捉不确定性，精准预测极端天气事件、热带气旋路径及风力发电量。  | [论文](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41586-024-08252-9), [推文](https:\u002F\u002Fx.com\u002FGoogleDeepMind\u002Fstatus\u002F1864340994965098513) |\n| 7) **人机交互中的挑战** - 对人机交互中的关键挑战进行了全面分析，重点探讨人类与AI智能体如何有效建立共同基础和相互理解；研究识别出三大类共12项核心挑战：从智能体向用户传递信息、使用户能够向智能体传达信息，以及影响所有交互的一般性沟通挑战。  | [论文](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fuploads\u002Fprod\u002F2024\u002F12\u002FHCAI_Agents.pdf) |\n| 8) **面向LLM的检索增强推理** - 将rStar推理框架扩展，以提升LLM的推理准确性和事实可靠性；该方法利用蒙特卡洛树搜索（MCTS）框架，结合显式的检索增强推理生成多个候选推理轨迹，随后通过检索增强的事实性评分器评估各条轨迹的事实准确性，最终选择事实性得分最高的轨迹作为系统的答案。在医学推理任务中，RARE（使用Llama 3.1）的表现超越了GPT-4等更大规模的模型；而在常识推理任务中，RARE的表现优于Claude-3.5 Sonnet和GPT-4o-mini，其性能可与GPT-4o相媲美。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.02830), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1864687176929431566) |\n| 9) **DataLab** - 一个由基于LLM的智能体驱动的统一商业智能平台，集任务规划、推理和计算笔记本于一体，旨在简化整个BI工作流程。该系统在研究基准测试中达到SOTA水平，并在腾讯的真实企业数据上展现出显著的准确性和效率提升；在特定于企业的BI任务中，准确率最高可提升58.58%，令牌成本则最多可降低61.65%。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.02205), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1864327307177152619) |\n| 10) **预训练中的程序性知识驱动LLM的推理** - 研究了哪些文档对模型输出具有重要影响；通过分析这些相关数据，试图更好地理解LLM在执行推理任务时所采用的泛化策略。研究发现，在进行推理任务时，起关键作用的文档往往包含程序性知识（例如，展示如何使用公式或代码得出解决方案）。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.12580), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1863590537346925032) |\n\n## 本周顶级机器学习论文（11月25日—12月1日）- 2024年\n| **论文**  | **链接** | \n| ------------- | ------------- | \n| 1) **LLM在预测神经科学实验结果方面超越人类专家** - 提出BrainBench框架，用于研究LLM在预测神经科学研究实验结果方面的表现；他们基于神经科学文献微调了一款名为BrainGPT的LLM，在预测神经科学结果上超越了专家；报告指出，当LLM对其预测表现出高度自信时，其回答更有可能正确。 | [论文](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41562-024-02046-9), 
[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1861781028291190887) |\n| 2) **Fugatto** - 由NVIDIA推出的一款新型生成式AI声音模型，能够通过文本和音频输入创建并转换任意组合的音乐、人声和音效；该模型使用25亿参数进行训练，具备新颖的音频生成能力，例如让小号发出狗叫声或萨克斯风发出猫叫声。 | [论文](https:\u002F\u002Fd1qx31qr3h6wln.cloudfront.net\u002Fpublications\u002FFUGATTO.pdf), [推文](https:\u002F\u002Fx.com\u002FNVIDIAAIDev\u002Fstatus\u002F1861052624352825383) |\n| 3) **o1复现之旅——第2部分** - 表明将o1 API的简单蒸馏与监督微调相结合，可显著提升复杂数学推理任务的性能；经过对数万条o1蒸馏的长链思维样本进行微调的基础模型，在美国邀请数学竞赛（AIME）中表现优于o1-preview。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.16489), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1861411844554113276) |\n| 4) **LLM驱动的GUI智能体** - 概述了基于LLM的GUI智能体相关技术及应用。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18279), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1862133601040752820) |\n| 5) **高层次自动化推理** - 通过高层次自动化推理扩展上下文学习；使用Qwen2.5-7B-Instruct在MATH基准测试中达到79.6%的最先进准确率，超越GPT-4o（76.6%）和Claude 3.5（71.1%）；该方法不依赖于手动构建高质量演示样例，而是转向抽象思维模式；引入五种原子级推理动作来构建链式推理模式，随后利用蒙特卡洛树搜索探索推理路径，并生成思维卡片以指导推理过程。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.18478), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1862131336653533584) |\n| 6) **Star Attention：高效处理长序列的LLM推理** - 提出Star Attention注意力机制，这是一种两阶段机制，通过结合块级局部注意力进行上下文编码，以及序列全局注意力进行查询处理和标记生成，从而高效处理长序列；相比传统注意力机制，该方法可在保持95%-100%准确率的同时，将推理速度提升至11倍，且能通过在多台主机间高效分配计算资源实现这一点；其关键创新在于“锚定块”机制，即每个上下文块都前置第一个块，从而在降低计算开销的同时有效近似全局注意力模式。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.17116), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1861854543694406109) |\n| 7) **LLM作为评判者的综述** - 提供关于LLM作为评判者的全面综述，包括如何构建可靠的LLM评判系统这一深入讨论。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15594), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1861411159913472229) |\n| 8) **TÜLU 3** - 发布了一系列完全开源的最先进后训练模型，并附带数据、代码和训练配方，为现代后训练技术提供了全面指南。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15124), 
[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1861085195950256335) |\n| 9) **1000人的生成式智能体模拟** - 引入一种新的智能体架构，利用LLM创建真实个体的行为模拟，在通用社会调查中实现了85%的人类反应复制准确率，并相较于传统方法降低了人口统计学偏差。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10109), [推文](https:\u002F\u002Fx.com\u002Fpercyliang\u002Fstatus\u002F1861136757435015580) |\n| 10) **衡量ChatGPT参与的语言游戏中存在的胡言乱语** - 提出让基于LLM的聊天机器人参与“胡言乱语的语言游戏”；作者通过要求ChatGPT就其毫无知识或能力的主题生成科学论文，得以提供一套关于这种“胡言乱语”表现形式的参考样本。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.15129), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1861066315789942978) |\n\n## 本周顶级机器学习论文（11月18日至11月24日）- 2024年\n| **论文**  | **链接** | \n| ------------- | ------------- | \n| 1) **AlphaQubit** - 一种基于人工智能的新解码器，为量子计算机中的错误识别设立了最先进的基准；该系统采用Transformer架构，在Sycamore数据集上的测试中，其错误率比张量网络方法低6%，比相关匹配方法低30%；在高达241个量子比特的大规模系统仿真中也表现出令人鼓舞的结果。尽管这标志着量子纠错领域的重大进展，但该系统在速度方面仍需改进，才能实现实时纠错，以满足实际量子计算应用的需求。  | [论文](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41586-024-08148-8), [推文](https:\u002F\u002Fx.com\u002FGoogleDeepMind\u002Fstatus\u002F1859273133234192598) |\n| 2) **GUI代理的黎明** - 探讨Claude 3.5在不同领域和软件中的计算机使用能力；同时提供了一个开箱即用的代理框架，用于部署基于API的GUI自动化模型；Claude 3.5的计算机使用功能展现了前所未有的端到端语言到桌面操作能力。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10323), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1858526493661446553) |\n| 3) **LLM评估的统计方法** - 提出了五项关键的统计建议，以更严格地评估LLM性能差异。这些建议包括：1) 利用中心极限定理来衡量所有可能问题的理论平均值，而不仅仅是观测到的平均值；2) 当问题相互关联而非独立时，对标准误差进行聚类；3) 通过重采样或使用下一个标记的概率来降低问题内部的方差；4) 由于评估中共享相同的问题，因此分析模型之间的配对差异；5) 使用功效分析来确定合适的样本量，以便检测模型之间有意义的差异。作者认为，这些统计方法将帮助研究人员更好地判断模型间的性能差异是真正的能力差距，还是仅仅由偶然因素造成的，从而实现更精确、更可靠的模型评估。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.00640), [推文](https:\u002F\u002Fx.com\u002FAnthropicAI\u002Fstatus\u002F1858976458330505639) |\n| 4) **面向开放性解决方案的开放式推理模型** - 提出Marco-o1，这是一种专为开放性解决方案设计的推理模型；Marco-o1基于思维链微调、蒙特卡洛树搜索、反思机制以及更多最新的推理策略构建而成；在MGSM（英语）数据集上，Marco-o1的准确率提升了+6.17%，在MGSM（中文）数据集上则提升了+5.60%。  | 
[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.14405), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1860003607606706197) |\n| 5) **基于LLM的自动修复缺陷代理** - 在SWE-bench Lite基准测试上分析了七种领先的基于LLM的缺陷修复系统，发现由字节跳动开发的MarsCode Agent取得了最高的成功率，达到39.33%；研究揭示，对于错误定位而言，行级故障定位精度比文件级精度更为关键，而缺陷重现能力则显著影响修复的成功率；结果显示，168个已解决的问题中，有24个只有通过重现技术才能解决，尽管当问题描述已经清晰时，重现有时反而会误导LLM；结论指出，要提高自动修复缺陷的有效性，还需在LLM的推理能力和代理工作流设计方面进行改进。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10213), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1859964808789135668) |\n| 6) **降低大词汇量语言模型中的内存消耗** - 介绍了一种名为Cut Cross-Entropy (CCE)的新方法，通过优化交叉熵损失的计算方式，大幅减少LLM训练过程中的内存占用；目前，LLM训练中的交叉熵层由于需要存储所有可能词汇标记的logits，往往占据大量内存（某些模型甚至高达90%）。CCE通过仅计算正确标记的logits，并利用闪存即时对所有logits进行log-sum-exp运算来解决这一问题；作者表明，该方法可将Gemma 2的内存占用从24GB降至仅1MB；该方法利用softmax计算的固有稀疏性，跳过对梯度贡献可以忽略不计的元素；最终证明，CCE在实现如此大幅度的内存缩减的同时，并未牺牲训练速度或收敛性，从而允许在训练过程中使用更大的批次，也可能使LLM训练的扩展更加高效。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.09009) |\n| 7) **BABY-AIGS** - 一个用于自动化科学发现的多智能体系统，强调通过自动化消融研究来进行证伪。该系统在三项机器学习任务（数据工程、自指令对齐和语言建模）上进行了测试，展示了产生有意义科学发现的能力。然而，其性能仍低于经验丰富的研究人员。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11910v1), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1859656533489188928) |\n| 8) **提示格式是否会影响LLM性能** - 研究了不同提示格式（纯文本、Markdown、JSON和YAML）如何影响GPT模型在各类任务中的表现；发现GPT-3.5-turbo的性能会因提示格式的不同而变化多达40%，而像GPT-4这样的大型模型则对格式变化表现出更强的鲁棒性；研究认为，不存在适用于所有模型或任务的通用最优格式——例如，GPT-3.5-turbo通常在使用JSON格式时表现更好，而GPT-4则更偏好Markdown；同一型号系列的模型往往具有相似的格式偏好，但这些偏好在不同型号系列之间并不通用；研究建议，在进行提示工程和模型评估时，应仔细考虑提示格式对模型性能的影响，以及如何将其应用于具体应用场景。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.10541) |\n| 9) **FinRobot** - 一个用于股票研究的AI代理框架，采用多智能体思维链提示，将数据分析与人类般的推理相结合，生成可与大型券商媲美的专业投资报告；它利用三个智能体：数据-CoT智能体负责整合多种数据源，实现稳健的财务集成；概念-CoT智能体则进行分析师式的推理，生成可操作的洞察；最后，论点-CoT智能体将这些洞察综合成连贯的投资论点和报告。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.08804) |\n| 10) **Bi-Mamba** - 
一种可扩展的1位Mamba架构，专为效率更高的LLM设计，涵盖7.8亿、13亿和27亿参数等多种规模；Bi-Mamba的性能可与全精度版本（如FP16或BF16）相媲美，同时显著降低了内存占用，且准确性优于后训练二值化Mamba基线。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.11843) |\n\n## 本周顶级机器学习论文（11月11日至11月17日）- 2024年\n| **论文**  | **链接** | \n| ------------- | ------------- | \n| 1) **人工智能对创新的影响** - 研究表明，顶尖科学家能够利用其领域知识优先筛选出有前景的人工智能建议，而其他研究人员则会浪费大量资源去验证假阳性结果；研究还发现，引入人工智能材料发现技术可显著提升生产力，新材料的发现数量增加44%，专利申请量增加39%，产品创新数量增加17%。然而，这些收益也伴随着令人担忧的权衡：82%的科学家表示，由于创造力下降和技能未能充分发挥，工作满意度有所降低。  | [论文](https:\u002F\u002Faidantr.github.io\u002Ffiles\u002FAI_innovation.pdf), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1856424446720127024) |\n| 2) **精度的扩展规律** - 提出了“精度感知”的扩展规律，用于预测大型语言模型在训练和推理过程中精度对其性能的影响；主要发现包括：1) 随着模型训练数据量的增加，后训练量化带来的负面影响愈发显著，最终导致额外的预训练反而有害；2) 以较低精度进行训练时，需要增大模型规模才能维持性能；3) 在联合优化模型规模、数据量和精度的情况下，计算最优的训练精度约为7–8位，并且与计算资源无关。此外，当模型规模固定时，计算最优精度会随数据量的增加呈近似对数增长趋势。作者在参数规模达17亿、训练数据量达260亿token的模型上验证了其预测，结果表明，极高（16位）和极低（低于4位）的训练精度都可能并非最佳选择。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04330), [推文](https:\u002F\u002Fx.com\u002Ftanishqkumar07\u002Fstatus\u002F1856045600355352753) |\n| 3) **Evo** - 一个拥有70亿参数的人工智能模型，旨在跨多个生物尺度理解和生成DNA序列。该模型基于270万个原核生物和噬菌体基因组进行训练，能够处理长达131千碱基的序列，同时保持单核苷酸分辨率，从而既能理解分子水平的相互作用，又能捕捉全基因组范围内的模式。Evo在预测和生成功能性DNA、RNA和蛋白质序列方面表现出色，甚至首次成功生成经实验验证的CRISPR-Cas复合物及转座系统。  | [论文](https:\u002F\u002Fwww.science.org\u002Fdoi\u002F10.1126\u002Fscience.ado9336), [推文](https:\u002F\u002Fx.com\u002Farcinstitute\u002Fstatus\u002F1857138107038187945) |\n| 4) **OpenCoder** - 介绍了OpenCoder，这是一个完全开源的专注于代码生成与理解的大语言模型。作者指出，构建高性能代码大语言模型的关键因素包括：(1) 使用针对代码优化的启发式规则进行有效的去重数据清洗；(2) 检索与代码相关的高质量文本语料库；以及(3) 在退火和监督微调阶段使用高质量的合成数据。OpenCoder在60亿参数以上的完全开源模型中表现超越此前的同类模型，并不仅公开模型权重，还发布了完整的训练流程、数据集和协议，以支持可重复的研究。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04905), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1857515355595526450) |\n| 5) **测试时训练在抽象推理中的惊人效果** - 
探讨了测试时训练（TTT）——即在推理过程中临时更新模型参数——如何在ARC基准测试上提升大型语言模型的抽象推理能力。研究确定了三个关键要素：针对类似任务的初始微调、辅助任务的格式与增强方法，以及基于实例的训练。TTT显著提升了性能，相比基础微调模型，准确率最高可提高6倍。将TTT应用于80亿参数的语言模型时，他们在ARC公开验证集上取得了53%的准确率，使神经网络方法的最新水平提高了近25%。通过将其方法与程序生成方法相结合，他们进一步提升了公开验证集的准确率至61.9%，达到人类平均表现水平。研究结果表明，显式符号搜索并非提升大型语言模型抽象推理能力的唯一途径；在少量示例的基础上应用测试时训练同样可以非常有效。  | [论文](https:\u002F\u002Fekinakyurek.github.io\u002Fpapers\u002Fttt.pdf), [推文](https:\u002F\u002Fx.com\u002Fakyurekekin\u002Fstatus\u002F1855680785715478546) |\n| 6) **用于实现基础模型驱动代理可观测性的AgentOps分类法** - 分析了AgentOps平台和工具，强调需要全面的可观测性和可追溯性功能，以确保基于基础模型的自主代理系统在其开发和生产生命周期中的可靠性。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.05285v1), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1857400667318702118) |\n| 7) **迈向RAG的最佳检索与搜索策略** - 考察了检索环节对QA任务中RAG流水线性能的影响。研究使用BGE-base和ColBERT检索器配合LLaMA和Mistral模型进行了实验，发现加入更多黄金（相关）文档能够提升QA准确率；同时发现，采用召回率较低的近似最近邻搜索仅会对性能产生轻微影响，却能显著提升速度并节省内存。研究还指出，添加噪声或不相关文档会持续降低性能，这与先前的研究结论相矛盾。研究最后得出结论：优化黄金文档的检索对于RAG性能至关重要，而在实际应用中，以较低的搜索精度运行是一种可行的方案。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.07396), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1856709865802252710) |\n| 8) **用少量示例缓解LLM越狱攻击** - 提出了一种新的防御方法，旨在应对LLM越狱攻击，重点在于检测到新型攻击后迅速调整防御措施，而非追求事前完美的对抗鲁棒性。借助新基准测试，最有效的方法是通过对输入分类器进行微调，仅需看到每种攻击策略的一个示例，即可将已知攻击类型的成功率降低240倍以上，对新型变种的攻击成功率也能降低15倍。研究证明，快速响应新型越狱攻击可以作为传统静态防御的有效替代方案。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.07494), [推文](https:\u002F\u002Fx.com\u002FAnthropicAI\u002Fstatus\u002F1856752093945540673) |\n| 9) **Transformer混合模型** - 介绍了一种名为Transformer混合模型（MoT）的新稀疏多模态Transformer架构，其性能可媲美传统模型，但处理文本和图像时仅需约一半的计算资源。MoT仅使用55.8%的FLOPs便能达到密集型基准模型的性能。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.04996) |\n| 10) **HtmlRAG** - 提出了一种新颖的方法，建议在构建RAG系统时使用HTML格式而非纯文本。核心发现是，保留HTML结构能够提供比单纯转换为文本更丰富的语义和结构信息，因为后者通常会丢失标题、表格和语义标签等重要格式。为解决HTML文档长度超出LLM上下文窗口的问题，作者开发了一种两步修剪方法：首先清除不必要的HTML元素（使文档长度减少94%），然后采用基于块树的修剪策略，结合嵌入式和生成式修剪技术，进一步精简内容的同时保留关键信息。在六个不同QA数据集上的实验表明，HtmlRAG的表现优于现有的纯文本方法，证实了在RAG系统中保留HTML结构的优势。  
| [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02959v1), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1857870511302390013) |\n\n## 本周顶级机器学习论文（11月4日至11月10日）- 2024年\n| **论文**  | **链接** | \n| ------------- | ------------- | \n| 1) **迈向人工智能文明的多智能体模拟** - 展示了10至1000多个AI智能体在智能体社会中的行为与演进；提出了PIANO架构，使智能体能够实时与人类及其他智能体互动；表明智能体可以自主发展出专业化角色、遵守并改变集体规则，以及进行文化和宗教传承。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.00114), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1853290196286021940) |\n| 2) **小型语言模型综合综述** - 对小型语言模型（SLM）进行了综述，并讨论了定义、应用、性能提升、可靠性等相关问题。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.03350), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1854532748154695717) |\n| 3) **Magentic-One** - 一种新型通用多智能体系统，旨在处理复杂的网络和文件相关任务；它使用一个编排智能体来指挥四个专业智能体：WebSurfer负责浏览器操作，FileSurfer负责文件管理，Coder负责编程任务，ComputerTerminal则用于控制台操作；Magentic-One在GAIA、AssistantBench和WebArena等多个基准测试中取得了具有竞争力的表现，且无需对其核心架构进行修改。 | [论文](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fpublication\u002Fmagentic-one-a-generalist-multi-agent-system-for-solving-complex-tasks\u002F), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1854910759232585786) |\n| 4) **上下文学习混合模型** - 利用演示数据的子集通过上下文学习训练专家模型；给定一个训练集，通过可训练的权重函数将各专家模型的下一个词预测结果进行组合；该方法适用于黑盒大语言模型，因为无需访问模型的内部参数。其优点包括：1) 在性能上可与标准的上下文学习相媲美，同时在数据、内存和计算效率上显著更高；2) 对噪声演示和标签不平衡具有较强的鲁棒性。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02830), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1854252169492562171) |\n| 5) **通过弹出窗口攻击视觉-语言智能体** - 研究表明，在现有智能体测试环境中集成对抗性弹出窗口，可使攻击成功率高达86%；这会使智能体的任务成功率下降47%；研究者还指出，基本的防御措施（例如指示智能体忽略弹出窗口）效果甚微。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02391), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1853810252308774955) |\n| 6) **基于大语言模型的多专家提示方法** - 通过模拟多位专家并整合其回答，以改进大语言模型的响应；该方法引导大语言模型根据输入指令，模拟多个专家并从个体及整合后的观点中选择最佳回复；在TruthfulQA生成任务上，该方法使用ChatGPT达到了新的最先进水平，超越了此前87.97%的SOTA水平；同时，它还在事实性和实用性方面提升了表现，降低了毒性与伤害性。  | 
[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.00492), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1853286452227899851) |\n| 7) **大语言模型的数字理解能力** - 对大语言模型的数字理解与处理能力（NUPA）进行了全面分析；发现简单的微调可以在许多但并非所有任务上显著提升NUPA；同时报告称，旨在增强NUPA的技术对预训练模型的微调效果不佳；研究还探讨了应用于NUPA的思维链技术，并指出思维链方法存在可扩展性挑战，使其难以在实际场景中应用。   | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.03766), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1854528742095458337) |\n| 8) **WebRL** - 提出了一种自进化在线课程强化学习框架，以弥合开源与专有LLM驱动的网络智能体之间的差距；该框架将Llama-3.1-8B的成功率由4.8%提升至42.4%，GLM4-9B的成功率则由6.1%提升至43%；开源模型的表现显著优于GPT-4-Turbo（17.6%）和GPT-4o（13.9%）；自进化课程解决了网络智能体训练任务稀缺的问题；其背后是强大的基于结果监督的奖励模型，用于评估任务成功与否；自适应强化学习策略有助于应对在线学习中的分布漂移，确保持续改进。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.02337), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1853821990177485311) |\n| 9) **边学习边适应** - 提出了一个两阶段微调方法，首先帮助大语言模型从工具生成的解决方案中学习，然后训练它们判断何时直接解决问题、何时使用工具；在数学、气候科学和流行病学等基准测试上的实验显示，该方法带来了显著提升，准确率提高了28%，工具使用精度也比GPT-4和Claude-3.5等领先模型高出14%；这种两阶段方法有助于大语言模型自适应地解决不同复杂程度的科学问题。   | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.00412), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1853281778594979877) |\n| 10) **大语言模型的个性化** - 提出了一套全面的框架，用于理解个性化大语言模型；介绍了个性化不同方面的分类体系，并整合了现有关于个性化文本生成及下游应用的研究成果。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2411.00027), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1853276249981907386) |\n\n## 本周顶级机器学习论文（10月28日—11月3日）- 2024年\n| **论文**  | **链接** | \n| ------------- | ------------- | \n| 1) **大语言模型中概念的几何结构** - 研究了稀疏自编码器（SAE）中概念表示的几何结构，从三个尺度展开：1) 相关概念之间原子级别的平行四边形模式（如：男:女::王:后），2) 类似大脑的功能“叶区”，分别对应数学\u002F代码等不同类型的知识，3) 星系级别的特征值分布，显示中间模型层具有专门化的结构。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19750), [推文](https:\u002F\u002Fx.com\u002Ftegmark\u002Fstatus\u002F1851288315867041903) |\n| 2) **SimpleQA** - 
一个由4,326道简短事实性问题组成的挑战性基准，这些问题针对GPT-4的回答进行对抗性收集；报告指出，GPT-4o和Claude等前沿模型的准确率不足50%；研究发现，模型声明的信心与实际准确率之间存在正相关校准关系，表明这些模型具备一定的置信度认知；同时认为，大语言模型在置信度校准方面仍有提升空间。 | [论文](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-simpleqa\u002F), [推文](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1851680760539025639) |\n| 3) **自动化智能体工作流生成** - 提出了一种全新的框架，用于自动化生成智能体工作流；该框架将工作流优化重新表述为基于代码表示的工作流搜索问题，其中边连接调用大语言模型的节点；通过改进版的蒙特卡洛树搜索（MCTS）高效探索搜索空间，借助代码修改、树状经验积累和执行反馈迭代优化工作流；在六个基准数据集上的实验表明，AFlow效果显著，相比手动设计方法提升了5.7%，相比现有自动化方法提升了19.5%；此外，AFlow还使小型模型在特定任务上以仅为其推理成本4.55%的开销超越了GPT-4o。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10762), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1852339570891014415) |\n| 4) **大语言模型通过启发式集合解决数学问题** - 利用因果分析找到解释大语言模型在执行基础算术逻辑时行为的神经元；研究发现并提出假设，不同类型的启发式神经元组合是产生正确算术答案的机制；进一步指出，不同启发式类型无序组合的方式解释了模型在算术提示上的大部分准确率。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.21272), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1851233281116946923) |\n| 5) **o1复现之旅** - 报告称正在复现OpenAI的o1模型的能力；其“旅程学习”技术不仅鼓励学习捷径，还注重完整的探索过程，包括试错、反思和回溯；声称仅使用327个训练样本，“旅程学习”技术在MATH数据集上就比捷径学习高出8.0%。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18982), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1850748790308761988) |\n| 6) **区分大语言模型幻觉中的无知与错误** - 提出一种方法来区分两种类型的幻觉：模型缺乏知识导致的幻觉（HK-）以及尽管拥有正确知识但仍产生幻觉的情况（HK+）；研究团队利用所提出的方案构建了模型专属数据集，并证明与通用数据集相比，模型专属数据集在检测HK+幻觉方面更为有效。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.22071), [推文](https:\u002F\u002Fx.com\u002FAdiSimhi\u002Fstatus\u002F1851650371615125563) |\n| 7) **多模态RAG** - 探讨如何在工业领域中将多模态模型最佳地集成到RAG系统中；同时深入讨论了使用大语言模型作为“裁判”对这类系统进行评估的方法。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.21943), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1851479149690642456) |\n| 8) **提示工程与外部工具在大语言模型幻觉率中的作用** - 测试了多种旨在降低大语言模型幻觉率的提示策略和框架；研究发现，更简单的提示技术优于复杂方法；同时指出，由于工具使用的复杂性增加，大语言模型代理的幻觉率更高。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.19385), 
[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1850745569125253401) |\n| 9) **MrT5** - 一种更高效的字节级语言模型变体，采用动态令牌删除机制（通过学习得到的删除门）将序列长度缩短多达80%，同时保持模型性能；这使得推理速度更快，且无需传统分词即可更好地处理多语言文本；MrT5在XNLI和字符级操作等下游任务上与ByT5相比仍保持竞争力，同时显著提升了推理运行时间。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.20771), [推文](https:\u002F\u002Fx.com\u002FJulieKallini\u002Fstatus\u002F1851278833061704170) |\n| 10) **松弛递归Transformer** - 提出了一种名为“松弛递归Transformer”的新方法，通过跨层参数共享大幅降低大语言模型规模，同时保持性能；该模型以标准预训练的Transformer为基础初始化，但仅使用一组独特的层，并以循环方式重复多次；随后通过深度低秩适应（LoRA）模块为层间约束增加灵活性；研究表明，该方法有望使推理吞吐量大幅提升（2–3倍）。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.20672), [推文](https:\u002F\u002Fx.com\u002Fraymin0223\u002Fstatus\u002F1851216039822180759) |\n\n## 本周顶级机器学习论文（10月21日至10月27日）- 2024年\n| **论文**  | **链接** | \n| ------------- | ------------- | \n| 1) **智能体式信息检索** - 介绍了由大语言模型智能体能力所塑造的智能体式信息检索；讨论了该技术的不同前沿应用及其面临的挑战。   | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.09713), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1848396596230127655) |\n| 2) **Aya Expanse** - 一个面向多语言能力的开源基础模型系列；发布了80亿和320亿参数的模型，其中包含迄今为止最大的多语言数据集之一，共计5.13亿个样本；此次发布还包括Aya-101，作者称其为覆盖101种语言的最全面的多语言模型；Aya Expanse 32B的表现优于Gemma 2 27B、Mistral 8x22B以及Llama 3.1 70B——后者参数规模是前者的两倍。  | [论文](https:\u002F\u002Fcohere.com\u002Fblog\u002Faya-expanse-connecting-our-world), [推文](https:\u002F\u002Fx.com\u002FCohereForAI\u002Fstatus\u002F1849435983449587796) |\n| 3) **对思维链的理论理解** - 研究发现，在演示中同时加入正确与错误的推理路径，能够提升中间步骤及思维链推理的准确性；所提出的“连贯思维链”方法在多个基准测试上显著提升了性能；例如，在“追踪打乱的物体”数据集中，Gemini Pro的准确率提升了6.60%（从58.20%增至64.80%），而在“桌上的企鹅”任务中，DeepSeek 67B的准确率则提高了6.17%（从73.97%增至80.14%）。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.16540), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1849139985712369907) |\n| 4) **面向大语言模型的数据合成与增强综述** - 对大语言模型生命周期中的数据生成技术进行了全面总结；内容涵盖数据准备、预训练、微调、指令微调、偏好对齐及应用场景等。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12896), 
[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1848445736591163886) |\n| 5) **LongRAG** - 增强了RAG系统对长上下文知识的理解能力，包括全局信息与事实细节；其核心组件包括混合检索器、大语言模型增强的信息提取器、思维链引导的过滤器以及大语言模型增强的生成器。这些组件使RAG系统能够有效挖掘全局性长上下文信息并精准识别事实细节；LongRAG的表现优于长上下文大语言模型（提升6.94%）、先进RAG（提升6.16%）以及普通RAG（提升17.25%）。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18050), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1849494571946066295) |\n| 6) **大语言模型中的特征调控评估** - 通过人为调节不同特征的强度来观察模型输出的变化，从而评估大语言模型中的特征调控效果；研究重点在于与社会偏见相关的29个特征，并探讨特征调控是否有助于缓解社会偏见。研究发现，特征调控有时会导致非预期效应，而引入中立性特征则可在不降低文本质量的前提下，在9个社会维度上减少社会偏见。 | [论文](https:\u002F\u002Fwww.anthropic.com\u002Fresearch\u002Fevaluating-feature-steering), [推文](https:\u002F\u002Fx.com\u002FAnthropicAI\u002Fstatus\u002F1849840131412296039) |\n| 7) **Granite 3.0** - 推出了轻量级基础模型，参数规模从4亿到80亿不等；支持代码编写、RAG、推理和函数调用等功能，主要面向企业级应用场景，包括本地部署和端侧运行环境；在语言理解、推理、代码编写、函数调用及安全性等多个学术基准测试中表现出色。 | [论文](https:\u002F\u002Fgithub.com\u002Fibm-granite\u002Fgranite-3.0-language-models\u002Fblob\u002Fmain\u002Fpaper.pdf), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1848404138641527105) |\n| 8) **大语言模型反映了其创造者的意识形态** - 研究表明，大语言模型展现出多样化的意识形态立场，这与其创造者的世界观密切相关；研究还发现，同一款大语言模型在中文和英文语境下的回应存在一致性的规范性差异；此外，西方与非西方的大语言模型在地缘政治冲突中的关键角色问题上也存在规范性分歧。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.18417), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1849860985500352968) |\n| 9) **适用于大语言模型的可扩展水印技术** - 提出了一种名为SynthID-Text的文本水印方案，能够在保持文本质量的同时实现高检测准确率，并将延迟开销降至最低；该方案将水印技术与推测采样相结合，利用模型在单词选择时的最终得分模式以及调整后的概率分数来嵌入水印。作者通过评估近1000万条Gemini回复的反馈，验证了该方法的可行性和可扩展性。 | [论文](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41586-024-08025-4), [推文](https:\u002F\u002Fx.com\u002FGoogleDeepMind\u002Fstatus\u002F1849110263871529114) |\n| 10) **OpenAI o1模型的推理模式** - 与其他测试时计算资源利用方法相比，o1在大多数数据集上均取得了最佳性能；研究者观察到，o1最常用的推理模式是分治法和自我精炼；o1会根据不同的任务采用不同的推理策略：对于常识推理任务，o1倾向于识别上下文并强调约束条件；而对于数学和编程任务，则主要依赖于方法复用和分治法。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13639), 
[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1848782378631892997) |\n\n## 本周顶级机器学习论文（10月14日–10月20日）- 2024年\n| **论文**  | **链接** | \n| ------------- | ------------- | \n| 1) **Thinking LLMs** - 提出一种训练方法，使大型语言模型具备思考能力，从而在无需人工标注数据的情况下完成通用指令遵循任务；采用迭代搜索与优化流程来探索思维生成，使模型能够在无直接监督的情况下学习；针对每条用户指令生成的思维候选会由一个评判模型进行打分；仅对响应内容进行评估，以确定最佳和最差的选项；随后将对应的完整输出作为选择对和拒绝对用于DPO（本文称为“思维偏好优化”）。实验表明，在AlpacaEval和Arena-Hard基准上表现优异。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.10630), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1846227797972603047) |\n| 2) **Model Swarms** - 提出一种新的协作式搜索算法，利用群体智能来微调大型语言模型；一组LLM专家在权重空间中协同移动，优化代表各种微调目标的效用函数；实验表明，Model Swarms能够灵活地将LLM专家适配到单个任务、多任务领域、奖励模型以及不同的人类兴趣中。在不同任务和情境下，相比12种模型组合基线，性能提升可达21.0%。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11163), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1846592954921849029) |\n| 3) **聊天机器人中的一人称公平性** - 研究与ChatGPT交互的用户相关的一人称公平性问题；具体而言，衡量了模型对用户姓名可能存在的偏见；利用由GPT-4o驱动的模型分析聊天机器人对不同用户名回应中的模式及姓名敏感性；研究指出，总体而言，后训练阶段显著缓解了有害的刻板印象；同时报告称，在娱乐和艺术等开放性任务领域，模型表现出最高程度的偏见（即倾向于创作主角性别与用户姓名所推断性别一致的故事）。 | [论文](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002Ffirst-person-fairness-in-chatbots.pdf), [推文](https:\u002F\u002Fx.com\u002FOpenAINewsroom\u002Fstatus\u002F1846238809991925838) |\n| 4) **LLMs中的内省能力** - 报告称，LLMs可以通过内省获取其训练数据中无法推断的知识；认为LLMs自身包含一些特权信息，这些信息有望推动更可解释、更可控的系统发展；同时指出，这种内省能力存在局限性，模型在需要对长篇输出进行推理的任务上难以准确预测自身行为。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13787), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1847297594525094081) |\n| 5) **Janus** - 提出一种统一的自回归框架，用于多模态理解和生成；该框架将视觉编码拆分为独立路径，并借助单一Transformer架构来提升视觉理解和生成任务的灵活性与性能；声称能够缓解使用单一视觉编码器时常见的权衡问题；性能超越了以往的统一模型，并达到或超过特定任务模型的水平。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.13848), [推文](https:\u002F\u002Fx.com\u002Fdeepseek_ai\u002Fstatus\u002F1847191319464300652) |\n| 6) **长上下文RAG的推理规模效应** - 
使用两种策略研究RAG的规模效应：上下文学习（DRAG）和迭代提示（IterDRAG）；发现，在最优配置下，RAG性能会随着有效上下文长度的扩展而持续提升；当推理计算资源得到合理分配时，长上下文RAG的性能可实现线性增长；基于此，开发了一套计算资源分配模型，为长上下文RAG场景下的最优计算资源配置提供实用指导。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.04343), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1847350506127315088) |\n| 7) **Agent S** - 一个全新的开源代理框架，可通过GUI实现与计算机的自主交互；Agent S解决了知识获取、长周期任务规划以及动态界面处理等挑战；引入了经验增强的层次化规划方法，结合搜索与检索技术；通过代理-计算机接口进行推理并控制GUI代理；在OSWorld基准测试上的评估显示，Agent S的成功率比基线高出9.37%（相对提升83.6%），达到了新的最先进水平。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08164v1), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1846930425849303424) |\n| 8) **LLM合并中的模型亲缘关系** - 提出“模型亲缘关系”这一指标，用于衡量LLM之间的相似程度；该指标被用于制定一种模型合并策略（基于模型亲缘关系的Top-k贪心合并），从而获得更好的性能；作者发现，这一新标准可以有效地、持续地应用于模型合并。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12613), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1846753148007846329) |\n| 9) **关于OpenAI o1系列模型的规划能力** - 报告称，o1-preview在自我评估和约束遵循方面尤为出色；同时也指出，这些o1模型在决策和内存管理方面存在瓶颈，尤其在空间推理任务中更为明显；具体而言，模型会产生冗余动作，并且难以在空间复杂任务中进行泛化。 | [论文](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2409.19924), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1846032256902869135) |\n| 10) **CoTracker3** - 提出一种新的点跟踪模型及一种半监督训练方案；通过使用现成的教师模型生成伪标签，使真实视频无需人工标注即可用于训练；该方法在架构和训练方案上更为简单，却能在数据量减少至千分之一的情况下取得更好效果。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.11831), [推文](https:\u002F\u002Fx.com\u002FAIatMeta\u002Fstatus\u002F1846595406261899363) |\n\n## 本周顶级机器学习论文（10月7日–10月13日）- 2024年\n| **论文**  | **链接** | \n| ------------- | ------------- | \n| 1) **MLE-Bench** - 提出了一种用于评估机器学习智能体在机器学习工程能力方面的新基准；包含来自Kaggle的75个与机器学习工程相关的竞赛任务，测试模型训练、数据集准备和实验运行等ML工程技能；借助AIDE支架工具，OpenAI的o1-preview模型在16.9%的竞赛中达到了Kaggle铜牌水平。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.07095), [推文](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1844429536353714427) |\n| 2) **差分Transformer** - 
提出了一种差分注意力机制，能够在放大对相关上下文关注的同时抑制噪声；随着模型规模和训练数据量的增加，差分Transformer的表现优于传统Transformer；作者认为，由于该架构较少被无关上下文“干扰”，因此在长上下文建模、关键信息检索、幻觉缓解、上下文学习以及减少激活异常值等方面具有优势。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05258), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1843694897020150216) |\n| 3) **精明RAG** - 提出了一种新颖的RAG方法，以应对大型语言模型在检索增强和知识冲突方面的不足；精明RAG能够自适应地从LLM的内部知识中提取关键信息，然后以源意识为导向，迭代整合内外部知识；其设计旨在通过交互式整合机制（即识别一致段落、检测其中的冲突信息并过滤掉无关内容）更好地融合内部与外部信息。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.07176), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1844435988019544565) |\n| 4) **ToolGen** - 通过将工具表示为一种独特标记，直接将工具知识融入大型语言模型中，使模型能够生成工具调用及其参数，从而实现无缝的工具调用与语言生成；基于超过47,000种工具的实验结果表明，ToolGen在工具检索和自主任务完成方面均取得了优异效果。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.03439), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1843491766114422930) |\n| 5) **长上下文LLM与RAG结合** - 研究发现，对于许多长上下文LLM而言，随着引入的文本片段数量增加，输出质量会下降；报告称性能下降的原因是检索到了“硬负样本”；为此提出了两种改进长上下文LLM驱动RAG的方法：重排序检索结果，以及进行针对RAG的微调，加入中间推理以帮助判断相关性；这些方法显著提升了长上下文RAG任务的准确性和鲁棒性。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05983), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1844828836619334066) |\n| 6) **GSM-符号化** - 使用基于符号模板构建的基准测试了几种当前最先进模型，该基准可生成多样化的数学问题；研究发现，LLM在回答同一问题的不同变体时表现存在差异；当调整题目中的数值时，所有模型的性能都会下降；而随着题目难度的增加（例如增加条件句的数量），性能则会显著恶化；作者推测，观察到的性能下降可能是由于当前LLM缺乏逻辑推理能力所致。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05229), [推文](https:\u002F\u002Fx.com\u002FMFarajtabar\u002Fstatus\u002F1844456880971858028) |\n| 7) **Optima** - 提出了一种新框架，通过训练大型语言模型来提升基于LLM的多智能体系统的通信效率和任务执行效果；该框架采用迭代式的生成、排序、选择和训练范式，并结合奖励函数以优化性能、降低token消耗并提高通信效率；同时引入受蒙特卡洛树搜索启发的技术用于DPO数据生成，以促进多样化的探索；在需要大量信息交换的任务上，相较于单智能体基线和基于Llama 3 8B的普通MAS，Optima表现出持续的性能提升，且仅使用不到10%的token便实现了2.8倍的性能增益。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.08115), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1844578931732844963) |\n| 8) **ScienceAgentBench** - 
一个用于严格评估专为科学工作流设计的智能体的新基准；在对开源及专有大型语言模型进行测试后，表现最好的智能体也仅能独立完成32.4%的任务，而在专家提供知识的情况下，这一比例也仅为34.3%。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.05080), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1843697964243382586) |\n| 9) **加法就是全部所需** - 提出了一种算法，能够用整数加法近似浮点数乘法运算；该算法的计算复杂度低于8位浮点数，但精度更高；作者报告称，在张量处理硬件中应用所提出的L-Mul操作，有望将逐元素浮点张量乘法的能耗降低95%，并将点积运算的能耗降低80%。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.00907), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1844043652966072742) |\n| 10) **LLM的说服力与反社会行为** - 研究了在具有社会层级结构的多智能体环境中，LLM之间的互动模式；研究场景设定为一名狱警与一名寻求额外庭院活动时间或越狱的囚犯；研究发现，在涉及权力动态的多智能体环境中，LLM之间难以展开有效对话；此外，研究还指出，智能体的角色设定对其行为起着关键作用，甚至在未明确提示的情况下，仅仅分配角色就会导致反社会行为的发生。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.07109), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1844427182141211054) |\n\n## 本周顶级机器学习论文（9月30日—10月6日）- 2024年\n| **论文**  | **链接** | \n| ------------- | ------------- | \n| 1) **Movie Gen** - 一套用于生成高质量1080p高清视频的基础模型，支持不同宽高比及同步音频；其中300亿参数的模型可处理7.3万视频标记的上下文长度，从而以16帧\u002F秒的速度生成16秒长的视频。此外，还提出了一个130亿参数的视频转音频生成模型，以及通过后训练获得的新型视频编辑模型，在文本到视频合成、视频个性化、视频转音频生成等任务上均达到当前最优性能。  | [论文](https:\u002F\u002Fai.meta.com\u002Fstatic-resource\u002Fmovie-gen-research-paper), [推文](https:\u002F\u002Fx.com\u002FAIatMeta\u002Fstatus\u002F1842188252541043075) |\n| 2) **我们是否只需要RNN？** - 重新审视RNN，并指出通过移除输入门、遗忘门和更新门中的隐藏状态，RNN可以高效地并行训练；这一改变使得LSTM和GRU等架构不再需要进行时间反向传播（BPTT）。研究者提出了minLSTM和minGRU，其在序列长度为512时速度提升了175倍。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01201), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1842246985790914608) |\n| 3) **大语言模型知道的远比它们表现出来的多** - 研究发现，大语言模型中的“真实性”信息集中在特定的标记上，这一洞察有助于提升错误检测性能并进一步缓解相关问题。此外，研究者还声称，可以通过内部表示预测大语言模型可能犯下的错误类型。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.02707), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1842240840389001381) |\n| 4) **面向推理时技术的架构搜索框架** - 
提出了一种模块化框架，用于通过组合多种推理时技术来构建和优化大语言模型。该方法将大语言模型系统设计的挑战重新定义为超参数优化问题。在MT-Bench和CodeContests等基准测试上进行验证后，Archon超越了GPT-4o和Claude 3.5 Sonnet等领先模型，平均准确率提升了15.1%。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.15254), [推文](https:\u002F\u002Fx.com\u002FAzaliamirh\u002Fstatus\u002F1840892626096345530) |\n| 5) **RATIONALYST** - 一种用于推理过程监督的模型，能够实现对多样化推理任务的泛化。该过程通过在Pile数据集中的7.9万个推理理由以及少量人工干预的推理数据集上进行预训练来实现。基于LLaMa-3-8B微调后的模型，在7个推理基准测试中，推理准确率平均提高了3.9%。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01044) |\n| 6) **对o1-preview的分析** - 报告指出，像o1-preview这样的大型推理模型虽然在更复杂的任务上有所提升，但其定性趋势与之前的大语言模型相似。o1对示例和任务的概率较为敏感，在高概率场景下表现更好，且所需的“思考标记”数量也少于低概率场景。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01792), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1841842414157472240) |\n| 7) **FRAMES** - 一个统一的框架，用于评估大语言模型提供事实性回答的能力、检索能力以及生成最终答案所需的推理能力。该框架包含需要整合多源信息的多跳问题。报告指出，当前最先进的大语言模型在该任务上表现不佳，未使用检索功能时仅能达到40%的准确率。而提出的多步检索方法可将性能提升至66%的准确率。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12941), [推文](https:\u002F\u002Fx.com\u002F_philschmid\u002Fstatus\u002F1840628834275602585) |\n| 8) **并非所有大语言模型的推理能力都相同** - 深入研究了大语言模型在小学数学问题解决方面的能力。报告指出，大语言模型在推理能力上存在显著差距，尤其是在解决组合型题目与独立解决问题时，性能差异巨大。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.01748), [推文](https:\u002F\u002Fx.com\u002FarianTBD\u002Fstatus\u002F1841875515860517130) |\n| 9) **对o1的评估** - 对OpenAI的o1-preview大语言模型进行了全面评估，结果显示其在诸多任务上表现出色，包括竞技编程、生成连贯准确的放射科报告、高中水平的数学推理任务、芯片设计任务、人类学与地质学、量化投资、社交媒体分析等领域和问题。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.18486), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1840953712635732006) |\n| 10) **为更好的少样本图像合成设计先验分布** - 使用有限数据训练GAN等生成模型颇具挑战。目前的隐式最大似然估计方法（IMLE）在训练时选择的潜在代码与推理时选择的潜在代码之间对应关系不足。所提出的RS-IMLE方法通过改变训练时的先验分布，提升了测试时的性能，并实现了更高品质的图像生成。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.17439), [推文](https:\u002F\u002Fx.com\u002FKL_Div\u002Fstatus\u002F1841729946302943295) |\n\n## 本周顶级机器学习论文（9月23日至9月29日）- 2024年\n| **论文**  | **链接** | \n| 
------------- | ------------- | \n| 1) **Llama 3.2** - 发布了中小型视觉大模型（110亿和900亿参数），以及轻量级纯文本模型（10亿和30亿参数）；纯文本模型经过训练，支持128K令牌的上下文长度，在一系列任务上表现优于同类模型；视觉模型在图像理解任务上超越了Claude 3 Haiku等其他模型。 | [论文](https:\u002F\u002Fai.meta.com\u002Fblog\u002Fllama-3-2-connect-2024-vision-edge-mobile-devices\u002F)，[推文](https:\u002F\u002Ftwitter.com\u002FDoctor_Zou\u002Fstatus\u002F1782752058124554272) | \n| 2)  **Molmo**  - 发布了一组开源、最先进的多模态人工智能模型；Molmo系列中的720亿参数模型在开源权重和数据模型中表现最佳；在多个基准测试中，其性能也优于GPT-4o、Claude 3.5和Gemini 1.5等专有模型。 | [论文](https:\u002F\u002Fmolmo.allenai.org\u002Fpaper.pdf)，[推文](https:\u002F\u002Ftwitter.com\u002Femmanuel_vincze\u002Fstatus\u002F1708249637918752987) | \n| 3) **AlphaChip**  - 一种基于强化学习的方法，用于设计芯片的物理布局；据称AlphaChip已被用于谷歌TPU的另外三代产品中；此次发布还包括该方法的开源实现，以帮助对多种芯片模块进行预训练，从而应用于新模块；同时还发布了在20个TPU模块上预训练的模型检查点。 | [论文](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41586-024-08032-5)，[推文](https:\u002F\u002Ftwitter.com\u002FGoogleAI\u002Fstatus\u002F1676118998259507200) | \n| 4) **大型语言模型仍无法规划**  - 评估了o1等大型推理模型是否具备规划能力；研究发现，一种与领域无关的规划器可以解决所有Mystery Blocksworld实例，但大型语言模型即使在小型实例上也表现不佳；o1-preview在该任务上有效，但随着计划长度增加，性能会逐渐下降。结论指出，尽管o1在更具挑战性的规划问题上有所进展，但其准确率的提升并不能被视为通用或稳健。 |  [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.13373)，[推文](https:\u002F\u002Ftwitter.com\u002Fjohnxschulman\u002Fstatus\u002F1657558270450917378) | \n| 5) **规模扩大的指令型模型可靠性降低**  - 指出更大、更注重指令遵循的大型语言模型可能会变得不太可靠；研究从三个维度考察大型语言模型：难度一致性、任务回避和提示稳定性；结果表明，早期模型常常会回避用户问题，而规模扩大、指令优化后的模型则更容易给出看似合理却错误的答案，尤其是在人类监督者容易忽视的难题上。 |  [论文](https:\u002F\u002Fwww.nature.com\u002Farticles\u002Fs41586-024-07930-y)，[推文](https:\u002F\u002Ftwitter.com\u002Frylanmshea\u002Fstatus\u002F1583460628966346752) | \n| 6) **思维逻辑**  - 提出了一种名为“思维逻辑”（LoT）的新提示技术，利用命题逻辑从输入上下文中生成并注入扩展的逻辑信息；该技术使CoT在ReClor数据集上的性能提升了4.35%；在LogiQA数据集中，CoT+SelfConsistency的性能提升了5%；同时，它还将ToT在ProofWriter数据集上的表现提高了8%。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.17539)，[推文](https:\u002F\u002Ftwitter.com\u002FIsItPerplexity\u002Fstatus\u002F1704255260019798052) | \n| 7) 
**RAG及更远**  - 发表了一篇综述，介绍了一种RAG任务分类方法，可根据所需外部数据的类型和任务重点将用户查询划分为四个级别；总结了构建健壮的数据增强型LLM应用所面临的关键挑战，以及应对这些挑战的最有效技术。 |  [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.14924)，[推文](https:\u002F\u002Ftwitter.com\u002Fmishigna\u002Fstatus\u002F1703461946958463118) | \n| 8) **o1在医学领域的初步研究**  - 对o1-preview模型在医疗场景中的应用进行了初步探索；结果显示，在19个数据集和两个新创建的复杂问答场景中，o1的平均准确率分别比之前的GPT-4高出6.2%和6.6%；同时也指出了幻觉、多语言能力不一致以及评估指标差异等问题。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.15277)，[推文](https:\u002F\u002Ftwitter.com\u002FRichardEvans_AI\u002Fstatus\u002F1691963090436067397) | \n| 9) **小型语言模型综述**  - 一篇关于小型语言模型（SLMs）的全面综述，涵盖架构、训练数据集和训练算法；分析了59个最先进的开源SLM及其推理、上下文学习、数学和编程等能力；此外还讨论了设备端运行时成本、延迟、内存占用以及一些有价值的见解。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.15790)，[推文](https:\u002F\u002Ftwitter.com\u002Fsebatian_ruder\u002Fstatus\u002F1691611318636159002) | \n| 10) **Minstrel**  - 一个具有反思能力的多生成式智能体系统，可自动化生成结构化提示；它提出了LangGPT这一可扩展的提示设计框架。Minstrel构建于LangGPT之上，实验表明，无论是由Minstrel生成还是手动编写的结构化提示，在引导大型语言模型完成任务方面都表现得更好。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.13449)，[推文](https:\u002F\u002Ftwitter.com\u002FLiZhang1351\u002Fstatus\u002F1702992849091985677) |\n\n## 本周机器学习顶级论文（9月16日–9月22日）- 2024年\n| **论文**  | **链接** |\n| ------------- | ------------- |\n| 1) **Moshi** - 介绍了一种语音-文本基础模型和全双工语音对话框架；文中详细阐述了该系统的多个组成部分：Helium 是一个 70 亿参数的文本大语言模型；Mimi 是一种语义-声学神经音频编码器，在音频质量方面达到最先进水平；此外还提出了一种分层多流架构，能够以语音到语音的方式生成任意对话。 | [论文](https:\u002F\u002Fkyutai.org\u002FMoshi.pdf), [推文](https:\u002F\u002Fx.com\u002Fkyutai_labs\u002Fstatus\u002F1836427396959932492) |\n| 2) **通过强化学习训练大语言模型实现自我修正** - 开发了一种多轮在线强化学习方法，以提升大语言模型的自我修正能力；该方法完全基于模型自动生成的数据；实验表明，监督微调在学习自我修正方面效果不佳，并且存在训练数据与模型输出之间的分布不匹配问题；研究提出了一种两阶段方法，先优化修正行为，再通过奖励加成来增强训练过程中的自我修正能力；将其应用于 Gemini 1.0 Pro 和 1.5 Flash 模型后，在 MATH 和 HumanEval 基准测试上分别将基础模型的自我修正能力提升了 15.6% 和 9.1%，达到了当前最先进的水平。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12917), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1837228446839361984) |\n| 3) **Qwen2.5 Coder** - 
一系列包含 15 亿和 70 亿参数的模型；基于 Qwen2.5 架构，并在 5.5 万亿 tokens 上持续预训练；在超过 10 个基准测试中均取得最先进性能；具备强大的代码生成、补全、推理和修复能力。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12186), [推文](https:\u002F\u002Fx.com\u002Fhuybery\u002Fstatus\u002F1837170643563073960) |\n| 4) **思维图谱 (DoT)** - 通过数学严谨性提升大语言模型的推理能力；该方法将大语言模型的迭代推理建模为有向无环图的构建过程，将命题、批判、细化和验证整合进统一的 DAG 结构中；这使得 DoT 能够捕捉超越线性或树状结构的复杂逻辑推理。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.10038), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1835882277563179512) |\n| 5) **软件工程中的智能体** - 提供了关于基于大语言模型的智能体在软件工程领域应用框架的全面综述。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.09030), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1835705359723319702) |\n| 6) **使用思维链还是不使用？** - 研究哪些类型的任务最受益于思维链（CoT）提示；通过对 100 多篇论文的元分析及多次评估，发现思维链主要在涉及数学和逻辑的任务中表现出显著的性能优势；研究指出，思维链带来的大部分收益源于对符号执行的改进，但符号求解器的表现更胜一筹。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12183), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1836599280477299013) |\n| 7) **量化指令微调大语言模型的综合评估** - 对 70 亿至 4050 亿参数的大语言模型，在不同量化方法下的表现进行了评估；主要发现包括：1) 将较大规模模型量化到与较小 FP16 模型相近的规模时，通常在大多数基准测试中表现更好；2) 性能会因量化方法、模型规模和位宽的不同而显著变化，其中仅权重量化方法在大型模型中往往效果更佳；3) 任务难度对量化导致的精度下降影响不大。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.11055), [推文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.11055) |\n| 8) **思维迭代 (IoT)** - 提出了一种名为“思维迭代”的框架，通过自适应的推理路径来增强大语言模型的响应能力和推理能力；该框架利用一个内部对话代理作为引导者，动态调整推理路径，从而实现跨路径的自适应探索并提高响应准确性；与思维链和思维树（均为固定流程）不同，其提示生成是一个动态过程，能够灵活适应。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.12618), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1836977595847692671) |\n| 9) **薛定谔的记忆** - 利用通用逼近定理解释大语言模型的记忆机制。同时提出了一种新的评估大语言模型性能的方法，即比较不同模型的记忆容量；Transformer 架构可被视为一种动态拟合的 UAT 模型，具有强大的自适应输入拟合能力，从而使大语言模型能够仅凭少量输入信息就回忆起完整内容。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.10482), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1835882330323554321) |\n| 10) **数学越狱提示** - 使用 GPT-4o 生成数学编码的提示，作为一种有效的越狱技术；在 13 
种最先进模型上平均攻击成功率达到 73.6%；这一结果凸显了现有安全训练机制无法泛化到数学编码输入的问题。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.11445), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1836603922405806501) |\n\n## 本周机器学习顶级论文（9月9日–9月15日）- 2024年\n| **论文**  | **链接** |\n| ------------- | ------------- |\n| 1) **利用大语言模型学习推理** - 一个全新的大语言模型系列，通过强化学习训练，在回应复杂任务前先进行推理；它会生成较长的内部思维链，在科学、代码和数学相关任务中表现卓越；在2024年国际信息学奥林匹克竞赛中排名位于第49百分位，并且在科学相关基准测试上超越了人类博士级别的准确率。 -  | [论文](https:\u002F\u002Fopenai.com\u002Findex\u002Flearning-to-reason-with-llms\u002F)，[推文](https:\u002F\u002Fx.com\u002FOpenAI\u002Fstatus\u002F1834278217626317026) |\n| 2) **Chai-1** - 一种用于分子结构预测的新多模态基础模型，能够预测蛋白质、小分子、DNA、RNA等；在药物发现领域的多项任务中达到最先进水平；在PoseBusters基准测试中取得77%的成功率（AlphaFold 3为76%），并在CASP15蛋白质单体结构预测数据集上获得0.849的Cα LDDT值（ESM3-98B为0.801）。  | [论文](https:\u002F\u002Fwww.chaidiscovery.com\u002Fblog\u002Fintroducing-chai-1)，[推文](https:\u002F\u002Fx.com\u002Fjoshim5\u002Fstatus\u002F1833183091776721106) |\n| 3) **大语言模型能否生成新颖的研究思路** - 研究发现，由大语言模型生成的研究思路被认为比人类专家的想法更具新颖性（p \u003C0.05），但在灵活性方面略逊一筹；研究还指出，大语言模型代理在创意生成过程中缺乏多样性，且并非可靠的评估者。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.04109)，[推文](https:\u002F\u002Fx.com\u002FChengleiSi\u002Fstatus\u002F1833166031134806330) |\n| 4) **DataGemma** - 包含一系列针对Gemma 2模型的微调版本，旨在帮助大语言模型访问并整合数值与统计数据；提出了一种名为检索交错生成（RIG）的新方法，可将Data Commons中的公开统计数据可靠地融入大语言模型的回答中；RIG是一种工具启发式方法，能够将统计标记与自然语言问题交错排列，从而适于从Data Commons中检索信息；为实现这一能力，他们在Gemini 1.5协助下生成的指令-响应数据集上对大语言模型进行了微调；采用RIG方法后，事实准确性从5-7%提升至约58%。  | [论文](https:\u002F\u002Fdocs.datacommons.org\u002Fpapers\u002FDataGemma-FullPaper.pdf)，[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1834235024675406012) |\n| 5) **智能体工作流记忆** - 引入智能体工作流记忆机制，以提取常用的工作流程并在需要时提供给智能体；该机制支持离线与在线使用，旨在指导智能体的后续生成过程；其灵感来源于人类如何从过往经验中学习可重用的工作流程，并将其用于指导未来行动；声称在Mind2Web和WebArena基准测试上，以更高效的方式分别将基线成功率提升了24.6%和51.1%。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.07429)，[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1834059522198896706) |\n| 6) 
**小型语言模型在大语言模型时代的作用** - 深入探讨大语言模型与小型语言模型之间的关系；小型语言模型的常见应用包括数据整理、训练更强的模型、高效推理、评估、检索等；为从业者提供了深入理解这些小型语言模型价值的洞见。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06857)，[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1834063138586829273) |\n| 7) **LLaMa-Omni** - 一种用于实现低延迟语音交互的大语言模型架构；基于Llama-3.1-8B-Instruct，能够在接收到语音指令时同时生成文本和语音响应；响应延迟低至226毫秒；从架构上看，该模型包含语音编码器（Whisper-large-v3）、语音适配器、大语言模型以及语音解码器；研究团队还构建了一个包含20万条语音交互及对应响应的数据集。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06666)，[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1834227729241440340) |\n| 8) **大语言模型能否激发新颖的科学研究思路** - 探讨大语言模型是否能够生成新颖的科学研究思路；报告称，Claude和GPT模型往往更倾向于与作者对未来研究想法的观点保持一致；这一现象在科学、经济学和医学等多个领域均有体现。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.06185)，[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1833695968656793610) |\n| 9) **Sigmoid自注意力的理论、分析与最佳实践** - 提出Flash-Sigmoid，这是一种硬件感知且内存高效的Sigmoid注意力实现；在H100 GPU上，其推理内核速度相比FlashAttention-2最高可提升17%；研究表明，SigmoidAttn在各类任务和领域中与SoftmaxAttn的表现相当。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.04431)，[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1833522827842220244) |\n| 10) **实现大语言模型的峰值性能** - 从训练、推理和系统服务三个角度系统性地回顾了提升和加速大语言模型的方法；总结了围绕训练、硬件、可扩展性和可靠性的最新优化与加速策略。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.04833)，[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1833344402892460364) |\n\n## 本周顶级机器学习论文（9月2日至9月8日）- 2024年\n| **论文**  | **链接** |\n| ------------- | ------------- |\n| 1) **AlphaProteo** - 提出了一类用于蛋白质设计的机器学习模型；在七个目标蛋白上，与现有其他方法相比，其结合亲和力提高了3到300倍，实验成功率也更高；研究表明，AlphaProteo在PDB数据库中数百个目标蛋白上的表现与这七个目标相当。  | [论文](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002FDeepMind.com\u002FBlog\u002Falphaproteo-generates-novel-proteins-for-biology-and-health-research\u002FAlphaProteo2024.pdf), [推文](https:\u002F\u002Fx.com\u002FGoogleDeepMind\u002Fstatus\u002F1831710991475777823) |\n| 2) **长上下文大模型时代的RAG** - 
报告指出，长上下文大模型容易出现对相关信息关注不足的问题，而RAG系统正是为了解决这一问题（即更多地利用相关信息）；他们提出了一种保持顺序的RAG机制，可提升长上下文问答任务的性能；不过该方法并不完美，随着检索到的片段数量增加，回答质量会先上升后下降；他们提到存在一个最佳点，在那里可以用比长上下文大模型少得多的token数实现更好的质量。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.01666), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1831389521839267888) |\n| 3) **战略思维链** - 一种通过在中间CoT推理步骤之前融入战略知识来提升LLM性能的方法；该问题解决策略有助于引导CoT路径和最终答案的生成；声称使用Llama3-8b模型在GSM8K数据集上实现了21.05%的提升。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.03271v1) |\n| 4) **AI对高技能工作的有效影响** - 研究了生成式AI对软件开发人员的影响；结果显示，使用GitHub Copilot等AI工具的开发者完成的任务数量增加了26.08%；同时表明，经验较少的开发者更倾向于采用AI工具，并且生产效率提升更为显著。  | [论文](https:\u002F\u002Fpapers.ssrn.com\u002Fsol3\u002Fpapers.cfm?abstract_id=4945566), [推文](https:\u002F\u002Fx.com\u002Femollick\u002Fstatus\u002F1831739827773174218) |\n| 5) **OLMoE** - 推出了一个完全开源的稀疏混合专家架构的大语言模型。OLMoE参数规模为70亿，每个输入token仅激活10亿个参数；此外还有一款经过指令微调的版本，声称性能优于Llama-2-13B-Chat和DeepSeekMoE 16B。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.02060), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1831357563620753577) |\n| 6) **LongCite** - 利用现成的大语言模型合成大规模监督微调数据集，以改进带有引用的长上下文问答任务；训练了80亿和90亿参数的模型，能够在长文本上下文中增强引用生成能力并提高回答的准确性；声称在他们提出的LongBench-Cite基准测试中甚至超越了GPT-4o。   | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.02897),  [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1831522905009828051)  |\n| 7) **MemLong** - 使用外部检索器获取历史信息，从而增强长上下文大模型的能力；在多个长上下文基准测试中持续超越其他最先进模型，并且可以在单张3090 GPU上将上下文长度从4k扩展至80k。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.16967),  [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1830610367854112799)  |\n| 8) **RAG噪声在大模型中的作用** - 提出了一项基准测试（NoiserBench），用于衡量不同类型的噪声信息对RAG性能的影响；报告指出，在研究的多种有益噪声类型中（如语义噪声、数据类型噪声和非法句子噪声），非法句子噪声在不同模型和数据集上均表现出最强的性能提升效果。   | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.13533),  [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1830984315326660617)  |\n| 9) **超越偏好：AI对齐的新视角** - 
对当前主流的“人类偏好调优”这一AI对齐方法提出了挑战；阐述了人类偏好调优为何无法捕捉人类价值观的深层语义内涵；认为AI对齐需要重新定位，不应仅仅基于人类偏好进行对齐，而应基于与其社会角色相适应的规范性标准来进行对齐。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.16984), [推文](https:\u002F\u002Fx.com\u002Fxuanalogue\u002Fstatus\u002F1831044533779669136) |\n| 10) **基于大语言模型的软件工程智能体** - 一篇关于基于大语言模型的软件工程智能体的综述论文，涵盖了从需求工程、测试生成到软件维护等多个视角。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2409.02977), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1832115557749121385) |\n\n## 本周机器学习顶级论文（8月26日—9月1日）- 2024年\n| **论文**  | **链接** |\n| ------------- | ------------- |\n| 1) **GameNGen** - 一款由扩散模型驱动的游戏引擎，能够在复杂环境中实现长时间轨迹上的实时交互；采用两阶段训练流程，先由强化学习智能体学习，再由扩散模型生成帧；在单个TPU上即可以20 fps的帧率交互式地模拟DOOM。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.14837), [推文](https:\u002F\u002Fx.com\u002FiScienceLuvr\u002Fstatus\u002F1828617875432841490) |\n| 2) **用于时间序列分析的代理式RAG** - 提出了一种用于时间序列分析的代理式RAG框架；采用多智能体架构，由一个协调智能体调度专门的子智能体来完成时间序列任务；这些子智能体利用经过微调的小型语言模型，并能检索包含历史模式和趋势知识的相关提示，从而帮助提升对新数据的预测能力。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.14484), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1828838209461043455) |\n| 3) **AutoGen Studio** - 一个用于快速原型化AI智能体的低代码界面。它构建于AutoGen框架之上，也可用于调试和评估多智能体工作流。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.15247), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1829163090715529358) |\n| 4) **基于LLM的说服博弈** - 声称可以使用多智能体框架来提升LLM的说服效果；主智能体进行说服性对话，而辅助智能体则执行响应分析、信息检索等关键任务；研究发现，LLM能够促使用户改变观点并做出购买决策；例如，销售智能体可使用户观点产生71%的积极转变。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.15879), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1829156960291185117) |\n| 5) **更小、更弱，却更好** - 研究发现，相较于由更强但更昂贵的模型生成的数据，较弱且廉价（WC）模型生成的合成数据更适合用于模型的微调；总体而言，结果表明，WC模型可能是训练先进LLM推理模型的一种计算最优方法。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.16737), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1829526629787242878) |\n| 6) **Transfusion** - 
提出了一种针对离散与连续数据联合训练多模态模型的方法；将下一个标记预测与扩散模型相结合，用于在混合模态序列上训练Transformer模型；研究表明，该方法可扩展至70亿参数、2万亿多模态标记的规模，其性能可与同等规模的扩散模型和语言模型相媲美。 | [论文](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2408.11039), [推文](https:\u002F\u002Fx.com\u002FAIatMeta\u002Fstatus\u002F1828836885176967327) |\n| 7) **ReMamba** - 探讨了Mamba模型的长上下文能力和效率；Mamba由于其类似RNN的特性，存在长上下文不足的问题；通过以下压缩策略来浓缩信息：在第一次前向传播中保留top-k隐藏状态，并在第二次前向传播中利用Mamba的选择性机制将其融入状态空间；在LongBench上相比基线提升了3.2，在L-Eval上提升了1.6；该策略似乎也适用于Mamba 2。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.15496), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1829151312266637813) |\n| 8) **仅靠Text2SQL还不够** - 提出了表增强生成（TAG），这是一种用于回答数据库自然语言问题的统一框架；它代表了LLM与数据库之间更广泛且尚未被充分探索的交互方式；开发了一个基准测试，并发现标准方法正确回答查询的比例不超过20%。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.14717v1), [推文](https:\u002F\u002Fx.com\u002Flianapatel_\u002Fstatus\u002F1828939097487945948) |\n| 9) **音乐领域的基础模型** - 提供了当前最先进的音乐预训练模型和基础模型的全面综述。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.14340), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1828456481114538437) |\n| 10) **持续多模态预训练指南** - 一份关于持续多模态预训练的综合指南；介绍了FoMo-In-Flux，这是一个大规模、细粒度且长期的持续预训练基准测试。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.14471), [推文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.14471) |\n\n## 本周顶级机器学习论文（8月19日–8月25日）- 2024年\n| **论文**  | **链接** |\n| ------------- | ------------- |\n| 1) **自动化智能体系统设计** - 提出元智能体搜索，这是一种基于不断增长的先前发现档案，迭代地编程和测试新智能体的元智能体；声称通过他们的方法，可以学习任何可能的智能体系统，包括提示、工具使用、控制流等；他们通过关注三个主要组件来实现这一点，即搜索空间（定义智能体）、搜索算法（探索搜索空间）和评估函数（评估候选智能体）。   | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.08435), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1825378027347271719) |\n| 2) **LLM剪枝与蒸馏的实际应用** - 提供了一份关于有效压缩Llama 3.1和Mistral NeMo模型的综合报告；文中介绍了对原始模型分别应用剪枝和蒸馏方法，以生成4B和8B参数模型；在剪枝之前，他们还在自己的数据集上对教师模型进行微调，从而获得更好的蒸馏效果；他们的压缩策略产生了一款最先进的8B模型（MN-Minitron-8B），在常见语言建模基准测试中表现优于所有类似规模的模型。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.11796), 
[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1826676365044675042) |\n| 3) **Vizier高斯过程多臂老虎机算法** - 介绍Vizier算法，该算法基于高斯过程多臂老虎机优化，已被谷歌用于数百万次优化和研究；文中提供了Vizier算法的开源Python实现，并附有基准测试结果，验证了其更广泛的适用性。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.11527), [推文](https:\u002F\u002Fx.com\u002FXingyouSong\u002Fstatus\u002F1826554454084333723) |\n| 4) **表格数据上的语言建模** - 呈现了一份关于表格数据语言建模技术的全面综述；内容包括表格数据结构和数据类型的分类、用于模型训练和评估的数据集、建模技术和训练目标、数据处理方法、流行架构以及面临的挑战和未来研究方向。  | [论文](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2408.10548), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1826094372179366023) |\n| 5) **提升LLM的鲁棒性** - 提出一种两阶段提示技术，用于从上下文中移除无关信息；该技术作为一种自我缓解机制，首先识别无关信息，然后将其过滤掉；这有助于提升模型的鲁棒性，并在推理任务中取得更好的整体性能。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.10615), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1826451091774447983) |\n| 6) **GraphRAG方法的全面概述** - 重点讨论应用于GraphRAG工作流的技术（基于图的索引、图引导的检索以及图增强的生成）；考察了GraphRAG的任务、应用、评估及工业用例。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.08921), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1825937537782698377) |\n| 7) **MagicDec** - 展示了推测解码如何在长上下文生成场景中提升吞吐量、降低延迟并保持准确性；研究发现，随着序列长度和批量大小的增加，瓶颈会从计算密集型转变为内存密集型；基于这些洞察，他们表明即使在使用大批次时，也可以更有效地利用推测解码来处理更长的序列。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.11049), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1826090969906778122) |\n| 8) **LLM的可控文本生成** - 提供了一份关于LLM中可控文本生成方法的全面综述；讨论了安全性、一致性、风格和有用性等问题。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.12599), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1826824199010132429) |\n| 9) **PEDAL** - 使用一种混合自集成方法（基于多样化的示例）来提升LLM的整体性能；具体而言，它利用多样化的示例生成多个候选响应，再通过LLM将这些响应聚合为最终输出；与贪婪解码相比，这种方法能获得更高的准确性，而成本则低于自一致方法。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.08869), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1825373675631071609) |\n| 10) **LLM实践中的挑战与应对** - 汇编了一组重要问题及其富有洞见的答案；问题按基础设施、软件架构、数据、应用和脑科学等主题进行了分类。 | 
[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.09416), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1825932441980162374) |\n\n## 本周顶级机器学习论文（8月12日至8月18日）- 2024年\n| **论文**  | **链接** |\n| ------------- | ------------- |\n| 1) **AI科学家** - 一种新型AI代理，能够自主开发并撰写一篇完整的会议级别科学论文，成本低于15美元；它通过使前沿大语言模型开展独立研究并总结发现来自动化科学发现过程；此外，还使用自动评审器对生成的论文进行评估，声称在论文评分方面接近人类水平，并能产出经其自动评审器判定超过顶级机器学习会议接受门槛的论文。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.06292), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1823189280883097788) |\n| 2) **Grok-2** - 一款具备强大代码、数学和推理能力的新前沿模型，包含大型和小型两个版本；在LMSYS聊天机器人竞技场上表现优于Claude 3.5 Sonnet和GPT-4-Turbo；声称在指令遵循、检索、工具使用及事实准确性等方面均有提升；在MMLU和HumanEval基准上与Claude 3.5 Sonnet（6月发布）和GPT-4o（5月发布）展开竞争。  | [论文](https:\u002F\u002Fx.ai\u002Fblog\u002Fgrok-2), [推文](https:\u002F\u002Fx.com\u002Fxai\u002Fstatus\u002F1823597788573098215) |\n| 3) **LongWriter** - 提出AgentWrite框架，使现成的大语言模型能够生成超过2万词的连贯文本；AgentWrite将长文本生成任务分解为多个子任务，采用分而治之的方式逐步完成；该代理将任务拆解为多个写作子任务，并将各部分结果拼接成最终输出（即规划+写作）。随后，利用该方法构建SFT数据集，用于微调大语言模型，使其能够自动生成更长且连贯的文本。一个90亿参数的模型经DPO进一步优化后，在其基准测试中达到最先进水平，并超越了专有模型。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.07055), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1823551063946850712) |\n| 4) **EfficientRAG** - 训练一个自编码器语言模型来标记和打标签文档片段；它检索相关片段，将其标注为\u003C终止>或\u003C继续>,并对\u003C继续>的片段进行注释以便持续处理；随后训练一个过滤模型，根据原始问题和先前的标注生成下一轮查询；这一过程迭代进行，直到所有片段都被标记为\u003C终止>或达到最大迭代次数为止。当上述流程已收集到足够信息以回答初始问题时，最终的生成器（一个语言模型）会生成最终答案。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.04259), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1822744591810114044) |\n| 5) **RAGChecker** - 一种细粒度的评估框架，用于诊断RAG系统中的检索和生成模块；研究表明，RAGChecker与人工判断具有更好的相关性，并揭示了RAG架构设计选择中的一些深刻模式和权衡取舍。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.08067), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1824460245051081216) |\n| 6) **HybridRAG** - 将GraphRAG和VectorRAG相结合，形成混合RAG系统，其性能均优于单独使用任一方法；该系统已在一组金融收益电话会议记录上进行了测试。结合两种方法的优势，能够提供更准确的查询答案。  | 
[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.04948), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1822832843455648000) |\n| 7) **rStar** - 引入自我博弈式相互推理机制，无需微调或借助更强大的模型即可提升小型语言模型的推理能力；通过从小型语言模型获取类人推理动作，对蒙特卡洛树搜索进行增强，从而构建更为丰富的推理路径；另一台小型语言模型则对这些路径提供无监督反馈，目标小型语言模型据此选择最终的推理路径作为答案。rStar将LLaMA2-7B在GSM8K任务上的准确率从12.51%提升至63.91%，并持续提高其他小型语言模型的准确性。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.06195), [推文](https:\u002F\u002Fx.com\u002FAtakanTekparmak\u002Fstatus\u002F1823776878747877572) |\n| 8) **LLM推理时计算资源的最优扩展** - 研究大语言模型推理时计算资源的扩展行为；具体而言，分析在给定固定推理计算资源的情况下，大语言模型还能提升多少性能；研究发现，不同扩展策略的效果因提示难度而异。基于此，提出了一种自适应的计算资源最优策略，相较于最佳N次采样基线，可将效率提升4倍以上；报告指出，在FLOPs匹配的评估中，经过优化的推理时计算资源扩展甚至可以超越规模大14倍的模型。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.05109), [推文](https:\u002F\u002Fx.com\u002Fsea_snell\u002Fstatus\u002F1821263798772363598) |\n| 9) **MedGraphRAG** - 针对医疗领域的图结构框架，专注于增强大语言模型能力并生成循证结果；采用混合静态语义方法对文档进行分块，以更好地捕捉上下文；通过图结构表示实体和医学知识，形成一个相互连接的全局图；该方法提高了精确度，并在多个医疗问答基准测试中优于当前最先进模型。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.04187), [推文](https:\u002F\u002Fx.com\u002FMarktechpost\u002Fstatus\u002F1823069406924288110) |\n| 10) **NL2SQL综述** - 一份关于由大语言模型驱动的NL2SQL技术的全面综述；涵盖模型、数据收集、评估方法以及错误分析等内容。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.05109), [推文](https:\u002F\u002Fx.com\u002F_reachsumit\u002Fstatus\u002F1822835969743347815) |\n\n## 本周机器学习顶级论文（8月5日—8月11日）- 2024年\n| **论文**  | **链接** |\n| ------------- | ------------- |\n| 1) **SAM 2** - 一个开源的统一模型，用于在图像和视频中实现实时、可提示的对象分割；无需自定义适配即可应用于未见过的视觉内容；为实现视频中准确的掩码预测，引入了一种记忆机制来存储目标对象及其先前交互的信息；该记忆模块还支持对任意长度视频的实时处理；SAM2 在17个零样本视频数据集上的交互式视频分割任务中显著优于先前方法，同时所需的人工干预次数减少了三倍。  | [论文](https:\u002F\u002Fai.meta.com\u002Fresearch\u002Fpublications\u002Fsam-2-segment-anything-in-images-and-videos\u002F)，[推文](https:\u002F\u002Fx.com\u002FAIatMeta\u002Fstatus\u002F1818055906179105010) |\n| 2) **结构化生成限制推理能力** - 研究结构化生成是否会影响大型语言模型的推理能力和领域知识的全面性；观察到，与自由格式的回答相比，施加格式限制会导致语言模型的推理能力显著下降；当对推理任务应用更严格的格式约束时，这种退化效应会进一步加剧。  
| [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.02442)，[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1822357786820284555) |\n| 3) **从大型语言模型到面向软件工程的基于LLM的智能体** - 一篇关于当前软件工程领域基于大型语言模型的智能体实践与解决方案的综述论文；涵盖了需求工程、代码生成、测试生成以及自主决策等重要主题；还包括不同软件工程应用场景中使用的基准测试、评估指标和模型。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.02479)，[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1821549401866686604) |\n| 4) **Transformer解释器** - 介绍一款开源的交互式工具，用于了解Transformer模型的内部工作机制；它会在用户的浏览器中本地运行一个GPT-2实例，并允许用户使用自己的输入进行实验。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.04619)，[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1821986172215742716) |\n| 5) **增强用于RAG的大型语言模型** - 推出RAGFoundry，一个用于RAG场景的增强型大型语言模型开源框架；支持数据构建、训练、推理和评估；其一项有用的应用是创建数据增强型数据集，用于在RAG环境中微调和评估大型语言模型。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.02545)，[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1820864003590995973) |\n| 6) **利用弱模型和强模型合成Text-to-SQL数据** - 提出一种集成式合成数据方法，以构建高度专业化的最先进Text-to-SQL模型SENSE；来自强模型的合成数据增强了数据多样性，而来自弱模型的有价值错误数据则结合执行器，通过执行反馈进行学习；采用偏好学习技术对大型语言模型进行指令微调，使其能够从正确和错误样本中共同学习；SENSE在SPIDER和BIRD基准测试上取得了最先进的结果，缩小了开源模型与使用闭源模型的方法之间的性能差距。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.03256)，[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1821227584920621061) |\n| 7) **对话式提示工程** - 提出一种方法，帮助用户通过交互方式明确期望输出，从而创建个性化提示；该方法包含两个阶段：1) 模型根据用户提供的未标注数据生成初始指令，2) 模型分享输出，用户针对输出和指令提供反馈并进行优化；这一迭代过程最终生成个性化的少样本提示，在目标任务上表现更好、更优化。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.04560)，[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1821981401861718488) |\n| 8) **自训练评估者** - 一种仅使用合成训练数据来改进基于模型的评估者的方法；首先生成对比性输出（优质和劣质的模型响应），然后训练一个“作为裁判的大型语言模型”来生成推理轨迹和最终判断；该自我改进方案会以迭代方式重复训练过程，利用其改进后的预测结果；声称性能超越GPT-4等大型语言模型裁判，并可媲美基于标注样本训练的顶尖奖励模型；在RewardBench上，将强大的Llama3-70BInstruct模型得分从75.4提升至88.3（采用多数投票后为88.7）。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.02666)，[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1820849115607044401) 
|\n| 9) **RAGEval** - 提出一个简单的框架，用于自动生成评估数据集，以评估不同大型语言模型在不同场景下的知识运用情况；该框架基于种子文档定义模式，进而生成多样化的文档，从而形成问答对；这些问答对既基于文章内容，也基于配置设置。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.01262)，[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1820507831491239978) |\n| 10) **Mamba综述** - 对现有跨领域、跨任务的Mamba基模型进行了系统性回顾；重点关注Mamba基模型的最新进展、将Mamba适配到多样化数据的技术、Mamba擅长的应用场景以及有前景的研究方向。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.01129)，[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1821556218168549561) |\n\n## 本周顶级机器学习论文（7月29日—8月4日）- 2024年\n| **论文**  | **链接** |\n| ------------- | ------------- |\n| 1) **元奖励大模型** - 提出了一种无需人类监督的自我改进对齐技术，其中大模型会评估自身的判断，并利用反馈来提升其判断能力；研究表明，采用这种“大模型作为元法官”的方法可以提高大模型的判断能力和指令遵循能力；单纯通过自我改进生成更好的响应（执行角色）会很快达到饱和；而该工作则通过增强大模型的自我评估能力（判断角色），以避免奖励黑客等问题；此外，还引入了第三种角色——元法官，用于评估模型自身的判断。   | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.19594), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1818680848058585119) |\n| 2) **MindSearch** - 提出了一种基于大模型的多智能体框架，用于执行复杂的网络信息检索与整合任务；一个网络规划器能够有效分解复杂查询，随后由网络搜索器在互联网上进行分层信息检索，以提高检索结果的相关性；规划组件基于迭代图构建技术，更好地建模复杂的问题解决过程；该多智能体框架通过将推理和检索任务分配给专门的智能体，能够更好地处理长上下文问题。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.20183), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1818673381069226053) |\n| 3) **自推理增强的RAG** - 提出了一种端到端的自推理框架，以提升RAG系统的可靠性和可追溯性；该框架利用大模型自身生成的推理轨迹；大模型负责执行以下三个过程：1) 相关性感知：判断检索文档与问题之间的相关性；2) 证据导向选择：挑选并引用相关文档，然后从所引用的文档中自动选取关键语句片段作为证据；3) 轨迹分析：基于前两个过程收集的所有自推理轨迹，生成简洁的分析报告，并给出最终的推断答案。这种方法有助于模型更加精准地筛选、推理并区分相关与不相关的文档，从而提升整个RAG系统的准确性；该框架仅使用2,000个由GPT-4生成的训练样本，便达到了与GPT-4相当的性能。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.19813), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1818139150882664696) |\n| 4) **约束式思维链** - 在不牺牲性能的前提下限制模型推理输出长度；研究表明，将LLaMA2-70b的推理过程限制在100字以内，可在GSM8K数据集上将准确率由36.01%（CoT）提升至41.07%（CCoT），同时平均输出长度减少28字。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.19825), 
[推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1818133220484898992) |\n| 5) **面向对话系统的自适应RAG** - 开发了一个门控模型，用于预测对话系统是否需要借助RAG来提升其响应质量；研究显示，基于RAG的对话系统具有生成高质量回复和高置信度回答的潜力；同时指出，生成回复的置信度水平与增强知识的相关性之间存在显著关联。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21712), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1818843407977959756) |\n| 6) **ShieldGemma** - 基于Gemma 2推出了一套全面的大模型安全内容审核模型，涵盖危险内容、毒性、仇恨言论等关键有害类型分类器。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21772), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1818837753292853349) |\n| 7) **人格化代理评估** - 提出了一个用于评估大模型中人格化代理能力的基准测试；研究发现，尽管Claude 3.5 Sonnet是一款更为先进的模型，但其PersonaScore相比GPT 3.5仅提升了2.97%。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.18416), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1817964944949739544) |\n| 8) **机器遗忘综述** - 对生成式AI中的机器遗忘技术进行了全面综述。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.20516), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1818476462262906985) |\n| 9) **ThinK** - 提出了一种解决KV缓存内存消耗低效问题的方法；重点关注长上下文场景和推理环节，提出了一种基于查询的KV缓存修剪方法，在最小化注意力权重损失的同时，有选择地修剪最不重要的通道。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.21018), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1818474655461621903) |\n| 10) **拒绝的艺术** - 对当前实现大模型拒绝能力的方法进行了综述；提供了用于衡量大模型弃权行为的评估基准和指标。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.18418), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1817961056465035596) |\n\n## 本周机器学习顶级论文（7月22日—7月28日）- 2024年\n| **论文**  | **链接** |\n| ------------- | ------------- |\n| 1) **Llama 3.1** - 一系列包含80亿、700亿和4050亿参数的大型语言模型；支持八种语言，上下文窗口扩展至128K个标记；在通用知识、数学推理和工具使用等能力上表现具有竞争力，在某些情况下甚至超越了当前最先进模型。  | 
[论文](https:\u002F\u002Fscontent.fbze2-1.fna.fbcdn.net\u002Fv\u002Ft39.2365-6\u002F452387774_1036916434819166_4173978747091533306_n.pdf?_nc_cat=104&ccb=1-7&_nc_sid=3c67a6&_nc_ohc=t6egZJ8QdI4Q7kNvgHPkimJ&_nc_ht=scontent.fbze2-1.fna&oh=00_AYCV8TJ9rZquHu-nvz4-TFSZXLmCjer_LVQTms1bFpzHpA&oe=66A5D24D), [推文](https:\u002F\u002Fx.com\u002FAIatMeta\u002Fstatus\u002F1815766327463907421) |\n| 2) **AlphaProof & AlphaGeometry 2** - 解决了今年国际数学奥林匹克竞赛6道题中的4道，相当于银牌水平；AlphaProof由一个Gemini模型组成，该模型可自动将自然语言的问题陈述转化为形式化表述（即形式化网络）；随后，求解网络会搜索证明或反证，并利用AlphaZero逐步自我训练，以学会解决更复杂的问题；AlphaGeometry 2则是一种神经符号混合系统，用于证明几何问题；它基于Gemini模型，从零开始在大量合成数据上进行训练。 | [论文](https:\u002F\u002Fdeepmind.google\u002Fdiscover\u002Fblog\u002Fai-solves-imo-problems-at-silver-medal-level\u002F), [推文](https:\u002F\u002Fx.com\u002FJeffDean\u002Fstatus\u002F1816498336171753948) |\n| 3) **RAG vs. 长上下文LLM** - 比较了RAG与长上下文LLM，发现长上下文LLM在平均性能上优于RAG，而RAG的成本则显著更低；提出了Self-Route方法，利用自我反思机制将查询路由至RAG或LC；报告称，Self-Route可在保持与LC相当性能的同时大幅降低计算成本。   | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.16833), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1816495687984709940) |\n| 4) **OpenDevin** - 提出了一种用于开发能够通过软件与世界交互的通用型智能体的平台；其功能包括：1) 智能体、接口和环境之间交互的机制，2) 包含沙箱操作系统和网络浏览器的环境供智能体使用，3) 用于创建和执行代码的接口，4) 多智能体支持，以及5) 评估框架。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.16741), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1816872317286281688) |\n| 5) **LazyLLM** - 介绍了一种用于高效长上下文LLM推理的新型动态令牌剪枝方法；该方法可使Llama 2 7B模型的预填充阶段加速2.34倍，并保持高精度；它在预填充和解码阶段都会选择性地计算对下一个标记预测至关重要的KV值；允许语言模型在不同生成步骤中动态选择来自上下文的不同令牌子集，即使这些令牌在之前的步骤中已被剪枝。  | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.14057), [推文](https:\u002F\u002Fx.com\u002Fomarsar0\u002Fstatus\u002F1815225416409309264) |\n| 6) **教导LLM智能体自我改进** - 声称可以通过额外的环境反馈，在多轮对话中迭代微调LLM，使其具备自我改进的能力；LLM能够在后续迭代中递归地检测并纠正之前的错误；在推理任务（GSM8K和MATH）上提升了7B模型的自我改进能力，其改进程度甚至超过了强大的专有模型。 | [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.18219), 
[Tweet](https://x.com/omarsar0/status/1816671382585114855) |
| 7) **Text-to-SQL Survey** - a survey of LLM-based Text-to-SQL, covering prompt engineering techniques, fine-tuning methods, benchmarks, and more.  | [Paper](https://arxiv.org/abs/2407.15186), [Tweet](https://x.com/omarsar0/status/1815599057974223015) |
| 8) **MINT-1T** - open-sources a large-scale multimodal interleaved dataset of one trillion tokens containing 3.4 billion images; also adds new sources such as PDFs and ArXiv papers.  | [Paper](https://arxiv.org/abs/2406.11271), [Tweet](https://x.com/omarsar0/status/1816250935930142834) |
| 9) **Model Collapse on Synthetic Data** - investigates the effect of training models on recursively generated data; finds that training on model-generated content can cause irreversible defects in which the original content distribution disappears; shows that this effect, called "model collapse", occurs in LLMs, VAEs, and GMMs; although tested on smaller-scale models (~100M parameters), the authors argue the effect is very likely to carry over to larger models over time. | [Paper](https://www.nature.com/articles/s41586-024-07566-y), [Tweet](https://x.com/alexandr_wang/status/1816491442069782925) |
| 10) **Mitigating Hallucination via Generation Constraint** - a new training-free approach to mitigate hallucination in LLMs; scales the readout vectors that constrain generation in a memory-augmented LLM decoder; recent work suggests that LLMs with explicit memory mechanisms help reduce hallucination; this work uses a memory-augmented LLM and applies lightweight memory primitives in the decoder to constrain generation and thereby reduce hallucination. | [Paper](https://arxiv.org/abs/2407.16908), [Tweet](https://x.com/omarsar0/status/1816491986209104104) |

## Top ML Papers of the Week (July 15 - July 21) - 2024
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) **Improving Legibility of LLM Outputs** - iteratively trains small verifiers to predict solution correctness, helpful provers to produce correct solutions accepted by the verifier, and "sneaky" provers to produce incorrect solutions that fool the verifier; this process helps train models whose outputs are both correct and easy for humans and AI systems to understand, leading to more trustworthy systems.  | [Paper](https://arxiv.org/abs/2407.13692), [Tweet](https://x.com/OpenAI/status/1813623470452064432) |
| 2) **SpreadsheetLLM** - an efficient encoding method to optimize LLM understanding and reasoning over spreadsheets; develops a sheet compressor consisting of structural-anchor-based compression, inverse index translation, and data-format-aware aggregation modules to efficiently compress and encode spreadsheets; improves performance on spreadsheet table detection by 25.6% in GPT-4's in-context learning setting.  | [Paper](https://arxiv.org/abs/2407.09025), [Tweet](https://x.com/_akhaliq/status/1812674543963578794) |
| 3) **Context Embeddings for Efficient Answer Generation in RAG** - 
an effective context compression method to shorten long contexts and speed up generation in RAG systems; compresses long contexts into a small number of context embeddings, allowing different compression rates that trade off decoding time against generation quality; reduces inference time by up to 5.69x and GFLOPs by up to 22x while maintaining high performance.  | [Paper](http://arxiv.org/abs/2407.09252), [Tweet](https://x.com/omarsar0/status/1812937765769867561) |
| 4) **Weak-to-Strong Reasoning** - shows how to use weak supervision to elicit strong reasoning capabilities in LLMs without relying on human annotations or advanced models; reports that strong models can automatically refine their training data without being explicitly trained to do so; this expands a model's learning scope and scales up reasoning performance.  | [Paper](https://arxiv.org/abs/2407.13647), [Tweet](https://x.com/omarsar0/status/1814130275485704597) |
| 5) **Survey of Prompt Engineering Methods in LLMs** - a collection of prompt engineering methods applicable to a wide range of NLP tasks.  | [Paper](https://arxiv.org/abs/2407.12994), [Tweet](https://x.com/omarsar0/status/1814135222562165104) |
| 6) **Does Refusal Training in LLMs Generalize to the Past Tense?** - finds that simply rephrasing a request into the past tense can jailbreak the safety mechanisms of many state-of-the-art LLMs; for example, "How to make a Molotov cocktail?" can be rephrased as "How did people make a Molotov cocktail?"; shows that on GPT-4o the success rate of such requests rises from 1% to 88%; concludes that current alignment techniques do not always generalize as intended.  | [Paper](https://arxiv.org/abs/2407.11969), [Tweet](https://x.com/maksym_andr/status/1813608842699079750) |
| 7) **Can LLMs Retrieve and Reason over a 1M-Token Context Window?** - proposes NeedleBench, a framework of progressively harder tasks for evaluating the long-context retrieval and reasoning capabilities of LLMs; also presents the Ancestral Trace Challenge, which adds the need for complex logical reasoning common in real-world long-context tasks; results suggest that current LLMs struggle with reasoning tasks involving complex logical relationships, even with texts shorter than 2K tokens.  | [Paper](https://arxiv.org/abs/2407.11963), [Tweet](https://x.com/omarsar0/status/1813581074624070109) |
| 8) **Distilling System 2 into System 1** - explores self-supervised methods to distill high-quality outputs from System 2 techniques, then fine-tunes System 1 to match the predictions of System 2 without generating intermediate steps; distilling reasoning into System 1 significantly reduces inference cost.  | [Paper](https://arxiv.org/abs/2407.06023v1), [Tweet](https://x.com/willccbb/status/1813012865454121179) |
| 9) **Exploring Advanced LLMs with LLMSuite** - shares practical tips for using and evaluating LLMs; covered solutions include ReAct, RAG, and parameter-efficient methods.  | [Paper](https://arxiv.org/abs/2407.12036), [Tweet](https://x.com/omarsar0/status/1813980712346763589) |
| 10) **Beyond Euclid** - an illustrated guide and graphical taxonomy of recent advances in non-Euclidean machine learning.  | 
[Paper](https://www.arxiv.org/abs/2407.09468), [Tweet](https://x.com/omarsar0/status/1812927886766010653) |

## Top ML Papers of the Week (July 8 - July 14) - 2024
| **Paper**  | **Links** |
| ------------- | ------------- |
| 1) **FlashAttention-3** - adapts FlashAttention to take full advantage of modern hardware; techniques used to speed up attention on modern GPUs include producer-consumer asynchrony, interleaving block-wise matmul and softmax operations, and block quantization with incoherent processing; achieves a 1.5-2.0x speedup with FP16 on H100 GPUs, reaching up to 740 TFLOPs/s (75% utilization), and close to 1.2 PFLOPs/s with FP8. | [Paper](https://tridao.me/publications/flash3/flash3.pdf), [Tweet](https://x.com/tri_dao/status/1811453622070444071) |
| 2) **RankRAG** - a new instruction fine-tuning framework for effective context ranking and answer generation to enhance an LLM's RAG capabilities; leverages a small ranking dataset and outperforms existing expert ranking models; experiments show that Llama3-RankRAG significantly outperforms Llama3-ChatQA-1.5 and GPT-4 models on nine knowledge-intensive benchmarks. | [Paper](https://arxiv.org/abs/2407.02485v1), [Tweet](https://x.com/_weiping/status/1808551184309104896) |
| 3) **Mixture of a Million Experts** - a parameter-efficient expert retrieval mechanism that uses the product key technique for sparse retrieval from a million tiny experts; decouples computational cost from parameter count through a learned router index structure, efficiently routing to a very large number of tiny experts; shows better efficiency than dense FFW, coarse-grained MoE, and Product Key Memory (PKM) layers. | [Paper](https://arxiv.org/abs/2407.04153), [Tweet](https://x.com/omarsar0/status/1810389538340290724) |
| 4) **Reasoning in LLMs: A Geometric Perspective** - explores the reasoning of LLMs from a geometric angle; finds that a higher intrinsic dimension implies greater expressive capacity of the LLM; also reveals a connection between the expressive power of LLMs and the density of their self-attention graphs; the analysis shows that the density of these graphs defines the intrinsic dimension of the inputs to the MLP blocks. | [Paper](https://arxiv.org/abs/2407.02678), [Tweet](https://x.com/omarsar0/status/1810329294884741594) |
| 5) **Mitigating Contextual Hallucination in LLMs** - proposes a new method to detect and significantly reduce contextual hallucinations in LLMs (e.g., by 10% on the XSum summarization task); builds a hallucination detection model based on the ratio of attention weights on the context versus newly generated tokens (for each attention head); the hypothesis is that contextual hallucination is related to how much the LLM attends to the provided context; also proposes a decoding strategy based on the detector that effectively mitigates contextual hallucination; the detector transfers across models without retraining. | [Paper](https://arxiv.org/abs/2407.07071), [Tweet](https://x.com/omarsar0/status/1811072508637884750) |
| 6) **RouteLLM** - efficient router models that dynamically select between a stronger and a weaker LLM during inference to balance cost and performance; the training framework leverages human preference data and data augmentation techniques to boost performance; experiments show cost reductions of over 2x in some cases while maintaining response quality. | 
[Paper](https://arxiv.org/abs/2406.18665v2), [Tweet](https://x.com/lmsysorg/status/1807812671238258931) |
| 7) **Mixture of Experts Survey** - a survey on Mixture of Experts (MoE), covering technical details, open-source implementations, evaluation methods, and practical applications of MoE. | [Paper](https://arxiv.org/abs/2407.06204), [Tweet](https://x.com/omarsar0/status/1811127876819026283) |
| 8) **Internet of Agents** - a new framework addressing several limitations of multi-agent systems, such as integrating diverse third-party agents and adapting to dynamic task requirements; introduces an agent integration protocol, an instant-messaging-style architecture design, and dynamic mechanisms for effective collaboration among heterogeneous agents. | [Paper](https://arxiv.org/abs/2407.07061v2), [Tweet](https://x.com/_akhaliq/status/1810872693501157855) |
| 9) **3DGen** - a new end-to-end text-to-3D asset generation pipeline that completes in under one minute; the pipeline integrates state-of-the-art components such as AssetGen and TextureGen to represent 3D objects in three ways, namely v

# ML-Papers-of-the-Week Quick-Start Guide

**ML-Papers-of-the-Week** is an open-source project maintained by DAIR.AI that curates and summarizes the top machine learning (ML) papers every week. It is not software to compile or run, but a structured knowledge base and index. Developers can subscribe by email or browse the GitHub repository directly for the latest academic updates.

## Prerequisites

Since the project is essentially a document index and news digest, no complex development environment is needed.

*   **System requirements**: any operating system with a modern browser (Windows, macOS, Linux).
*   **Dependencies**:
    *   **GitHub account** (optional): to star the repository and track updates.
    *   **Email address** (optional): to subscribe to the weekly newsletter.
    *   **Network access**: to reach GitHub and the Substack mailing service.

## Installation (Subscription)

The project requires no "installation" via `pip`, `npm`, or `docker`. The best way to get the content is to subscribe to the official newsletter or bookmark the repository.

1.  **Visit the project homepage**:
    Open the project's GitHub repository page in your browser.

2.  **Subscribe to the weekly newsletter (recommended)**:
    Use the link below to subscribe to the NLP News newsletter and receive the curated paper list in your inbox every week:
    
    [Subscribe to our newsletter](https://nlpnews.substack.com/)

3.  **Watch the repository**:
    Click the **Watch** button at the top right of the GitHub repository page and choose **Custom** -> **Releases** or **All Activity** to be notified whenever the latest paper list is published.

## Basic Usage

### 1. Browse past paper lists

The repository organizes all curated papers by year and week. You can jump to a specific week's digest directly from the links in the README.

**Example structure:**
*   **2025**: weekly lists from January through June (and onward).
*   **2024**: weekly lists for the whole year.
*   **2023**: weekly lists for the whole year.

**Example:**
To view the top papers for **June 23 - June 29, 2025**, find the corresponding section in the README and follow the link:
`[Top ML Papers of the Week (June 23 - June 29)](./#top-ml-papers-of-the-week-june-23---june-29---2025)`

### 2. 
Search for specific topics

Since all papers are archived in Markdown files, you can use GitHub's file search or your browser's in-page search (`Ctrl+F` / `Cmd+F`) to find keywords of interest (e.g. "LLM", "Diffusion", "RLHF").

**Tips:**
*   Open the Markdown file for a specific week.
*   Use the browser's search to enter the technical term you care about.
*   Follow the corresponding arXiv link or project page for a deeper read.

### 3. Contributing and feedback

If you notice an important missing paper or have other suggestions, submit feedback via the GitHub **Issues** page or contribute through **Pull Requests**.

## A Typical Use Case

Li Ming, an algorithm engineer at an AI startup, is optimizing a domain-specific customer-service model. Facing the thousands of new papers posted to arXiv every week, he must quickly identify the frontier techniques that could improve inference efficiency or reduce hallucination rates, in order to keep the product competitive.

### Without ML-Papers-of-the-Week
- **Severe information overload**: two hours a day spent aimlessly scanning the latest arXiv listings, most of it irrelevant to the business, fragmenting core development time.
- **Missed breakthroughs**: without an efficient filter, it is easy to overlook "hidden champion" papers with unremarkable titles but real impact, noticing only after a competitor ships a similar feature.
- **Costly reproduction**: after blindly downloading piles of papers, many turn out to lack official code or have vague experimental details; days of failed reproduction waste R&D resources.
- **Team misalignment**: team members each follow different subfields, weekly meetings struggle to converge on the latest trends, and technology choices rest on personal experience rather than community consensus.

### With ML-Papers-of-the-Week
- **Precise digests**: ten minutes every Monday morning with the curated list is enough to lock onto the top papers on LLM inference optimization, cutting literature-survey time by over 90%.
- **Staying at the frontier**: through DAIR.AI's expert curation, he catches the latest breakthrough in sparse attention and integrates it into the beta two weeks ahead of competitors.
- **Lower trial-and-error risk**: prioritizing papers that come with high-quality open-source implementations or detailed write-ups ensures the chosen techniques are practical, greatly improving experiment success rates.
- **Shared technical vision**: the weekly digest is posted to the team channel as standing material for tech talks, building a common view of industry trends and speeding up decisions.

The core value: ML-Papers-of-the-Week acts as an efficient academic filter, freeing developers from the noise so they can focus on genuinely transformative innovations.

**Maintainer**: DAIR.AI (Democratizing Artificial Intelligence Research, Education, and Technologies) - [dair.ai](https://www.dair.ai/), [GitHub](https://github.com/dair-ai)

**Note**: This repository is not an executable AI software tool but a curated list aggregating the top ML papers each week. It consists mainly of Markdown files and links, requiring no installed dependencies, GPU, 
or special runtime environment; a browser is all that is needed to read it.

**Topics**: ai, data-science, deeplearning, machine-learning, nlp

## FAQ

**Why hasn't the latest "Top ML Papers of the Week" list been updated?**
This is an open-source project, not a full-time job. The maintainer updates it when time allows, so the cadence may be irregular. If there is a delay, please be patient or follow the maintainer on social media. ([Source](https://github.com/dair-ai/ML-Papers-of-the-Week/issues/20))

**What should I do if a week's paper summaries are missing?**
The maintainer usually backfills missing content later. For example, when users reported that the November 20-26 list was missing, it was subsequently added. Check whether the repository already has newer commits, or wait for the maintainer to catch up. ([Source](https://github.com/dair-ai/ML-Papers-of-the-Week/issues/15))

**Does the repository accept resources beyond traditional conference papers (e.g. long-form framework documents or tool guides)?**
Although the repository focuses on academic papers, resources of practical value to the community are welcome. For example, a self-contained framework PDF for LLM debugging (such as WFGY 1.0) could be considered for an "LLM debugging / self-healing frameworks" or "robustness / reliability resources" section if it is useful, openly licensed, and directly applicable to frontier models. ([Source](https://github.com/dair-ai/ML-Papers-of-the-Week/issues/38))

**How can I suggest improvements to the paper metadata format (dates and code links)?**
Open an Issue with concrete suggestions, for example standardizing paper dates to YYYY-MM-DD and checking whether a code link is attached on the arXiv page at inclusion time, to improve usability and reproducibility. ([Source](https://github.com/dair-ai/ML-Papers-of-the-Week/issues/1))

**Is there an automated mechanism to sync social media and GitHub updates?**
The community has suggested a CI/CD flow that automatically opens a GitHub PR to update the paper summaries whenever a thread is posted on Twitter/X, keeping content consistent across platforms and reducing manual maintenance. ([Source](https://github.com/dair-ai/ML-Papers-of-the-Week/issues/15))

**What if a particular month's list (e.g. October or September) is missing?**
This is likely an oversight during maintenance. As with other missing weeks, open an Issue to remind the maintainer; even if such reports do not get a detailed reply right away, they usually prompt the gap to be filled in a later update. ([Source](https://github.com/dair-ai/ML-Papers-of-the-Week/issues/19))
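The `Ctrl+F` search workflow described in the guide can also be scripted. The minimal sketch below downloads the digest README and filters its Markdown table rows by keyword; the raw-URL layout and the helper names (`fetch_readme`, `find_papers`) are illustrative assumptions, not part of the project itself.

```python
from urllib.request import urlopen

# Assumed raw-file URL layout; adjust if the default branch differs.
README_URL = ("https://raw.githubusercontent.com/"
              "dair-ai/ML-Papers-of-the-Week/main/README.md")

def fetch_readme(url: str = README_URL) -> str:
    """Download the weekly digest README as UTF-8 text."""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8")

def find_papers(keyword: str, text: str) -> list[str]:
    """Return Markdown table rows that mention `keyword`, case-insensitively.

    Rows in the digest start with '|', so other lines are skipped.
    """
    needle = keyword.lower()
    return [line for line in text.splitlines()
            if line.lstrip().startswith("|") and needle in line.lower()]

# Example (requires network access):
#     for row in find_papers("RLHF", fetch_readme()):
#         print(row)
```

Because the digest is plain Markdown, the same filtering works offline on a cloned copy of the repository as well.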