[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-microsoft--Magma":3,"tool-microsoft--Magma":61},[4,18,26,36,44,52],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",141543,2,"2026-04-06T11:32:54",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107888,"2026-04-06T11:32:50",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":10,"last_commit_at":50,"category_tags":51,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 
架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":53,"name":54,"github_repo":55,"description_zh":56,"stars":57,"difficulty_score":10,"last_commit_at":58,"category_tags":59,"status":17},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[14,15,13,60],"视频",{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":77,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":81,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":10,"env_os":98,"env_gpu":99,"env_ram":100,"env_deps":101,"category_tags":113,"github_topics":76,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":115,"updated_at":116,"faqs":117,"releases":151},4531,"microsoft\u002FMagma","Magma","[CVPR 2025] Magma: A Foundation Model for Multimodal AI Agents","Magma 是一款面向多模态 AI 智能体的基础模型，旨在打通虚拟数字世界与真实物理世界的交互壁垒。它不仅能像人类一样理解图像和视频内容，更能基于目标自主生成可视化的行动规划与具体操作指令，从而胜任复杂的代理任务。\n\n传统 AI 模型往往局限于单一的内容识别或文本生成，难以在动态环境中进行空间推理和连续决策。Magma 正是为了解决这一痛点而生，它在用户界面导航、机器人操控以及通用视听理解等任务上均达到了业界领先水平，特别是在空间理解与逻辑推理方面表现卓越。\n\nMagma 的核心技术亮点在于其可扩展的预训练策略。除了利用现有的智能体数据外，它还能直接从野外未标注的视频中学习视觉轨迹，这种独特的学习方式赋予了模型极强的泛化能力，使其能更好地适应现实世界的复杂应用。\n\n这款模型非常适合 AI 研究人员、开发者以及从事机器人学和自动化系统设计的专业人士使用。无论是希望构建能自动操作软件界面的智能助手，还是开发能在实体环境中执行任务的机器人，Magma 都提供了一个强大且灵活的基础底座，助力多模态智能体技术的落地与创新。","\u003Cdiv align=\"center\">\n\u003Ch2>🤖 Magma: A Foundation Model for Multimodal AI Agents\u003C\u002Fh2>\n\n[Jianwei Yang](https:\u002F\u002Fjwyang.github.io\u002F)\u003Csup>*\u003C\u002Fsup>\u003Csup>1\u003C\u002Fsup>\u003Csup>†\u003C\u002Fsup>&nbsp;\n[Reuben Tan](https:\u002F\u002Fcs-people.bu.edu\u002Frxtan\u002F)\u003Csup>1\u003C\u002Fsup>\u003Csup>†\u003C\u002Fsup>&nbsp;\n[Qianhui Wu](https:\u002F\u002Fqianhuiwu.github.io\u002F)\u003Csup>1\u003C\u002Fsup>\u003Csup>†\u003C\u002Fsup>&nbsp;\n[Ruijie Zheng](https:\u002F\u002Fruijiezheng.com\u002F)\u003Csup>2\u003C\u002Fsup>\u003Csup>‡\u003C\u002Fsup>&nbsp;\n[Baolin Peng](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=u1CNjgwAAAAJ&hl=en&oi=ao)\u003Csup>1\u003C\u002Fsup>\u003Csup>‡\u003C\u002Fsup>&nbsp;\n[Yongyuan Liang](https:\u002F\u002Fcheryyunl.github.io)\u003Csup>2\u003C\u002Fsup>\u003Csup>‡\u003C\u002Fsup>\n\n[Yu Gu](http:\u002F\u002Fyu-gu.me\u002F)\u003Csup>1\u003C\u002Fsup>&nbsp;\n[Mu Cai](https:\u002F\u002Fpages.cs.wisc.edu\u002F~mucai\u002F)\u003Csup>3\u003C\u002Fsup>&nbsp;\n[Seonghyeon Ye](https:\u002F\u002Fseonghyeonye.github.io\u002F)\u003Csup>4\u003C\u002Fsup>&nbsp;\n[Joel Jang](https:\u002F\u002Fjoeljang.github.io\u002F)\u003Csup>5\u003C\u002Fsup>&nbsp;\n[Yuquan 
Deng](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=LTC0Q6YAAAAJ&hl=en)\u003Csup>5\u003C\u002Fsup>&nbsp;\n[Lars Liden](https:\u002F\u002Fsites.google.com\u002Fsite\u002Flarsliden)\u003Csup>1\u003C\u002Fsup>&nbsp;\n[Jianfeng Gao](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fpeople\u002Fjfgao\u002F)\u003Csup>1\u003C\u002Fsup>\u003Csup>▽\u003C\u002Fsup>\n\n\u003Csup>1\u003C\u002Fsup> Microsoft Research; \u003Csup>2\u003C\u002Fsup> University of Maryland; \u003Csup>3\u003C\u002Fsup> University of Wisconsin-Madison  \n\u003Csup>4\u003C\u002Fsup> KAIST; \u003Csup>5\u003C\u002Fsup> University of Washington\n\n\u003Csup>*\u003C\u002Fsup> Project lead  \u003Csup>†\u003C\u002Fsup> First authors  \u003Csup>‡\u003C\u002Fsup> Second authors  \u003Csup>▽\u003C\u002Fsup> Leadership  \n\n\u003Ch3 style=\"color:#b22222;\"> CVPR 2025 \u003C\u002Fh3>\n\n\u003Ch4>\n\u003Ca href=\"https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2502.13130\">📄 arXiv Paper\u003C\u002Fa> &nbsp; \n\u003Ca href=\"https:\u002F\u002Fmicrosoft.github.io\u002FMagma\u002F\">🌐 Project Page\u003C\u002Fa> &nbsp; \n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FMagma-8B\">🤗 Hugging Face Model\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fai.azure.com\u002Fexplore\u002Fmodels\u002Fmicrosoft-magma-8b\u002Fversion\u002F1\u002Fregistry\u002FHuggingFace?tid=72f988bf-86f1-41af-91ab-2d7cd011db47\">☁️ Azure AI Foundry\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=SbfzvUU5yM8\">📺 Video\u003C\u002Fa>\n\u003C\u002Fh4>\n\n\u003C!-- \u003Ch3>\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FMagma-UI\">🤗 Gradio UI Agent\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FMagma-Gaming\">🤗 Gradio Gaming Agent\u003C\u002Fa>\n\u003C\u002Fh3> -->\n\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\u003Cp2>The Path Towards Multimodal AI Agents\u003C\u002Fp2>\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_Magma_readme_d7a057548da8.png\" width=\"100%\">\n\u003C\u002Fdiv>\n\u003C\u002Fdiv>\n\n## :sparkles: Highlights\n* **Digital and Physical Worlds:** Magma is the first-ever foundation model for multimodal AI agents, designed to handle complex interactions across both virtual and real environments!\n* **Versatile Capabilities:** Magma as a single model not only possesses generic image and videos understanding ability, but also generate goal-driven visual plans and actions, making it versatile for different agentic tasks!\n* **State-of-the-art Performance:** Magma achieves state-of-the-art performance on various multimodal tasks, including UI navigation, robotics manipulation, as well as generic image and video understanding, in particular the spatial understanding and reasoning!\n* **Scalable Pretraining Strategy:** Magma is designed to be **learned scalably from unlabeled videos** in the wild in addition to the existing agentic data, making it strong generalization ability and suitable for real-world applications!\n\n## :fire: News\n* **[2025.04.29]** [Mind2Web](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMagmaAI\u002FMagma-Mind2Web-SoM) and [AITW](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMagmaAI\u002FMagma-AITW-SoM) with SoM prompting annotations are released on hugging face! 
We used them for our Magma downstream finetuning and reported the results in our table.\n* **[2025.04.12]** 🔥We released the pretraining videos with visual traces on hugging face [Magma-Video-ToM](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMagmaAI\u002FMagma-Video-ToM).\n* **[2025.04.06]** Open X-Embodiment pretraining data with visual traces can be downloaded from [Magma-OXE-ToM](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMagmaAI\u002FMagma-OXE-ToM).\n* **[2025.03.16]** We released the demo code for generating SoM and ToM for instructional videos (i.e., Alg. 2 in our paper) in [SoM\u002FToM Generation](#som-and-tom-generation).\n* **[2025.03.09]** 🔥 We released Magma training code, and an exampler for training Magma-8B on Magma-820K dataset. Check out the [Model Training](#model-training)\n* **[2025.03.06]** We released a new demo for showing robot planning capabilities. Run `python agents\u002Frobot_traj\u002Fapp.py` to start the demo!\n* **[2025.02.28]** We released two demos, [Magma-UI](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FMagma-UI) and [Magma-Gaming](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FMagma-Gaming) on Hugging Face. Check out our model's action grounding and planning capabilities!\n* **[2025.02.26]** ⭐ Exciting News! Magma got accepted by CVPR 2025!\n* **[2025.02.25]** 🎉 Big News! We are releasing the Magma model on [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FMagma-8B) and [Azure AI Foundry](https:\u002F\u002Fai.azure.com\u002Fexplore\u002Fmodels\u002Fmicrosoft-magma-8b\u002Fversion\u002F1\u002Fregistry\u002FHuggingFace?tid=72f988bf-86f1-41af-91ab-2d7cd011db47)!\n* **[2025.02.23]**  We released the Magma Inference code!\n* **[2025.02.20]**  Magma has reached the top spot on [Hacker News](https:\u002F\u002Fnews.ycombinator.com\u002Ffront)!\n* **[2025.02.19]**  We will be releasing our code, model and UI navigation demo by [MSR Forum on 02.25 next Tuesday](https:\u002F\u002Fresearchforum.microsoft.com\u002F)!\n* **[2025.02.18]**  Our Flagship Project Magma at MSR is released on [arXiv](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2502.13130)!\n\n## :bookmark_tabs: Todos\nWe will be releasing all the following contents:\n- [x] Model inference code\n- [x] Add UI and Gaming agent Demos\n- [x] Model checkpoint\n- [x] Training code\n- [x] Open-XE pretraining data with traces\n- [x] Video pretraining data with traces\n- [ ] SeeClick and Vision2UI pretraining data with SoM\n- [ ] UI\u002FLibero finetuning script\n- [ ] Video finetune script\n\n## :clipboard: Outline\n- [What is Magma?](#what-is-magma)\n- [How we pretrain Magma?](#how-we-pretrain-magma)\n- [Installation](#installation)\n- [Data Preprocessing](#data-preprocessing)\n  - [SoM and ToM Generation](#som-and-tom-generation)\n- [Model Training](#model-training)\n  - [Pretraining on Open-X without SoM\u002FToM](#pretraining-on-open-x-without-somtom)\n  - [Finetuning on Magma-820K](#finetuning-on-magma-820k)\n- [Model Usage](#model-usage)\n  - [Inference](#inference)\n    - [Inference with Huggingface Transformers](#inference-with-huggingface-transformers)\n    - [Inference with local code from this repo](#inference-with-local-code-from-this-repo)\n    - [Inference with bitsandbytes](#inference-with-bitsandbytes)\n    - [Benchmarking](#benchmarking)\n  - [Evaluation with lmms-eval](#evaluation-with-lmms-eval)\n  - [Evaluation with SimplerEnv](#evaluation-with-simplerenv)\n  - [Multi-images or Video](#multi-images-or-video)\n  - [API 
Server](#api-server) \n  - [Agent Demos](#agent-demos)\n      - [UI Agent](#ui-agent)\n      - [Gaming Agent](#gaming-agent)\n      - [Robot Visual Planning](#robot-visual-planning)\n- [Citation](#citation)\n- [Acknowledgements](#acknowledgements)\n\n## What is Magma?\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_Magma_readme_c773bee8dbd2.png\" width=\"50%\">\n\u003C\u002Fdiv>\n\n**Magma is a foundation model for multimodal AI agents**. As the bedrock for multimodal agentic models, it should possess strong capabilities to perceive the multimodal world AND takes goal-driven actions precisely (see above figure). With this in mind, we are striving for the following goals:\n\n* **Verbal and spatial-temporal intelligence:** Magma is supposed to have both strong verbal and spatial-temporal intelligence to understand images and videos, ground its actions on the observations, and further translate the external goal into action plan and executions.\n* **Digital and physical world:** Magma should not be limited to either the digital world (e.g., web navigation) or the physical world (e.g., robotics manipulation), but rather be able to work across both worlds, just like humans ourselves.\n\nWith this in mind, we developed a new pretraining data, which mostly consists of unlabeled videos in the wild plus the existing annotated agentic data, and a new pretraining framework, which unifies the training of all three modalities (text, image, and action), to train a new foundation model for multimodal AI agents, named Magma.\n\n## How we pretrain Magma?\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_Magma_readme_c4e5b5999469.png\" width=\"100%\">\n\u003C\u002Fdiv>\n\nWe pursue the goal through two dimensions:\n\n* **Large-scale heterogeneous training data**: we curate a large amount of data in the wild, including existing multimodal understanding data, UI navigation data, and robotics manipulation data, and unlabeled videos in the wild. We also propose a new data collection pipeline to collect unlabeled videos in the wild, which is scalable and cost-effective. To attain useful action supervision from raw videos and robotics trajectories, we meticulously removed the camera motions in the videos and then transform the motions into \"action\" supervisions for our model training. These provide unique signals for the model to learn the cross-modal connections and long-horizon action prediction and planning.\n\n* **Universal pretraining objectives**: texts and actions are inherently different and thus cause a huge gap, while visual tokens are continuous. We propose a universal pretraining framework that unifies the training of all three modalities, and we show that this is crucial for the model to learn the cross-modal connections. More specifically, we proposed Set-of-Mark and Trace-of-Mark as the auxiliary tasks for our model pretraining, as the bridge of different output modalities. In this way, we are building a great alignment between the text and action modalities, and also between the image and action modalities.\n\n## Installation\n\n1. Clone this repo to your local machine:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMagma\ncd Magma\n```\n2. Install the dependencies:\n\n```bash\nconda create -n magma python=3.10 -y\nconda activate magma\npip install --upgrade pip\npip install -e .\n```\n\n3. 
Install packages for training:\n\n```bash\npip install -e \".[train]\"\n```\n\n4. Install packages for agents:\n\n```bash\npip install -e \".[agent]\"\n```\n\n5. Other probably needed packages:\n\n* Co-tracker\n```sh\n# Install co-tracker\ngit clone https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fco-tracker\ncd co-tracker\npip install -e .\npip install imageio[ffmpeg]\ncd ..\u002F\n```\n\n* Kmeans\n```sh\n# Install kmeans_pytorch, note: install with pip will leads to error\ngit clone https:\u002F\u002Fgithub.com\u002Fsubhadarship\u002Fkmeans_pytorch\ncd kmeans_pytorch\npip install -e .\ncd ..\u002F\n```\n\n* Misc\n```sh\n# Install others packages\npip install ipython\npip install faiss-cpu\npip install decord\n```\n\n⚠️ Please make sure you have installed the transformers with correct version (>=4.49.0). If you see some abnormal behavior, please check the version of transformers, and probably see below for the customized transformers.\n\n\u003Cdetails>\n\u003Csummary>Click to expand\u003C\u002Fsummary>\n\n### Customized Transformers\n\n⚠️ One important thing to note is that our model uses [ConvNext](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpytorch-image-models\u002Fblob\u002Fmain\u002Ftimm\u002Fmodels\u002Fconvnext.py) as the backbone, which contains a layer scaler parameter [gamma](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpytorch-image-models\u002Fblob\u002Fe44f14d7d2f557b9f3add82ee4f1ed2beefbb30d\u002Ftimm\u002Fmodels\u002Fconvnext.py#L144). This leads to a bug of Transformers library as it automatically replace the 'gamma' with 'weight' when loading the model. To fix this, we need to modify the 'transformers\u002Fmodels\u002Fauto\u002Fmodeling_auto.py' file as follows:\n\n```python \nif \"gamma\" in key and \"clip_vision_model\" not in key:\n    key = key.replace(\"gamma\", \"weight\")\n```\nThis bug still exists in the latest transformer version. So please make sure you install the following bug-free customized version of transformers as lised in [pyproject.toml](.\u002Fpyproject.toml):\n\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fjwyang\u002Ftransformers.git@dev\u002Fjwyang-v4.44.1\n```\n\nor the newest version:\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fjwyang\u002Ftransformers.git@dev\u002Fjwyang-v4.48.2\n```\n\n\u003C\u002Fdetails>\n\n## Data Preprocessing\n\n### SoM and ToM Generation\n\nAs shown in Table 1 of our paper, we apply SoM and ToM on both robotics data and instructional videos. To ensure reproducibility, we provide the code to generate SoM and ToM for instructional videos. The code is located in `tools\u002Fsom_tom\u002Fdemo.py`. You can run the following command to generate SoM and ToM for the robotics data:\n\n```bash\npython tools\u002Fsom_tom\u002Fdemo.py\n```\n\nAnd then you can find two videos in the `tools\u002Fsom_tom\u002Fvideos` folder. The original trace extracted from CoTracker is shown in `orig_trace.mp4`, and the SoM-ToM video is named `som_tom.mp4`.\n\n## Model Training\n\nWe provide the instructions to pretrain LLama-3-8B-Instruct on Open-X-Embodiment and finetune Magma-8B on different downstream tasks.\n\n### Pretraining on Open-X without SoM\u002FToM\n\n* Data Preparation\n\nDownload Open-X-Embodiment from the official site. Then edit the data config file [openx.yaml](data_configs\u002Fopenx.yaml) accordingly. 
The data config file should look like this:\n\n```yaml\n# a list of all the data paths\nDATA_PATH: \n  - \"\u002Fpath\u002Fto\u002Fopen-x\"\nIMAGE_FOLDER:\n  - \"siglip-224px+mx-oxe-magic-soup\"    \nLANGUAGE_PATH:\n  - \"\"\n```\n\n* Pretrain on OpenX\n\nOnce set up the dataset and config, you can run the following command to finetune the model:\n\n```bash\nsh scripts\u002Fpretrain\u002Fpretrain_openx.sh\n```\n* Benefit: We spent tremendous effort to decouple the Open-X dataloader from OpenVLA and make it compatible with other datasets used in our experiments*\n\n### Finetuning on Magma-820K\n\n* Data Preparation\n\nDownload annotation file from [MagmaAI\u002FMagma-820K](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMagmaAI\u002FMagma-820K). Please prepare the image data according to the dataset list in the dataset page. Once finished, please edit [magma_820k.yaml](data_configs\u002Fmagma_820k.yaml) file accordingly.\n\n```yaml\n# a list of all the data paths\nDATA_PATH: \n  - \"\u002Fpath\u002Fto\u002Fmagma_820k.json\"\nIMAGE_FOLDER:\n  - \"\u002Froot\u002Fto\u002Fmagma_820k\u002Fimages\"\n```\n\n* Finetune from Magma-8B\n\nOnce set up the dataset and config, you can run the following command to finetune the model:\n\n```bash\nsh scripts\u002Ffinetune\u002Ffinetune_magma_820k.sh\n```\n\n## Model Usage\n\n### Inference\n\n#### Inference with Huggingface Transformers\n\nWe have uploaded the model to Huggingface Hub. You can easily load the model and processor with the following code.\n\n\u003Cdetails>\n\u003Csummary>Click to expand\u003C\u002Fsummary>\n\n```python\nfrom PIL import Image\nimport torch\nfrom transformers import AutoModelForCausalLM\nfrom transformers import AutoProcessor \n\ndtype = torch.bfloat16\nmodel = AutoModelForCausalLM.from_pretrained(\"microsoft\u002FMagma-8B\", trust_remote_code=True, torch_dtype=dtype)\nprocessor = AutoProcessor.from_pretrained(\"microsoft\u002FMagma-8B\", trust_remote_code=True)\nmodel.to(\"cuda\")\n\n# Inference\nimage = Image.open(\".\u002Fassets\u002Fimages\u002Fmagma_logo.jpg\").convert(\"RGB\")\n\nconvs = [\n    {\"role\": \"system\", \"content\": \"You are agent that can see, talk and act.\"},            \n    {\"role\": \"user\", \"content\": \"\u003Cimage_start>\u003Cimage>\u003Cimage_end>\\nWhat is the letter on the robot?\"},\n]\nprompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)\ninputs = processor(images=[image], texts=prompt, return_tensors=\"pt\")\ninputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)\ninputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)\ninputs = inputs.to(\"cuda\").to(dtype)\n\ngeneration_args = { \n    \"max_new_tokens\": 500, \n    \"temperature\": 0.0, \n    \"do_sample\": False, \n    \"use_cache\": True,\n    \"num_beams\": 1,\n} \n\nwith torch.inference_mode():\n    generate_ids = model.generate(**inputs, **generation_args)\n\ngenerate_ids = generate_ids[:, inputs[\"input_ids\"].shape[-1] :]\nresponse = processor.decode(generate_ids[0], skip_special_tokens=True).strip()\n\nprint(response)\n```\n\u003C\u002Fdetails>\n\n#### Inference with local Transformers code from this repo\n\nIf you want to debug our model, we also provide a local code for inference. 
You can run the following code to load the model.\n\u003Cdetails>\n\u003Csummary>Click to expand\u003C\u002Fsummary>\n\n```python\nfrom magma.processing_magma import MagmaProcessor\nfrom magma.modeling_magma import MagmaForCausalLM\n\ndtype = torch.bfloat16\nmodel = MagmaForCausalLM.from_pretrained(\"microsoft\u002FMagma-8B\", trust_remote_code=True, torch_dtype=dtype)\nprocessor = MagmaProcessor.from_pretrained(\"microsoft\u002FMagma-8B\", trust_remote_code=True)\nmodel.to(\"cuda\")\n```\n\u003C\u002Fdetails>\n\n#### Inference with bitsandbytes\n\nWe also provide a sample code to inference with bitsandbytes. You can run the following code to load the model.\n\n\u003Cdetails>\n\u003Csummary>Click to expand\u003C\u002Fsummary>\n\n```python\nfrom PIL import Image\nimport torch\nfrom transformers import AutoModelForCausalLM\nfrom transformers import AutoProcessor \nfrom transformers import BitsAndBytesConfig\n\n# Define quantization configuration\nquantization_config = BitsAndBytesConfig(\n    load_in_4bit=True,\n    bnb_4bit_compute_dtype=torch.float16,\n    bnb_4bit_use_double_quant=True,\n    bnb_4bit_quant_type=\"nf4\"\n)\n\n# Load model with quantization config\nmodel = AutoModelForCausalLM.from_pretrained(\n    \"microsoft\u002FMagma-8B\", \n    trust_remote_code=True,\n    device_map={\"\": 0},  # force everything onto GPU 0\n    quantization_config=quantization_config\n)\nprocessor = AutoProcessor.from_pretrained(\"microsoft\u002FMagma-8B\", trust_remote_code=True)\n\n# Inference\nimage = Image.open(\"assets\u002Fimages\u002Fmagma_logo.jpg\").convert(\"RGB\")\n\nconvs = [\n    {\"role\": \"system\", \"content\": \"You are agent that can see, talk and act.\"},            \n    {\"role\": \"user\", \"content\": \"\u003Cimage_start>\u003Cimage>\u003Cimage_end>\\nWhat is the letter on the robot?\"},\n]\nprompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)\ninputs = processor(images=[image], texts=prompt, return_tensors=\"pt\")\ninputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)\ninputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)\n\n# Convert inputs to the correct device and data type\ninputs = {k: v.to(device=model.device, dtype=torch.float16 if v.dtype == torch.float32 else v.dtype) \n          for k, v in inputs.items()}\n\ngeneration_args = { \n    \"max_new_tokens\": 500, \n    \"temperature\": 0.0, \n    \"do_sample\": False, \n    \"use_cache\": True,\n    \"num_beams\": 1,\n} \n\nwith torch.inference_mode():\n    generate_ids = model.generate(**inputs, **generation_args)\n\ngenerate_ids = generate_ids[:, inputs[\"input_ids\"].shape[-1] :]\nresponse = processor.decode(generate_ids[0], skip_special_tokens=True).strip()\nprint(response)\n```\n\u003C\u002Fdetails>\n\n#### Benchmarking\n\nWe benchmark the inference time and memory usage of our model with and without bitsandbytes.\n\n| Model | Inference Time | Peak Memory Usage |\n|-------|----------------|--------------|\n| Magma-8B (bfloat16) | 1.1s | 17GB |\n| Magma-8B (4-bit) | 1.1s | 7GB |\n\n### Evaluation with lmms-eval\n\nPlease refer to [lmms-eval-instruction](tools\u002Flmms-eval-magma) for the detailed instructions to run the evaluation with lmms-eval toolkit.\n\nOnce everything is ready, you can run the following code to evaluate our model from the root folder.\n\n```bash\nsh scripts\u002Fevaluation\u002Flmms-eval\u002Flmms_eval_magma.sh\n```\n\nYou can evaluate other benchmarks by modifying the variable, eval_tasks. 
The list of `eval_tasks` can be found after running below code.\n```\n# lmms-eval --tasks {list_groups,list_subtasks,list_tags,list}\nlmms-eval --tasks list_groups\n```\n\n### Evaluation with SimplerEnv\n\nPlease refer to [SimplerEnv-instruction](tools\u002Fsimplerenv-magma) for the detailed instructions to run the evaluation with SimplerEnv toolkit.\n\nOnce everything is ready, you can run the following code to evaluate our model.\n\n```bash\nsh scripts\u002Fevaluation\u002Fsimplerenv\u002Fbridge.sh\n```\n\n### Multi-images or Video Support\n\nHandle multiple images is extremely simple for our model. You just simply duplicate the placeholder in your text prompt, and correspondingly add all images into the list. A dummy example is as follows:\n\n```py\nconvs = [\n    {\"role\": \"system\", \"content\": \"You are agent that can see, talk and act.\"},            \n    {\"role\": \"user\", \"content\": \"\u003Cimage_start>\u003Cimage>\u003Cimage_end>\\n\u003Cimage_start>\u003Cimage>\u003Cimage_end>\\n\u003Cimage_start>\u003Cimage>\u003Cimage_end>\\nWhat is the letter on the robot?\"},\n]\nprompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)\ninputs = processor(images=[image1,image2,image3], texts=prompt, return_tensors=\"pt\")\n```\nOur model will handle the visual token filling for you!\n\n### API Server\n\nWe provide a FastAPI server for deploying Magma as a REST API service, which enables:\n- Vision and language processing via REST endpoints\n- Action prediction for robotics applications\n- Support for both base64-encoded images and file uploads\n\nThe server can be deployed in three ways:\n1. **Run directly**: Simplest option for development\n2. **Docker container**: Recommended for production\n3. **Native system service**: For system integration\n\n#### Quick Start\n\n```bash\ncd server\n.\u002Fmagma-server.sh run\n```\n\nThis will set up a conda environment, install dependencies, and start the server on port 8080.\n\n#### Docker Deployment\n\n```bash\ncd server\n.\u002Fmagma-server.sh docker up\n```\n\n#### API Endpoints\n\nThe API exposes the following endpoints:\n- `GET \u002Fhealth` - Check if the server is running and model is loaded\n- `POST \u002Fpredict` - Predict using base64-encoded image\n- `POST \u002Fpredict_from_file` - Predict using uploaded image file\n\nFor more details, see the [Server README](server\u002FREADME.md).\n\n### Agent Demos\n\n#### UI Agent\n\nWe built agent models for our model. The first one we built is UI Agent Demo. As our model is pretrained with Set-of-Mark and Trace-of-Mark, it is naturally synergic to [OmniParser](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FOmniParser). Combining them together, you can immediately get an UI agent, run:\n\n```bash\npython agents\u002Fui_agent\u002Fapp.py\n```\n\nMore importantly, as our Magma model not only has the action-grounding ability, but also multimodal understanding and reasoning ability. You can not only ask the model predict where to click with text:\n\n```bash\nGo to the top ranked post\n```\n\nBut also ask free question on the fly! Simply add a prefix \"Q:\" at the beginning of text prompt, e.g.,\n\n```bash\nQ: What is the title of the post?\n```\n\n#### Gaming Agent\n\nWe also built a gaming agent demo. You can run the following command to start the demo:\n\n```bash\npython agents\u002Fgaming_agent\u002Fapp.py\n```\n\nOnce the demo is run, you can see a robot proactively collecting the green blocks. 
\n\n\u003C!-- Below are the comparison between Magma and other counterparts VLMs:\n\n\u003Cdiv align=\"center\">\n\u003Cvideo width=\"48%\" controls autoplay>\n    \u003Csource src=\"https:\u002F\u002Fmicrosoft.github.io\u002FMagma\u002Fstatic\u002Fvideos\u002Fmagma_vs_llava.mp4\" type=\"video\u002Fmp4\">\n    \u003Cp>Magma v.s. LLaVA-OneVision.\u003C\u002Fp>\n\u003C\u002Fvideo>\n\u003Cvideo width=\"48%\" controls autoplay>\n    \u003Csource src=\"https:\u002F\u002Fmicrosoft.github.io\u002FMagma\u002Fstatic\u002Fvideos\u002Fmagma_vs_qwen.mp4\" type=\"video\u002Fmp4\">\n    \u003Cp>Magma v.s. Qwen-2.0.\u003C\u002Fp>\n\u003C\u002Fvideo>\n\u003C\u002Fdiv> -->\n\n#### Robot Visual Planning\n\nWe also built a robot visual planning demo. You can run the following command to start the demo:\n\n```bash\npython agents\u002Frobot_traj\u002Fapp.py\n```\n\nFor this demo, you may encounter an error as discussed in this [issue](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMagma\u002Fissues\u002F43), a quick fix is running the following command:\n\n```sh\npip install imageio[ffmpeg]\n```\n\nIf it still does not work, please install the older version of transformers:\n```sh\npip install git+https:\u002F\u002Fgithub.com\u002Fjwyang\u002Ftransformers.git@dev\u002Fjwyang-v4.44.1\n```\n\n\u003C!-- Some example outputs:\n\n\u003Cdiv align=\"center\">\n\u003Cvideo width=\"48%\" controls autoplay>\n    \u003Csource src=\"assets\u002Fvideos\u002Frobot_pick_up_chip_bag.mp4\" type=\"video\u002Fmp4\">\n    \u003Cp>Task: Pick up chip bag.\u003C\u002Fp>\n\u003C\u002Fvideo>\n\u003Cvideo width=\"48%\" controls autoplay>\n    \u003Csource src=\"assets\u002Fvideos\u002Frobot_push_chip_bag_to_left_edge_of_table.mp4\" type=\"video\u002Fmp4\">\n    \u003Cp>Task: Push chip bag to left edge of the table.\u003C\u002Fp>\n\u003C\u002Fvideo>\n\u003C\u002Fdiv> -->\n\n## User Guidance\n\n\u003C!-- This section is for the model use without fine-tuning or plugging into a larger ecosystem\u002Fapp. -->\n### Direct use\n\nThis model is intended for broad research use in English. The model take images and text as inputs, and produces the textual outputs for the following uses:\n\n* **Image\u002FVideo-Conditioned Text Generation:** The model can generate text (e.g., descriptions, answers) based on the input text and image.\n\n* **Visual Planning Capabilities:** The model can also produce the visual trace as the future planning to accomplish a task (e.g., move object from one place to another).\n\n* **Agentic Capabilities:** The model can also generate UI grounding (e.g., click ``search'' button) and robotics manipulations (e.g., 7 DoF for the robot gripper).\n\nOur model is designed only for research purpose and aimed at knowledge-sharing and accelerating research in multimodal AI, in particularly the mutimodal agentic AI.\n\n### Downstream Use\n\nThe model can be further finetuned for different downstream tasks, such as:\n\n* **Image Captioning and QA:** We can further finetune this model for image captioning and QA tasks under the pipeline of multimodal LLMs. Based on our experiments, the model can achieve competitive performance yet better spatial understanding and reasoning on these tasks.\n\n* **Video Captioning and QA:** We can further finetune this model for video captioning and QA tasks under the pipeline of multimodal LLMs. 
Based on our experiments, the model can achieve competitive performance yet better temporal understanding and reasoning on these tasks.\n\n* **UI Navigation:** We can finetune this model for specific UI navigation tasks, such as web navigation or mobile navigation. The model can achieve superior performance on these tasks.\n\n* **Robotics Manipulation:** Our model can be further finetuned for robotics tasks given its general agentic capabilities as a vision-language-action model. After finetuning, our model significantly outperforms the state-of-the-art models such as OpenVLA on robotics manipulation tasks.\n\n## Bias, Risks, and Limitations\n\nPlease note that this model is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models as they select use cases, and evaluate and mitigate for accuracy, safety, and fairness before using within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including privacy, trade compliance laws, etc.) that are relevant to their use case.\n\n\n## Citation\nIf you use this model in your research, please consider citing:\n\n```bibtex\n@misc{yang2025magmafoundationmodelmultimodal,\n      title={Magma: A Foundation Model for Multimodal AI Agents}, \n      author={Jianwei Yang and Reuben Tan and Qianhui Wu and Ruijie Zheng and Baolin Peng and Yongyuan Liang and Yu Gu and Mu Cai and Seonghyeon Ye and Joel Jang and Yuquan Deng and Lars Liden and Jianfeng Gao},\n      year={2025},\n      eprint={2502.13130},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13130}, \n}\n```\n\n## Acknowledgements\n\nOur work is supported by Microsoft Research. We thank all the contributors for their efforts in building this project. \n\nOur work is built on top of some amazing open-source projects, including [Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers), [LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA), [OpenVLA](https:\u002F\u002Fgithub.com\u002Fopenvla\u002Fopenvla), [SeeClick](https:\u002F\u002Fgithub.com\u002Fnjucckevin\u002FSeeClick), [Mind2Web](https:\u002F\u002Fgithub.com\u002FOSU-NLP-Group\u002FMind2Web), and also a number of awesome open-source datasets, including [Ego4d](https:\u002F\u002Fego4d-data.org\u002F), [Epic-Kitchen](https:\u002F\u002Fepic-kitchens.github.io\u002F2025), [Something-Somethingv2](https:\u002F\u002Fwww.qualcomm.com\u002Fdeveloper\u002Fartificial-intelligence\u002Fdatasets), [Open-X-Embodiment](https:\u002F\u002Frobotics-transformer-x.github.io\u002F), and a number of evaluation benchmarks, including [SimplerEnv](https:\u002F\u002Fgithub.com\u002Fsimpler-env\u002FSimplerEnv), [Libero](https:\u002F\u002Fgithub.com\u002FLifelong-Robot-Learning\u002FLIBERO).\n\n## License\n\nThis project is licensed under the MIT License. See the [LICENSE](LICENSE) file for details.\n\n## Contributing\n\nThis project welcomes contributions and suggestions.  Most contributions require you to agree to a\nContributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us\nthe rights to use your contribution. For details, visit https:\u002F\u002Fcla.opensource.microsoft.com.\n\nWhen you submit a pull request, a CLA bot will automatically determine whether you need to provide\na CLA and decorate the PR appropriately (e.g., status check, comment). 
Simply follow the instructions\nprovided by the bot. You will only need to do this once across all repos using our CLA.\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https:\u002F\u002Fopensource.microsoft.com\u002Fcodeofconduct\u002F).\nFor more information see the [Code of Conduct FAQ](https:\u002F\u002Fopensource.microsoft.com\u002Fcodeofconduct\u002Ffaq\u002F) or\ncontact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n## Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft \ntrademarks or logos is subject to and must follow \n[Microsoft's Trademark & Brand Guidelines](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Flegal\u002Fintellectualproperty\u002Ftrademarks\u002Fusage\u002Fgeneral).\nUse of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship.\nAny use of third-party trademarks or logos are subject to those third-party's policies.\n","\u003Cdiv align=\"center\">\n\u003Ch2>🤖 Magma：多模态AI智能体的基础模型\u003C\u002Fh2>\n\n[Jianwei Yang](https:\u002F\u002Fjwyang.github.io\u002F)\u003Csup>*\u003C\u002Fsup>\u003Csup>1\u003C\u002Fsup>\u003Csup>†\u003C\u002Fsup>&nbsp;\n[Reuben Tan](https:\u002F\u002Fcs-people.bu.edu\u002Frxtan\u002F)\u003Csup>1\u003C\u002Fsup>\u003Csup>†\u003C\u002Fsup>&nbsp;\n[Qianhui Wu](https:\u002F\u002Fqianhuiwu.github.io\u002F)\u003Csup>1\u003C\u002Fsup>\u003Csup>†\u003C\u002Fsup>&nbsp;\n[Ruijie Zheng](https:\u002F\u002Fruijiezheng.com\u002F)\u003Csup>2\u003C\u002Fsup>\u003Csup>‡\u003C\u002Fsup>&nbsp;\n[Baolin Peng](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=u1CNjgwAAAAJ&hl=en&oi=ao)\u003Csup>1\u003C\u002Fsup>\u003Csup>‡\u003C\u002Fsup>&nbsp;\n[Yongyuan Liang](https:\u002F\u002Fcheryyunl.github.io)\u003Csup>2\u003C\u002Fsup>\u003Csup>‡\u003C\u002Fsup>\n\n[Yu Gu](http:\u002F\u002Fyu-gu.me\u002F)\u003Csup>1\u003C\u002Fsup>&nbsp;\n[Mu Cai](https:\u002F\u002Fpages.cs.wisc.edu\u002F~mucai\u002F)\u003Csup>3\u003C\u002Fsup>&nbsp;\n[Seonghyeon Ye](https:\u002F\u002Fseonghyeonye.github.io\u002F)\u003Csup>4\u003C\u002Fsup>&nbsp;\n[Joel Jang](https:\u002F\u002Fjoeljang.github.io\u002F)\u003Csup>5\u003C\u002Fsup>&nbsp;\n[Yuquan Deng](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=LTC0Q6YAAAAJ&hl=en)\u003Csup>5\u003C\u002Fsup>&nbsp;\n[Lars Liden](https:\u002F\u002Fsites.google.com\u002Fsite\u002Flarsliden)\u003Csup>1\u003C\u002Fsup>&nbsp;\n[Jianfeng Gao](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fpeople\u002Fjfgao\u002F)\u003Csup>1\u003C\u002Fsup>\u003Csup>▽\u003C\u002Fsup>\n\n\u003Csup>1\u003C\u002Fsup> 微软研究院；\u003Csup>2\u003C\u002Fsup> 马里兰大学；\u003Csup>3\u003C\u002Fsup> 威斯康星大学麦迪逊分校  \n\u003Csup>4\u003C\u002Fsup> KAIST；\u003Csup>5\u003C\u002Fsup> 华盛顿大学\n\n\u003Csup>*\u003C\u002Fsup> 项目负责人  \u003Csup>†\u003C\u002Fsup> 第一作者  \u003Csup>‡\u003C\u002Fsup> 第二作者  \u003Csup>▽\u003C\u002Fsup> 领导层  \n\n\u003Ch3 style=\"color:#b22222;\"> CVPR 2025 \u003C\u002Fh3>\n\n\u003Ch4>\n\u003Ca href=\"https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2502.13130\">📄 arXiv论文\u003C\u002Fa> &nbsp; \n\u003Ca href=\"https:\u002F\u002Fmicrosoft.github.io\u002FMagma\u002F\">🌐 项目主页\u003C\u002Fa> &nbsp; \n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FMagma-8B\">🤗 Hugging Face模型\u003C\u002Fa>\n\u003Ca 
href=\"https:\u002F\u002Fai.azure.com\u002Fexplore\u002Fmodels\u002Fmicrosoft-magma-8b\u002Fversion\u002F1\u002Fregistry\u002FHuggingFace?tid=72f988bf-86f1-41af-91ab-2d7cd011db47\">☁️ Azure AI Foundry\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=SbfzvUU5yM8\">📺 视频\u003C\u002Fa>\n\u003C\u002Fh4>\n\n\u003C!-- \u003Ch3>\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FMagma-UI\">🤗 Gradio UI代理\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FMagma-Gaming\">🤗 Gradio游戏代理\u003C\u002Fa>\n\u003C\u002Fh3> -->\n\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\u003Cp2>迈向多模态AI智能体之路\u003C\u002Fp2>\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_Magma_readme_d7a057548da8.png\" width=\"100%\">\n\u003C\u002Fdiv>\n\u003C\u002Fdiv>\n\n## :sparkles: 亮点\n* **数字与物理世界：** Magma是首个面向多模态AI智能体的基础模型，专为处理虚拟与现实环境中的复杂交互而设计！\n* **多功能能力：** Magma作为单一模型，不仅具备通用的图像和视频理解能力，还能生成目标驱动的视觉规划与行动，使其在各类智能体任务中表现出色！\n* **最先进性能：** Magma在多项多模态任务上取得了最先进的性能，包括UI导航、机器人操作以及通用图像和视频理解，尤其是在空间理解和推理方面！\n* **可扩展的预训练策略：** Magma的设计使其能够**从海量无标注视频中进行可扩展学习**，同时结合现有的智能体数据，从而具备强大的泛化能力，非常适合实际应用！\n\n## :fire: 最新消息\n* **[2025.04.29]** [Mind2Web](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMagmaAI\u002FMagma-Mind2Web-SoM) 和 [AITW](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMagmaAI\u002FMagma-AITW-SoM) 数据集，附带SoM提示标注，已在Hugging Face上发布！我们使用这些数据对Magma进行了下游微调，并在表格中报告了结果。\n* **[2025.04.12]** 🔥 我们在Hugging Face上发布了带有视觉轨迹的预训练视频 [Magma-Video-ToM](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMagmaAI\u002FMagma-Video-ToM)。\n* **[2025.04.06]** 可以从 [Magma-OXE-ToM](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMagmaAI\u002FMagma-OXE-ToM) 下载带有视觉轨迹的Open X-Embodiment预训练数据。\n* **[2025.03.16]** 我们在 [SoM\u002FToM Generation](#som-and-tom-generation) 中发布了用于生成教学视频的SoM和ToM的示例代码（即我们论文中的算法2）。\n* **[2025.03.09]** 🔥 我们发布了Magma的训练代码，并提供了一个在Magma-820K数据集上训练Magma-8B的示例。请查看 [Model Training](#model-training)。\n* **[2025.03.06]** 我们发布了一个新的演示，展示了机器人的规划能力。运行 `python agents\u002Frobot_traj\u002Fapp.py` 即可启动演示！\n* **[2025.02.28]** 我们在Hugging Face上发布了两个演示：[Magma-UI](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FMagma-UI) 和 [Magma-Gaming](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fmicrosoft\u002FMagma-Gaming)。快来体验我们的模型在动作接地和规划方面的能力吧！\n* **[2025.02.26]** ⭐ 激动人心的消息！Magma已被CVPR 2025接收！\n* **[2025.02.25]** 🎉 大新闻！我们将在 [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FMagma-8B) 和 [Azure AI Foundry](https:\u002F\u002Fai.azure.com\u002Fexplore\u002Fmodels\u002Fmicrosoft-magma-8b\u002Fversion\u002F1\u002Fregistry\u002FHuggingFace?tid=72f988bf-86f1-41af-91ab-2d7cd011db47) 上发布Magma模型！\n* **[2025.02.23]** 我们发布了Magma的推理代码！\n* **[2025.02.20]** Magma已登上 [Hacker News](https:\u002F\u002Fnews.ycombinator.com\u002Ffront) 的榜首！\n* **[2025.02.19]** 我们将于下周二 [MSR Forum on 02.25](https:\u002F\u002Fresearchforum.microsoft.com\u002F) 上发布我们的代码、模型以及UI导航演示！\n* **[2025.02.18]** 我们在微软研究院的旗舰项目Magma已在 [arXiv](https:\u002F\u002Fwww.arxiv.org\u002Fpdf\u002F2502.13130) 上发布！\n\n## :bookmark_tabs: 待办事项\n我们将陆续发布以下内容：\n- [x] 模型推理代码\n- [x] 添加UI和游戏代理演示\n- [x] 模型检查点\n- [x] 训练代码\n- [x] 带有轨迹的Open-XE预训练数据\n- [x] 带有轨迹的视频预训练数据\n- [ ] SeeClick和Vision2UI预训练数据，附带SoM\n- [ ] UI\u002FLibero微调脚本\n- [ ] 视频微调脚本\n\n## :clipboard: 目录\n- [什么是Magma？](#what-is-magma)\n- [我们如何预训练Magma？](#how-we-pretrain-magma)\n- [安装](#installation)\n- [数据预处理](#data-preprocessing)\n  - [SoM和ToM的生成](#som-and-tom-generation)\n- [模型训练](#model-training)\n  - 
[在Open-X上进行无SoM\u002FToM的预训练](#pretraining-on-open-x-without-somtom)\n  - [在Magma-820K数据集上进行微调](#finetuning-on-magma-820k)\n- [模型使用](#model-usage)\n  - [推理](#inference)\n    - [使用Hugging Face Transformers进行推理](#inference-with-huggingface-transformers)\n    - [使用本仓库的本地代码进行推理](#inference-with-local-code-from-this-repo)\n    - [使用bitsandbytes进行推理](#inference-with-bitsandbytes)\n    - [基准测试](#benchmarking)\n  - [使用lmms-eval进行评估](#evaluation-with-lmms-eval)\n  - [使用SimplerEnv进行评估](#evaluation-with-simplerenv)\n  - [多张图片或视频](#multi-images-or-video)\n  - [API服务器](#api-server) \n  - [代理演示](#agent-demos)\n      - [UI代理](#ui-agent)\n      - [游戏代理](#gaming-agent)\n      - [机器人视觉规划](#robot-visual-planning)\n- [引用](#citation)\n- [致谢](#acknowledgements)\n\n## 什么是Magma？\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_Magma_readme_c773bee8dbd2.png\" width=\"50%\">\n\u003C\u002Fdiv>\n\n**Magma 是一个多模态 AI 代理的基础模型**。作为多模态代理模型的基石，它应当具备强大的多模态感知能力，并能精确地执行目标驱动的行为（见上图）。基于这一理念，我们致力于实现以下目标：\n\n* **语言与时空智能：** Magma 应当同时具备强大的语言理解和时空感知能力，以理解图像和视频、根据观察结果采取行动，并进一步将外部目标转化为行动计划并加以执行。\n* **数字世界与物理世界：** Magma 不应局限于数字世界（如网页导航）或物理世界（如机器人操作），而应能够在两个世界之间无缝切换，正如人类自身一样。\n\n为此，我们开发了一种新的预训练数据集，主要由大量未标注的野外视频以及现有的标注好的代理数据组成；同时，我们还设计了一个全新的预训练框架，能够统一训练文本、图像和动作这三种模态的数据，从而训练出一种面向多模态 AI 代理的新基础模型——Magma。\n\n## 我们如何预训练 Magma？\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_Magma_readme_c4e5b5999469.png\" width=\"100%\">\n\u003C\u002Fdiv>\n\n我们从两个方面来实现这一目标：\n\n* **大规模异构训练数据：** 我们精心收集了大量野外数据，包括现有的多模态理解数据、UI 导航数据、机器人操作数据，以及未标注的野外视频。此外，我们还提出了一套可扩展且经济高效的新数据采集流程，用于收集未标注的野外视频。为了从原始视频和机器人轨迹中提取有用的行动监督信号，我们仔细去除了视频中的相机运动，并将其转换为“动作”监督信号，用于模型训练。这些独特的信号帮助模型学习跨模态的关联性，以及长时程的动作预测与规划。\n  \n* **通用预训练目标：** 文本和动作本质上截然不同，因此存在巨大鸿沟，而视觉特征则是连续的。我们提出了一种统一训练三种模态的通用预训练框架，并证明这对于模型学习跨模态联系至关重要。具体而言，我们提出了“标记集合”（Set-of-Mark）和“标记轨迹”（Trace-of-Mark）作为辅助任务，用以连接不同的输出模态。通过这种方式，我们在文本与动作模态之间，以及图像与动作模态之间，建立了良好的对齐关系。\n\n## 安装\n\n1. 将此仓库克隆到本地：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMagma\ncd Magma\n```\n2. 安装依赖项：\n\n```bash\nconda create -n magma python=3.10 -y\nconda activate magma\npip install --upgrade pip\npip install -e .\n```\n\n3. 安装用于训练的包：\n\n```bash\npip install -e \".[train]\"\n```\n\n4. 安装用于代理的包：\n\n```bash\npip install -e \".[agent]\"\n```\n\n5. 
其他可能需要的包：\n\n* Co-tracker\n```sh\n# 安装 co-tracker\ngit clone https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fco-tracker\ncd co-tracker\npip install -e .\npip install imageio[ffmpeg]\ncd ..\u002F\n```\n\n* Kmeans\n```sh\n# 安装 kmeans_pytorch，注意：使用 pip 安装会导致错误\ngit clone https:\u002F\u002Fgithub.com\u002Fsubhadarship\u002Fkmeans_pytorch\ncd kmeans_pytorch\npip install -e .\ncd ..\u002F\n```\n\n* 其他\n```sh\n# 安装其他包\npip install ipython\npip install faiss-cpu\npip install decord\n```\n\n⚠️ 请确保您已安装正确版本的 transformers（>=4.49.0）。如果遇到异常行为，请检查 transformers 的版本，必要时可参考下方的定制版 transformers。\n\n\u003Cdetails>\n\u003Csummary>点击展开\u003C\u002Fsummary>\n\n### 定制版 Transformers\n\n⚠️ 需要注意的一点是，我们的模型使用 [ConvNext](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpytorch-image-models\u002Fblob\u002Fmain\u002Ftimm\u002Fmodels\u002Fconvnext.py) 作为骨干网络，其中包含一个层缩放参数 [gamma](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpytorch-image-models\u002Fblob\u002Fe44f14d7d2f557b9f3add82ee4f1ed2beefbb30d\u002Ftimm\u002Fmodels\u002Fconvnext.py#L144)。这导致 Transformers 库出现了一个 bug：在加载模型时，它会自动将“gamma”替换为“weight”。为了解决这个问题，我们需要修改 `transformers\u002Fmodels\u002Fauto\u002Fmodeling_auto.py` 文件如下：\n\n```python \nif \"gamma\" in key and \"clip_vision_model\" not in key:\n    key = key.replace(\"gamma\", \"weight\")\n```\n该 bug 在最新版本的 Transformers 中仍然存在。因此，请务必安装以下无 bug 的定制版 transformers，其版本号列于 [pyproject.toml](.\u002Fpyproject.toml) 中：\n\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fjwyang\u002Ftransformers.git@dev\u002Fjwyang-v4.44.1\n```\n\n或者最新的版本：\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fjwyang\u002Ftransformers.git@dev\u002Fjwyang-v4.48.2\n```\n\n\u003C\u002Fdetails>\n\n## 数据预处理\n\n### SoM 和 ToM 的生成\n\n如论文表 1 所示，我们将 SoM 和 ToM 应用于机器人数据和教学视频。为确保实验的可重复性，我们提供了用于生成教学视频 SoM 和 ToM 的代码，位于 `tools\u002Fsom_tom\u002Fdemo.py`。您可以运行以下命令来生成机器人数据的 SoM 和 ToM：\n\n```bash\npython tools\u002Fsom_tom\u002Fdemo.py\n```\n\n随后，您将在 `tools\u002Fsom_tom\u002Fvideos` 文件夹中找到两段视频。原始轨迹由 CoTracker 提取后保存为 `orig_trace.mp4`，而 SoM-ToM 视频则命名为 `som_tom.mp4`。\n\n## 模型训练\n\n我们提供了在 Open-X-Embodiment 数据集上预训练 LLama-3-8B-Instruct，以及在不同下游任务上微调 Magma-8B 的详细说明。\n\n### 在 Open-X 上不使用 SoM\u002FToM 进行预训练\n\n* 数据准备\n\n从官方网站下载 Open-X-Embodiment 数据集。然后相应地编辑数据配置文件 [openx.yaml](data_configs\u002Fopenx.yaml)。数据配置文件应如下所示：\n\n```yaml\n# 所有数据路径的列表\nDATA_PATH: \n  - \"\u002Fpath\u002Fto\u002Fopen-x\"\nIMAGE_FOLDER:\n  - \"siglip-224px+mx-oxe-magic-soup\"    \nLANGUAGE_PATH:\n  - \"\"\n```\n\n* 在 OpenX 上进行预训练\n\n完成数据集和配置的设置后，您可以运行以下命令来微调模型：\n\n```bash\nsh scripts\u002Fpretrain\u002Fpretrain_openx.sh\n```\n* 优点：我们付出了巨大努力，将 Open-X 数据加载器与 OpenVLA 解耦，使其能够兼容我们实验中使用的其他数据集*\n\n### 在 Magma-820K 数据集上进行微调\n\n* 数据准备\n\n从 [MagmaAI\u002FMagma-820K](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FMagmaAI\u002FMagma-820K) 下载标注文件。请根据数据集页面上的图像列表准备图像数据。完成后，请相应地编辑 [magma_820k.yaml](data_configs\u002Fmagma_820k.yaml) 文件。\n\n```yaml\n\n# 所有数据路径的列表\nDATA_PATH: \n  - \"\u002Fpath\u002Fto\u002Fmagma_820k.json\"\nIMAGE_FOLDER:\n  - \"\u002Froot\u002Fto\u002Fmagma_820k\u002Fimages\"\n```\n\n* 从 Magma-8B 进行微调\n\n在设置好数据集和配置后，您可以运行以下命令来微调模型：\n\n```bash\nsh scripts\u002Ffinetune\u002Ffinetune_magma_820k.sh\n```\n\n## 模型使用\n\n### 推理\n\n#### 使用 Hugging Face Transformers 进行推理\n\n我们已将模型上传至 Hugging Face Hub。您可以通过以下代码轻松加载模型和处理器。\n\n\u003Cdetails>\n\u003Csummary>点击展开\u003C\u002Fsummary>\n\n```python\nfrom PIL import Image\nimport torch\nfrom transformers import AutoModelForCausalLM\nfrom transformers import AutoProcessor \n\ndtype = torch.bfloat16\nmodel = 
AutoModelForCausalLM.from_pretrained(\"microsoft\u002FMagma-8B\", trust_remote_code=True, torch_dtype=dtype)\nprocessor = AutoProcessor.from_pretrained(\"microsoft\u002FMagma-8B\", trust_remote_code=True)\nmodel.to(\"cuda\")\n\n# 推理\nimage = Image.open(\".\u002Fassets\u002Fimages\u002Fmagma_logo.jpg\").convert(\"RGB\")\n\nconvs = [\n    {\"role\": \"system\", \"content\": \"你是一个能看、能说、能行动的智能体。\"},            \n    {\"role\": \"user\", \"content\": \"\u003Cimage_start>\u003Cimage>\u003Cimage_end>\\n机器人上是什么字母？\"},\n]\nprompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)\ninputs = processor(images=[image], texts=prompt, return_tensors=\"pt\")\ninputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)\ninputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)\ninputs = inputs.to(\"cuda\").to(dtype)\n\ngeneration_args = { \n    \"max_new_tokens\": 500, \n    \"temperature\": 0.0, \n    \"do_sample\": False, \n    \"use_cache\": True,\n    \"num_beams\": 1,\n} \n\nwith torch.inference_mode():\n    generate_ids = model.generate(**inputs, **generation_args)\n\ngenerate_ids = generate_ids[:, inputs[\"input_ids\"].shape[-1] :]\nresponse = processor.decode(generate_ids[0], skip_special_tokens=True).strip()\n\nprint(response)\n```\n\u003C\u002Fdetails>\n\n#### 使用本仓库中的本地 Transformers 代码进行推理\n\n如果您想调试我们的模型，我们也提供了本地推理代码。您可以运行以下代码来加载模型。\n\u003Cdetails>\n\u003Csummary>点击展开\u003C\u002Fsummary>\n\n```python\nfrom magma.processing_magma import MagmaProcessor\nfrom magma.modeling_magma import MagmaForCausalLM\n\ndtype = torch.bfloat16\nmodel = MagmaForCausalLM.from_pretrained(\"microsoft\u002FMagma-8B\", trust_remote_code=True, torch_dtype=dtype)\nprocessor = MagmaProcessor.from_pretrained(\"microsoft\u002FMagma-8B\", trust_remote_code=True)\nmodel.to(\"cuda\")\n```\n\u003C\u002Fdetails>\n\n#### 使用 bitsandbytes 进行推理\n\n我们还提供了一个使用 bitsandbytes 进行推理的示例代码。您可以运行以下代码来加载模型。\n\n\u003Cdetails>\n\u003Csummary>点击展开\u003C\u002Fsummary>\n\n```python\nfrom PIL import Image\nimport torch\nfrom transformers import AutoModelForCausalLM\nfrom transformers import AutoProcessor \nfrom transformers import BitsAndBytesConfig\n\n# 定义量化配置\nquantization_config = BitsAndBytesConfig(\n    load_in_4bit=True,\n    bnb_4bit_compute_dtype=torch.float16,\n    bnb_4bit_use_double_quant=True,\n    bnb_4bit_quant_type=\"nf4\"\n)\n\n# 加载带有量化配置的模型\nmodel = AutoModelForCausalLM.from_pretrained(\n    \"microsoft\u002FMagma-8B\", \n    trust_remote_code=True,\n    device_map={\"\": 0},  # 强制所有内容加载到 GPU 0\n    quantization_config=quantization_config\n)\nprocessor = AutoProcessor.from_pretrained(\"microsoft\u002FMagma-8B\", trust_remote_code=True)\n\n# 推理\nimage = Image.open(\"assets\u002Fimages\u002Fmagma_logo.jpg\").convert(\"RGB\")\n\nconvs = [\n    {\"role\": \"system\", \"content\": \"你是一个能看、能说、能行动的智能体。\"},            \n    {\"role\": \"user\", \"content\": \"\u003Cimage_start>\u003Cimage>\u003Cimage_end>\\n机器人上是什么字母？\"},\n]\nprompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)\ninputs = processor(images=[image], texts=prompt, return_tensors=\"pt\")\ninputs['pixel_values'] = inputs['pixel_values'].unsqueeze(0)\ninputs['image_sizes'] = inputs['image_sizes'].unsqueeze(0)\n\n# 将输入转换为正确的设备和数据类型\ninputs = {k: v.to(device=model.device, dtype=torch.float16 if v.dtype == torch.float32 else v.dtype) \n          for k, v in inputs.items()}\n\ngeneration_args = { \n    \"max_new_tokens\": 500, \n    \"temperature\": 0.0, \n    \"do_sample\": False, \n    
\"use_cache\": True,\n    \"num_beams\": 1,\n} \n\nwith torch.inference_mode():\n    generate_ids = model.generate(**inputs, **generation_args)\n\ngenerate_ids = generate_ids[:, inputs[\"input_ids\"].shape[-1] :]\nresponse = processor.decode(generate_ids[0], skip_special_tokens=True).strip()\nprint(response)\n```\n\u003C\u002Fdetails>\n\n#### 基准测试\n\n我们对模型在使用和不使用 bitsandbytes 情况下的推理时间和内存占用进行了基准测试。\n\n| 模型 | 推理时间 | 内存峰值 |\n|-------|----------------|--------------|\n| Magma-8B (bfloat16) | 1.1秒 | 17GB |\n| Magma-8B (4-bit) | 1.1秒 | 7GB |\n\n### 使用 lmms-eval 进行评估\n\n请参阅 [lmms-eval-instruction](tools\u002Flmms-eval-magma) 以获取使用 lmms-eval 工具包进行评估的详细说明。\n\n一切准备就绪后，您可以在根目录下运行以下代码来评估我们的模型。\n\n```bash\nsh scripts\u002Fevaluation\u002Flmms-eval\u002Flmms_eval_magma.sh\n```\n\n您可以通过修改变量 `eval_tasks` 来评估其他基准。运行以下代码可以查看 `eval_tasks` 的列表：\n```\n# lmms-eval --tasks {list_groups,list_subtasks,list_tags,list}\nlmms-eval --tasks list_groups\n```\n\n### 使用 SimplerEnv 进行评估\n\n请参阅 [SimplerEnv-instruction](tools\u002Fsimplerenv-magma) 以获取使用 SimplerEnv 工具包进行评估的详细说明。\n\n一切准备就绪后，您可以通过运行以下代码来评估我们的模型。\n\n```bash\nsh scripts\u002Fevaluation\u002Fsimplerenv\u002Fbridge.sh\n```\n\n### 多张图片或视频支持\n\n对于我们的模型来说，处理多张图片非常简单。您只需在文本提示中重复占位符，并相应地将所有图片添加到列表中即可。一个示例代码如下：\n\n```py\nconvs = [\n    {\"role\": \"system\", \"content\": \"你是一个能看、能说、能行动的智能体。\"},            \n    {\"role\": \"user\", \"content\": \"\u003Cimage_start>\u003Cimage>\u003Cimage_end>\\n\u003Cimage_start>\u003Cimage>\u003Cimage_end>\\n\u003Cimage_start>\u003Cimage>\u003Cimage_end>\\n机器人上是什么字母？\"},\n]\nprompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)\ninputs = processor(images=[image1,image2,image3], texts=prompt, return_tensors=\"pt\")\n```\n\n我们的模型会自动为您完成视觉标记的填充！\n\n### API 服务器\n\n我们提供一个基于 FastAPI 的服务器，用于将 Magma 部署为 REST API 服务，从而实现以下功能：\n- 通过 REST 端点进行视觉和语言处理\n- 为机器人应用提供动作预测\n- 支持 base64 编码的图片以及文件上传\n\n该服务器可以通过三种方式部署：\n1. **直接运行**：最简单的开发方式\n2. **Docker 容器**：推荐用于生产环境\n3. 
**原生系统服务**：适用于系统集成场景\n\n#### 快速入门\n\n```bash\ncd server\n.\u002Fmagma-server.sh run\n```\n\n这将设置一个 conda 环境、安装依赖项，并在端口 8080 上启动服务器。\n\n#### Docker 部署\n\n```bash\ncd server\n.\u002Fmagma-server.sh docker up\n```\n\n#### API 端点\n\nAPI 提供以下端点：\n- `GET \u002Fhealth` - 检查服务器是否运行及模型是否已加载\n- `POST \u002Fpredict` - 使用 base64 编码的图片进行预测\n- `POST \u002Fpredict_from_file` - 使用上传的图片文件进行预测\n\n更多详细信息，请参阅 [Server README](server\u002FREADME.md)。\n\n### 代理演示\n\n#### UI 代理\n\n我们为自己的模型构建了代理模型。首个构建的是 UI 代理演示。由于我们的模型使用 Set-of-Mark 和 Trace-of-Mark 进行预训练，因此与 [OmniParser](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FOmniParser) 天然具有协同效应。将两者结合，您即可立即获得一个 UI 代理，运行命令如下：\n\n```bash\npython agents\u002Fui_agent\u002Fapp.py\n```\n\n更重要的是，我们的 Magma 模型不仅具备动作接地能力，还拥有多模态理解与推理能力。您不仅可以使用文本让模型预测点击位置：\n\n```bash\n前往排名第一的帖子\n```\n\n还可以随时提出自由问题！只需在文本提示前加上前缀“Q:”，例如：\n\n```bash\nQ: 这篇帖子的标题是什么？\n```\n\n#### 游戏代理\n\n我们还构建了一个游戏代理演示。您可以运行以下命令来启动演示：\n\n```bash\npython agents\u002Fgaming_agent\u002Fapp.py\n```\n\n演示运行后，您将看到一个机器人主动收集绿色方块。\n\n\u003C!-- 以下是 Magma 与其他同类 VLM 的对比：\n\n\u003Cdiv align=\"center\">\n\u003Cvideo width=\"48%\" controls autoplay>\n    \u003Csource src=\"https:\u002F\u002Fmicrosoft.github.io\u002FMagma\u002Fstatic\u002Fvideos\u002Fmagma_vs_llava.mp4\" type=\"video\u002Fmp4\">\n    \u003Cp>Magma 与 LLaVA-OneVision 对比。\u003C\u002Fp>\n\u003C\u002Fvideo>\n\u003Cvideo width=\"48%\" controls autoplay>\n    \u003Csource src=\"https:\u002F\u002Fmicrosoft.github.io\u002FMagma\u002Fstatic\u002Fvideos\u002Fmagma_vs_qwen.mp4\" type=\"video\u002Fmp4\">\n    \u003Cp>Magma 与 Qwen-2.0 对比。\u003C\u002Fp>\n\u003C\u002Fvideo>\n\u003C\u002Fdiv> -->\n\n#### 机器人视觉规划\n\n我们也构建了机器人视觉规划演示。您可以运行以下命令来启动演示：\n\n```bash\npython agents\u002Frobot_traj\u002Fapp.py\n```\n\n在此演示中，您可能会遇到此 [issue](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMagma\u002Fissues\u002F43) 中讨论的问题，快速解决方法是运行以下命令：\n\n```sh\npip install imageio[ffmpeg]\n```\n\n如果仍然无效，请安装较旧版本的 transformers：\n```sh\npip install git+https:\u002F\u002Fgithub.com\u002Fjwyang\u002Ftransformers.git@dev\u002Fjwyang-v4.44.1\n```\n\n\u003C!-- 一些示例输出：\n\n\u003Cdiv align=\"center\">\n\u003Cvideo width=\"48%\" controls autoplay>\n    \u003Csource src=\"assets\u002Fvideos\u002Frobot_pick_up_chip_bag.mp4\" type=\"video\u002Fmp4\">\n    \u003Cp>任务：拿起薯片袋。\u003C\u002Fp>\n\u003C\u002Fvideo>\n\u003Cvideo width=\"48%\" controls autoplay>\n    \u003Csource src=\"assets\u002Fvideos\u002Frobot_push_chip_bag_to_left_edge_of_table.mp4\" type=\"video\u002Fmp4\">\n    \u003Cp>任务：将薯片袋推到桌子左边缘。\u003C\u002Fp>\n\u003C\u002Fvideo>\n\u003C\u002Fdiv> -->\n\n## 用户指南\n\n\u003C!-- 本节适用于无需微调或接入更大生态系统的模型使用场景。 -->\n### 直接使用\n\n该模型旨在以英语为基础，广泛应用于研究领域。模型接受图像和文本作为输入，并生成文本输出，可用于以下用途：\n\n* **基于图像\u002F视频的文本生成：** 模型可以根据输入的文本和图像生成文本（如描述、答案）。\n* **视觉规划能力：** 模型还可以生成视觉轨迹，作为完成任务的未来规划（如将物体从一处移动到另一处）。\n* **智能体能力：** 模型还能生成 UI 接地指令（如点击“搜索”按钮）以及机器人操作指令（如机器人夹爪的 7 自由度控制）。\n\n我们的模型仅用于研究目的，旨在促进多模态人工智能领域的知识共享与研究加速，尤其是多模态智能体 AI 方面的研究。\n\n### 下游应用\n\n该模型可进一步微调以适应不同的下游任务，例如：\n* **图像字幕与问答：** 我们可以在此模型的基础上，结合多模态大语言模型的流水线，进一步微调用于图像字幕和问答任务。根据我们的实验，该模型在这些任务上表现出色，同时在空间理解与推理方面更具优势。\n* **视频字幕与问答：** 同样地，我们也可以将其微调用于视频字幕和问答任务。实验表明，该模型在时间理解与推理方面表现优异，性能可与现有最佳模型媲美。\n* **UI 导航：** 该模型还可针对特定的 UI 导航任务进行微调，例如网页导航或移动应用导航。在这些任务上，模型的表现尤为突出。\n* **机器人操控：** 由于该模型具备视觉-语言-行动一体化的通用智能体能力，因此非常适合进一步微调以应用于机器人任务。微调后，该模型在机器人操控任务上的表现显著优于当前最先进的模型，如 OpenVLA。\n\n## 偏见、风险与局限性\n\n请注意，该模型并非专门为所有下游用途而设计或评估。开发者在选择具体应用场景时，应充分考虑语言模型的常见局限性，并在实际使用前对准确性、安全性及公平性进行评估与缓解，尤其是在高风险场景下。此外，开发者还需了解并遵守与其应用场景相关的适用法律或法规（包括隐私保护、贸易合规等）。\n\n\n## 
引用\n如果您在研究中使用本模型，请考虑引用以下文献：\n\n```bibtex\n@misc{yang2025magmafoundationmodelmultimodal,\n      title={Magma: 多模态智能体的基础模型}, \n      author={Jianwei Yang、Reuben Tan、Qianhui Wu、Ruijie Zheng、Baolin Peng、Yongyuan Liang、Yu Gu、Mu Cai、Seonghyeon Ye、Joel Jang、Yuquan Deng、Lars Liden、Jianfeng Gao},\n      year={2025},\n      eprint={2502.13130},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.13130}, \n}\n```\n\n## 致谢\n\n我们的工作得到了微软研究院的支持。我们感谢所有为构建该项目付出努力的贡献者。\n\n我们的工作建立在一些令人惊叹的开源项目之上，包括 [Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers)、[LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA)、[OpenVLA](https:\u002F\u002Fgithub.com\u002Fopenvla\u002Fopenvla)、[SeeClick](https:\u002F\u002Fgithub.com\u002Fnjucckevin\u002FSeeClick)、[Mind2Web](https:\u002F\u002Fgithub.com\u002FOSU-NLP-Group\u002FMind2Web)，以及多个优秀的开源数据集，如 [Ego4d](https:\u002F\u002Fego4d-data.org\u002F)、[Epic-Kitchen](https:\u002F\u002Fepic-kitchens.github.io\u002F2025)、[Something-Somethingv2](https:\u002F\u002Fwww.qualcomm.com\u002Fdeveloper\u002Fartificial-intelligence\u002Fdatasets)、[Open-X-Embodiment](https:\u002F\u002Frobotics-transformer-x.github.io\u002F)。此外，我们还使用了若干评估基准，例如 [SimplerEnv](https:\u002F\u002Fgithub.com\u002Fsimpler-env\u002FSimplerEnv) 和 [Libero](https:\u002F\u002Fgithub.com\u002FLifelong-Robot-Learning\u002FLIBERO)。\n\n## 许可证\n\n本项目采用 MIT 许可证授权。详情请参阅 [LICENSE](LICENSE) 文件。\n\n## 贡献\n\n本项目欢迎各类贡献和建议。大多数贡献都需要您签署贡献者许可协议（CLA），以声明您有权并将您的贡献权利授予我们使用。有关详细信息，请访问 https:\u002F\u002Fcla.opensource.microsoft.com。\n\n当您提交拉取请求时，CLA 机器人会自动判断您是否需要提供 CLA，并相应地标记您的 PR（例如添加状态检查或评论）。您只需按照机器人提供的指示操作即可。对于使用我们 CLA 的所有仓库，您只需完成一次此步骤。\n\n本项目已采纳 [微软开源行为准则](https:\u002F\u002Fopensource.microsoft.com\u002Fcodeofconduct\u002F)。更多信息请参阅 [行为准则常见问题解答](https:\u002F\u002Fopensource.microsoft.com\u002Fcodeofconduct\u002Ffaq\u002F) 或发送邮件至 [opencode@microsoft.com](mailto:opencode@microsoft.com) 咨询更多问题或意见。\n\n## 商标\n\n本项目可能包含与项目、产品或服务相关的商标或标识。未经授权使用微软商标或标识须遵守并遵循 [微软商标与品牌指南](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Flegal\u002Fintellectualproperty\u002Ftrademarks\u002Fusage\u002Fgeneral)。在本项目的修改版本中使用微软商标或标识时，不得造成混淆或暗示微软的赞助关系。任何第三方商标或标识的使用均应遵守该第三方的相关政策。","# Magma 快速上手指南\n\nMagma 是微软研究院推出的首个面向多模态 AI 智能体（Multimodal AI Agents）的基础模型。它不仅能理解图像和视频，还能在数字世界（如 UI 导航、游戏）和物理世界（如机器人操作）中生成目标驱动的视觉计划与动作。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04+) 或 macOS\n*   **Python 版本**: 3.10\n*   **GPU**: 推荐使用 NVIDIA GPU (支持 CUDA)，显存建议 16GB 以上以运行 8B 模型\n*   **包管理器**: Conda (推荐) 或 pip\n*   **Git**: 用于克隆代码库\n\n> **注意**：本指南基于官方仓库内容整理。由于网络原因，国内开发者在安装依赖时若遇到超时，可临时配置 pip 使用清华或阿里镜像源（例如：`pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple ...`）。\n\n## 安装步骤\n\n### 1. 克隆项目代码\n首先将 Magma 仓库克隆到本地并进入目录：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMagma\ncd Magma\n```\n\n### 2. 创建并激活虚拟环境\n建议使用 Conda 创建独立的 Python 3.10 环境：\n\n```bash\nconda create -n magma python=3.10 -y\nconda activate magma\npip install --upgrade pip\n```\n\n### 3. 安装核心依赖\n安装项目基础包：\n\n```bash\npip install -e .\n```\n\n### 4. 安装可选组件（按需）\n如果您需要进行模型训练或运行智能体演示，请执行以下命令：\n\n```bash\n# 安装训练所需依赖\npip install -e \".[train]\"\n\n# 安装智能体演示所需依赖\npip install -e \".[agent]\"\n```\n\n### 5. 
安装额外工具包\nMagma 依赖一些特定的第三方库（如 Co-tracker, Kmeans 等），需手动安装：\n\n**安装 Co-tracker:**\n```sh\ngit clone https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fco-tracker\ncd co-tracker\npip install -e .\npip install imageio[ffmpeg]\ncd ..\u002F\n```\n\n**安装 Kmeans (注意不要直接用 pip 安装原版):**\n```sh\ngit clone https:\u002F\u002Fgithub.com\u002Fsubhadarship\u002Fkmeans_pytorch\ncd kmeans_pytorch\npip install -e .\ncd ..\u002F\n```\n\n**安装其他杂项依赖:**\n```sh\npip install ipython faiss-cpu decord\n```\n\n### ⚠️ 关键步骤：安装定制版 Transformers\nMagma 使用 ConvNext 作为骨干网络，其中包含 `gamma` 参数，这与原生 Hugging Face Transformers 库存在兼容性冲突（会自动将 `gamma` 替换为 `weight` 导致报错）。\n\n**必须**安装项目维护的修复版 Transformers，请选择以下任一版本：\n\n```bash\n# 方案 A：稳定版（推荐）\npip install git+https:\u002F\u002Fgithub.com\u002Fjwyang\u002Ftransformers.git@dev\u002Fjwyang-v4.44.1\n\n# 方案 B：最新版\n# pip install git+https:\u002F\u002Fgithub.com\u002Fjwyang\u002Ftransformers.git@dev\u002Fjwyang-v4.48.2\n```\n*请确保安装的是上述定制分支，而不是 PyPI 上的官方 transformers 版本，否则模型加载会出现异常；若推理时输出大量重复 token，建议优先使用方案 A（v4.44.1）。*\n\n## 基本使用\n\n安装完成后，您可以通过 Hugging Face `transformers` 库直接加载模型进行推理。以下是最简单的单图推理示例。\n\n### 使用 Hugging Face Transformers 进行推理\n\n确保已登录 Hugging Face（如果需要访问私有权重），或直接运行以下代码加载公开模型 `microsoft\u002FMagma-8B`：\n\n```python\nimport torch\nfrom PIL import Image\nfrom transformers import AutoProcessor, AutoModelForCausalLM\n\n# 加载处理器和模型\nmodel_id = \"microsoft\u002FMagma-8B\"\nprocessor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_id,\n    torch_dtype=torch.bfloat16,\n    device_map=\"auto\",\n    trust_remote_code=True\n)\n\n# 准备输入：加载一张本地图片（路径仅为示例），并按前文多图示例的格式构造带图像占位符的对话\nimage = Image.open(\"assets\u002Fimages\u002Fdemo.png\").convert(\"RGB\")\nconvs = [\n    {\"role\": \"system\", \"content\": \"你是一个能看、能说、能行动的智能体。\"},\n    {\"role\": \"user\", \"content\": \"\u003Cimage_start>\u003Cimage>\u003Cimage_end>\\nDescribe this image and suggest possible actions.\"},\n]\nprompt = processor.tokenizer.apply_chat_template(convs, tokenize=False, add_generation_prompt=True)\n\n# 处理输入（images 传入 PIL 图像列表，文本使用 texts 参数，与前文示例保持一致）\ninputs = processor(images=[image], texts=prompt, return_tensors=\"pt\").to(model.device)\n\n# 生成输出\nwith torch.inference_mode():\n    output_ids = model.generate(\n        **inputs,\n        max_new_tokens=512,\n        do_sample=False\n    )\n\n# 解码结果：仅解码新生成的 token，去掉输入部分\noutput_ids = output_ids[:, inputs[\"input_ids\"].shape[-1]:]\noutput_text = processor.decode(output_ids[0], skip_special_tokens=True).strip()\nprint(output_text)\n```\n\n### 运行智能体演示 (可选)\n如果您安装了 `[agent]` 依赖，可以尝试运行官方的 UI 导航或机器人规划演示：\n\n```bash\n# 启动机器人轨迹规划演示\npython agents\u002Frobot_traj\u002Fapp.py\n```\n\n现在您已经成功部署了 Magma，可以开始探索其在多模态理解与智能体任务中的强大能力。更多高级用法（如视频输入、多图推理、微调训练）请参考项目仓库中的详细文档。","某智能家居机器人开发团队正致力于让机器人通过观察人类操作视频，自主学会在复杂物理环境中完成“整理桌面”等长序列任务。\n\n### 没有 Magma 时\n- **感知与行动割裂**：团队需分别训练视觉理解模型和动作控制策略，导致机器人能“看懂”物体却无法精准规划抓取路径，空间推理能力极弱。\n- **数据标注成本高昂**：为了让机器人学会新任务，工程师必须人工逐帧标注大量操作视频中的关键动作和物体状态，耗时数周且难以规模化。\n- **泛化能力受限**：模型仅在特定实验室环境下有效，一旦光照变化或物品摆放位置微调，机器人便无法适应，需重新采集数据训练。\n- **多模态协同困难**：处理涉及屏幕指令（数字世界）与机械臂操作（物理世界）的混合任务时，需编写复杂的规则引擎桥接两个系统，维护成本极高。\n\n### 使用 Magma 后\n- **端到端多模态决策**：Magma 作为统一基座模型，直接输入视频即可生成包含空间推理的目标驱动计划，机器人能精准理解“把红色杯子移到笔记本电脑左侧”并执行。\n- **无监督视频学习**：利用 Magma 可扩展的预训练策略，团队直接投喂海量未标注的野外操作视频，模型自动提取视觉轨迹（Visual Traces），将数据准备周期从数周缩短至数天。\n- **强大的零样本泛化**：凭借在数字与物理双世界的预训练，Magma 让机器人在未见过的家庭场景中也能灵活调整策略，无需针对新环境重新训练。\n- **跨域任务无缝切换**：单个 Magma 模型即可同时处理手机 UI 导航和实体机械臂操控，不再需要额外的规则桥接，实现了真正的通用智能体架构。\n\nMagma 通过统一数字与物理世界的感知行动闭环，将多模态智能体的开发从繁琐的定制化拼凑转变为高效的可扩展学习。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_Magma_d7a05754.png","microsoft","Microsoft","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fmicrosoft_4900709c.png","Open source projects and samples from
Microsoft",null,"opensource@microsoft.com","OpenAtMicrosoft","https:\u002F\u002Fopensource.microsoft.com","https:\u002F\u002Fgithub.com\u002Fmicrosoft",[82,86,90],{"name":83,"color":84,"percentage":85},"Python","#3572A5",95.8,{"name":87,"color":88,"percentage":89},"Shell","#89e051",4.1,{"name":91,"color":92,"percentage":93},"Dockerfile","#384d54",0.1,1912,156,"2026-04-06T10:54:49","MIT","Linux","需要 NVIDIA GPU（隐含，因依赖 CUDA 生态及 bitsandbytes），具体显存和 CUDA 版本未说明","未说明",{"notes":102,"python":103,"dependencies":104},"必须安装特定分支的定制版 transformers (git+https:\u002F\u002Fgithub.com\u002Fjwyang\u002Ftransformers.git) 以修复 ConvNext 骨干网络中 'gamma' 参数的加载错误。项目推荐使用 conda 管理环境。额外依赖包括 co-tracker 和 kmeans_pytorch，需从 GitHub 源码安装而非 pip。模型支持 8B 参数规模，推理可能需要较大显存。","3.10",[105,106,107,108,109,110,111,112],"torch","transformers>=4.49.0 (需使用定制版)","co-tracker","kmeans_pytorch","faiss-cpu","decord","imageio[ffmpeg]","ipython",[35,15,60,13,114],"其他","2026-03-27T02:49:30.150509","2026-04-07T00:47:02.806272",[118,123,128,133,138,143,147],{"id":119,"question_zh":120,"answer_zh":121,"source_url":122},20629,"运行 Hugging Face 示例代码时，模型输出大量重复的 token（如 'the the the...'），如何解决？","这通常是由于使用了不兼容的 transformers 版本或模型参数加载错误导致的。解决方案是安装特定版本的自定义 transformers 库。请运行以下命令：\npip install -U git+https:\u002F\u002Fgithub.com\u002Fjwyang\u002Ftransformers.git@dev\u002Fjwyang-v4.44.1\n注意：README 中提到的较新版本（如 v4.48.2）可能仍然无法工作，建议回退到 v4.44.1 版本。如果问题依旧，建议在全新的虚拟环境中重新安装并检查模型文件。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMagma\u002Fissues\u002F39",{"id":124,"question_zh":125,"answer_zh":126,"source_url":127},20630,"如何在 Mind2Web 或 AITW 等数据集上生成用于 UI 导航任务的 SoM（Set-of-Mark）标注？","可以使用项目自带的脚本生成 UI 截图的 SoM 标注。具体参考 `agents\u002Fui_agent\u002Fapp.py` 文件的第 170 行代码来生成 SoM。对于标注文件的格式，通常需要遵循 JSON 中的 \"conversations\" 列表格式，其中目标输出为基于 SoM 的 Mark ID。维护者表示会上传相关的示例代码供参考。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMagma\u002Fissues\u002F64",{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},20631,"运行机器人轨迹演示程序 (robot_traj\u002Fapp.py) 时报错 'expected bytes, NoneType found' 且无视频输出，如何修复？","该问题已被修复。此错误通常出现在 `robot_traj\u002Futils\u002Fvisualizer.py` 中，原因是视频写入时编解码器参数为空。请确保拉取最新的代码更新。此外，该演示主要用于展示 2D 视觉轨迹规划；如果需要检查 7 自由度 (7DoF) 的输出或进行更复杂的评估，请参考 README 中关于 'Evaluation with SimplerEnv' 的部分进行设置。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMagma\u002Fissues\u002F43",{"id":134,"question_zh":135,"answer_zh":136,"source_url":137},20632,"Magma-8B 模型权重是否已在 Hugging Face 上公开发布？","是的，模型已经发布。早期用户遇到的“模型未发布”问题已解决。如果您现在无法下载，请确认您访问的是正确的 Hugging Face 仓库地址，并确保网络连接正常。该 Issue 已因问题不再有效而关闭。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMagma\u002Fissues\u002F8",{"id":139,"question_zh":140,"answer_zh":141,"source_url":142},20633,"在计算 ToM (Trajectory of Motion) 时，对于相机运动较大的视频，正负轨迹的判断是否准确？需要调整哪些参数？","在 `tools\u002Fsom_tom\u002Fdemo.py` 中，正负轨迹是基于单应性变换前的 `trace_lengths` 确定的。对于相机运动剧烈的视频，这可能导致几乎所有点都被标记为正，从而影响 ToM 质量。建议尝试在单应性变换之后、判断逻辑之前更新 `trace_lengths`。此外，针对不同数据集，可能需要调整 `epsilon` 参数或其他阈值（如代码第 34 行的 0.5）以优化效果。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FMagma\u002Fissues\u002F42",{"id":144,"question_zh":145,"answer_zh":146,"source_url":132},20634,"Magma 模型能否预测包含深度信息的 3D 轨迹坐标（如 x, y, z, pitch, roll, yaw）？","目前的机器人轨迹演示程序主要生成和展示 2D 图像平面上的视觉轨迹规划点。虽然模型分析可以输出正确的移动指令，但当前的可视化流程和默认输出侧重于 2D 位置。如果需要处理从图像到 3D 世界的映射或获取完整的 3D 坐标，可能需要结合额外的深度估计模块或使用 SimplerEnv 环境进行进一步的 3D 任务评估，目前原生代码主要针对 2D 轨迹。",{"id":148,"question_zh":149,"answer_zh":150,"source_url":137},20635,"在 Jetson AGX Orin (aarch64) 架构上安装和运行 Magma 是否支持？","根据社区反馈，目前在 x86_64 架构上安装运行正常，但在 aarch64 架构（如 Jetson IGX\u002FAGX 
Orin）上存在库兼容性问题，导致安装或运行失败。维护者正在尝试解决这些依赖冲突。如果您必须在 Orin 上运行，可能需要手动解决底层库的编译问题，或者等待官方对 aarch64 的正式支持更新。",[]]