[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-OpenMotionLab--MotionGPT":3,"tool-OpenMotionLab--MotionGPT":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":79,"owner_twitter":79,"owner_website":79,"owner_url":80,"languages":81,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":10,"env_os":98,"env_gpu":99,"env_ram":100,"env_deps":101,"category_tags":109,"github_topics":110,"view_count":23,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":121,"updated_at":122,"faqs":123,"releases":154},2247,"OpenMotionLab\u002FMotionGPT","MotionGPT","[NeurIPS 2023] MotionGPT: Human Motion as a Foreign Language, a unified motion-language generation model using LLMs","MotionGPT 是一款将人体动作视为“外语”的统一生成模型，旨在打通自然语言与三维动作数据之间的语义壁垒。它主要解决了以往模型难以同时高质量处理多种动作任务（如根据文字生成动作、为动作添加描述、动作预测及中间帧补全）的难题，实现了跨模态数据的深度耦合。\n\n这款工具非常适合人工智能研究人员、动画开发者以及需要智能动作生成能力的创作者使用。对于研究者，它提供了基于大语言模型（LLM）的多模态预训练新范式；对于开发者与设计师，它能通过简单的文本指令快速生成流畅、符合语义的 3D 人物动画，大幅降低内容创作门槛。\n\nMotionGPT 的核心技术亮点在于其独特的“动作词汇表”构建方式。它利用离散向量量化技术，将复杂的 3D 动作序列转化为类似单词的“动作令牌（Motion Tokens）”，从而让大语言模型能够像处理文本一样直接理解和生成动作。此外，项目引入了提示学习（Prompt Learning）机制，支持通过问答形式灵活交互。作为 NeurIPS 2023 的收录成果，MotionGPT 在多项基准测试中均达到了业界领先水平，并提供了便捷的 HuggingFace 演示与开源代码，便","MotionGPT 
是一款将人体动作视为“外语”的统一生成模型，旨在打通自然语言与三维动作数据之间的语义壁垒。它主要解决了以往模型难以同时高质量处理多种动作任务（如根据文字生成动作、为动作添加描述、动作预测及中间帧补全）的难题，实现了跨模态数据的深度耦合。\n\n这款工具非常适合人工智能研究人员、动画开发者以及需要智能动作生成能力的创作者使用。对于研究者，它提供了基于大语言模型（LLM）的多模态预训练新范式；对于开发者与设计师，它能通过简单的文本指令快速生成流畅、符合语义的 3D 人物动画，大幅降低内容创作门槛。\n\nMotionGPT 的核心技术亮点在于其独特的“动作词汇表”构建方式。它利用离散向量量化技术，将复杂的 3D 动作序列转化为类似单词的“动作令牌（Motion Tokens）”，从而让大语言模型能够像处理文本一样直接理解和生成动作。此外，项目引入了提示学习（Prompt Learning）机制，支持通过问答形式灵活交互。作为 NeurIPS 2023 的收录成果，MotionGPT 在多项基准测试中均达到了业界领先水平，并提供了便捷的 HuggingFace 演示与开源代码，便于用户快速上手体验。","\u003Cdiv align= \"center\">\n    \u003Ch1> Official repo for MotionGPT \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMotionLab_MotionGPT_readme_aeddd3544f46.jpg\" width=\"35px\">\u003C\u002Fh1>\n\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n    \u003Ch2> \u003Ca href=\"https:\u002F\u002Fmotion-gpt.github.io\u002F\">MotionGPT: Human Motion as a Foreign Language\u003C\u002Fa>\u003C\u002Fh2>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fmotion-gpt.github.io\u002F\">Project Page\u003C\u002Fa> •\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.14795\">Arxiv Paper\u003C\u002Fa> •\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenMotionLab\u002FMotionGPT\">HuggingFace Demo\u003C\u002Fa> •\n  \u003Ca href=\"#️-faq\">FAQ\u003C\u002Fa> •\n  \u003Ca href=\"#-citation\">Citation\n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\n\u003C!-- \u003Cimg src=\"https:\u002F\u002Fcdn.discordapp.com\u002Fattachments\u002F941582479117127680\u002F1111543600879259749\u002F20230526075532.png\" width=\"350px\"> -->\n\n|                                                   Teaser Video                                                   |                                                    Demo Video                                                    |\n| :--------------------------------------------------------------------------------------------------------------: | 
:--------------------------------------------------------------------------------------------------------------: |\n| \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002FOpenMotionLab\u002FMotionGPT\u002Fassets\u002F120085716\u002Fa741e162-b2f4-4f65-af8e-aa19c4115a9e\" \u002F> | \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002FOpenMotionLab\u002FMotionGPT\u002Fassets\u002F120085716\u002Fae966d17-6326-43e6-8d5b-8562cf3ffd52\" \u002F> |\n\n\u003C\u002Fdiv>\n\n\u003C!-- ### [MotionGPT: Human Motion as a Foreign Language](https:\u002F\u002Fmotion-gpt.github.io\u002F) -->\n\u003C!-- ### [Project Page](https:\u002F\u002Fmotion-gpt.github.io\u002F) | [Arxiv Paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.14795) | [HuggingFace Demo](xxx) -->\n\n## 🏃 Intro MotionGPT\n\nMotionGPT is a **unified** and **user-friendly** motion-language model to learn the semantic coupling of two modalities and generate high-quality motions and text descriptions on **multiple motion tasks**.\n\n\u003Cdetails>\n    \u003Csummary>\u003Cb>Technical details\u003C\u002Fb>\u003C\u002Fsummary>\n\nThough the advancement of pre-trained large language models unfolds, the exploration of building a unified model for language and other multi-modal data, such as motion, remains challenging and untouched so far. Fortunately, human motion displays a semantic coupling akin to human language, often perceived as a form of body language. By fusing language data with large-scale motion models, motion-language pre-training that can enhance the performance of motion-related tasks becomes feasible. Driven by this insight, we propose MotionGPT, a unified, versatile, and user-friendly motion-language model to handle multiple motion-relevant tasks. Specifically, we employ the discrete vector quantization for human motion and transfer 3D motion into motion tokens, similar to the generation process of word tokens. 
Building upon this “motion vocabulary”, we perform language modeling on both motion and text in a unified manner, treating human motion as a specific language. Moreover, inspired by prompt learning, we pre-train MotionGPT with a mixture of motion-language data and fine-tune it on prompt-based question-and-answer tasks. Extensive experiments demonstrate that MotionGPT achieves state-of-the-art performances on multiple motion tasks including text-driven motion generation, motion captioning, motion prediction, and motion in-between.\n\n\u003Cimg width=\"1194\" alt=\"pipeline\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMotionLab_MotionGPT_readme_f0c43573ffaf.png\">\n\u003C\u002Fdetails>\n\n## 🚩 News\n\n- [2025\u002F06\u002F30] Release 🔥\u003Ca href=\"https:\u002F\u002Fmotiongpt3.github.io\u002F\"> MotionGPT3 \u003C\u002Fa>🔥 **A bimodal motion-language framework using MoT architecture.**\n- [2023\u002F09\u002F22] MotionGPT got accepted by NeurIPS 2023\n- [2023\u002F09\u002F11] Release the \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenMotionLab\u002FMotionGPT\">huggingface demo\u003C\u002Fa>  🔥🔥🔥\n- [2023\u002F09\u002F09] Release the training of MotionGPT V1.0 🔥🔥🔥\n- [2023\u002F06\u002F20] Upload paper and init project\n\n## ⚡ Quick Start\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>Setup and download\u003C\u002Fb>\u003C\u002Fsummary>\n\n### 1. Conda environment\n\n```\nconda create python=3.10 --name mgpt\nconda activate mgpt\n```\n\nInstall the packages in `requirements.txt` and install [PyTorch 2.0](https:\u002F\u002Fpytorch.org\u002F)\n\n```\npip install -r requirements.txt\npython -m spacy download en_core_web_sm\n```\n\nWe test our code on Python 3.10.6 and PyTorch 2.0.0.\n\n### 2. 
Dependencies\n\nRun the scripts to download the dependency materials:\n\n```\nbash prepare\u002Fdownload_smpl_model.sh\nbash prepare\u002Fprepare_t5.sh\n```\n\nFor text-to-motion evaluation:\n\n```\nbash prepare\u002Fdownload_t2m_evaluators.sh\n```\n\n### 3. Pre-trained model\n\nRun the script to download the pre-trained models:\n\n```\nbash prepare\u002Fdownload_pretrained_models.sh\n```\n\n### 4. (Optional) Download manually\n\nVisit [Google Drive](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F10s5HXSFqd6UTOkW2OMNc27KGmMLkVc2L) to download the previous dependencies.\n\nVisit [Hugging Face](https:\u002F\u002Fhuggingface.co\u002FOpenMotionLab) to download the pretrained models.\n\n\u003C\u002Fdetails>\n\n## ▶️ Demo\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>Webui\u003C\u002Fb>\u003C\u002Fsummary>\n\nRun the following script to launch the webui, then visit [0.0.0.0:8888](http:\u002F\u002F0.0.0.0:8888)\n\n```\npython app.py\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>Batch demo\u003C\u002Fb>\u003C\u002Fsummary>\n\nThe batch demo takes a txt file as input; output motions are saved as npy files and output texts as txt files. Check `configs\u002Fassets.yaml` for the path config; `TEST.FOLDER` sets the output folder.\n\nThen, run the following script:\n\n```\npython demo.py --cfg .\u002Fconfigs\u002Fconfig_h3d_stage3.yaml --example .\u002Fdemos\u002Ft2m.txt\n```\n\nSome parameters:\n\n- `--example=.\u002Fdemos\u002Ft2m.txt`: input file of text prompts\n- `--task=t2m`: evaluation task, one of t2m, m2t, pred, inbetween\n\nThe outputs:\n\n- `npy file`: the generated motions with the shape of (nframe, 22, 3)\n- `txt file`: the input text prompt or text output\n\u003C\u002Fdetails>\n\n## 💻 Train your own models\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>Training guidance\u003C\u002Fb>\u003C\u002Fsummary>\n\n### 1. Prepare the datasets\n\n1. 
Please refer to [HumanML3D](https:\u002F\u002Fgithub.com\u002FEricGuo5513\u002FHumanML3D) for text-to-motion dataset setup.\n\n2. Put the instructions data in `prepare\u002Finstructions` to the same folder of HumanML3D dataset.\n\n### 2.1. Ready to train motion tokenizer model\n\nPlease first check the parameters in `configs\u002Fconfig_h3d_stage1.yaml`, e.g. `NAME`,`DEBUG`.\n\nThen, run the following command:\n\n```\npython -m train --cfg configs\u002Fconfig_h3d_stage1.yaml --nodebug\n```\n\n### 2.2. Ready to pretrain MotionGPT model\n\nPlease update the parameters in `configs\u002Fconfig_h3d_stage2.yaml`, e.g. `NAME`,`DEBUG`,`PRETRAINED_VAE` (change to your `latest ckpt model path` in previous step)\n\nThen, run the following command to store all motion tokens of training set for convenience\n\n```\npython -m scripts.get_motion_code --cfg configs\u002Fconfig_h3d_stage2.yaml\n```\n\nAfter that, run the following command:\n\n```\npython -m train --cfg configs\u002Fconfig_h3d_stage2.yaml --nodebug\n```\n\n### 2.3. Ready to instruct-tuning MotionGPT model\n\nPlease update the parameters in `configs\u002Fconfig_h3d_stage3.yaml`, e.g. `NAME`,`DEBUG`,`PRETRAINED` (change to your `latest ckpt model path` in previous step)\n\nThen, run the following command:\n\n```\npython -m train --cfg configs\u002Fconfig_h3d_stage3.yaml --nodebug\n```\n\n### 3. 
Evaluate the model\n\nFirst set `TEST.CHECKPOINT` in `configs\u002Fconfig_h3d_stage3.yaml` to the path of the trained model checkpoint.\n\nThen, run the following command:\n\n```\npython -m test --cfg configs\u002Fconfig_h3d_stage3.yaml --task t2m\n```\n\nSome parameters:\n\n- `--task`: evaluation task, one of t2m (Text-to-Motion), m2t (Motion translation), pred (Motion prediction), inbetween (Motion in-between)\n\nDue to a Python package conflict, the released implementation of the linguistic metrics for the motion translation task uses [nlg-metricverse](https:\u002F\u002Fgithub.com\u002Fdisi-unibo-nlp\u002Fnlg-metricverse), whose results may not be consistent with those computed by [nlg-eval](https:\u002F\u002Fgithub.com\u002FMaluuba\u002Fnlg-eval). We will fix this in the future.\n\n\u003C\u002Fdetails>\n\n## 👀 Visualization\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>Render SMPL\u003C\u002Fb>\u003C\u002Fsummary>\n\n### 1. Set up blender - WIP\n\nRefer to [TEMOS-Rendering motions](https:\u002F\u002Fgithub.com\u002FMathux\u002FTEMOS) for blender setup, then install the following dependencies.\n\n```\nYOUR_BLENDER_PYTHON_PATH\u002Fpython -m pip install -r prepare\u002Frequirements_render.txt\n```\n\n### 2. (Optional) Render rigged cylinders\n\nRun the following command using blender:\n\n```\nYOUR_BLENDER_PATH\u002Fblender --background --python render.py -- --cfg=.\u002Fconfigs\u002Frender.yaml --dir=YOUR_NPY_FOLDER --mode=video\n```\n\n### 2. Create SMPL meshes with:\n\n```\npython -m fit --dir YOUR_NPY_FOLDER --save_folder TEMP_PLY_FOLDER --cuda\n```\n\nThis outputs:\n\n- `mesh npy file`: the generated SMPL vertices with the shape of (nframe, 6893, 3)\n- `ply files`: the ply mesh files for blender or meshlab\n\n### 3. 
Render SMPL meshes\n\nRun the following command to render SMPL using blender:\n\n```\nYOUR_BLENDER_PATH\u002Fblender --background --python render.py -- --cfg=.\u002Fconfigs\u002Frender.yaml --dir=YOUR_NPY_FOLDER --mode=video\n```\n\noptional parameters:\n\n- `--mode=video`: render mp4 video\n- `--mode=sequence`: render the whole motion in a png image.\n\u003C\u002Fdetails>\n\n## ⚠️ FAQ\n\n\u003Cdetails> \u003Csummary>\u003Cb>Question-and-Answer\u003C\u002Fb>\u003C\u002Fsummary>\n    \n### The purpose and ability of MotionGPT\n\u003Cdetails>\n    \u003Csummary>The motivation of MotionGPT.\u003C\u002Fsummary>\n\n**Answer:** We present MotionGPT **to address various human motion-related tasks within one single unified model**, by unifying motion modeling with language through a shared vocabulary. To train this unified model, we propose **an instructional training scheme under the protocols for multiple motion-language**, which further reveals the potential of Large Language Models (LLMs) in motion tasks beyond the success of language generation. However, it is non-trivial for this combination since it needs to model and generate two distinct modes from scratch. 
Unlike previous work that leverages CLIP to extract text embeddings as motion generation conditions, such as T2M-GPT, MotionGPT introduces **the motion-language pre-training on LLM** so it can leverage the strong language generation and zero-shot transfer abilities of pre-trained language models, as well as generate human language and motion in a unified model.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary>Instruction tuning and zero-shot learning.\u003C\u002Fsummary>\n\u003Cimg width=\"853\" alt=\"figure12\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMotionLab_MotionGPT_readme_590081ad6938.png\">\n\n**Answer:** We propose instruction tuning to **train a single MotionGPT across all motion-related tasks**, while task-specific tuning trains and evaluates MotionGPTs on a single task. We employ these two training schemes to study the ability of MotionGPT across multiple tasks. As shown in this figure, we provide **zero-shot cases**. Benefitting from strong language models, MotionGPTs can understand words unseen in the text-to-motion training set, like \"**scuttling**\" and \"**barriers**\", and generate correct motions based on the meaning of sentences. However, they still struggle to generate **unseen motions**, like gymnastics, even if MotionGPTs understand the text inputs.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary> In view of the recent success of LLMs, MotionGPT should pay attention to unifying currently available datasets to exploit the scalable potential of language models when processing large-scale data besides increasing model size.\u003C\u002Fsummary>\n\n**Answer:** We have faced this **limited dataset issue** while implementing MotionGPT and in our further research. Unifying and collecting a larger motion dataset is hard but valuable work. 
Fortunately, some researchers are working on this problem, as seen in recent work like [Motion-X](https:\u002F\u002Fmotion-x-dataset.github.io\u002F) and other datasets, which hold promise for advancing large-scale motion models. We intend to further evaluate MotionGPT on these larger datasets once they become available.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary>How well MotionGPT learns the relationship between motion and language?\u003C\u002Fsummary>\n\u003Cimg width=\"300\" alt=\"figure10\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMotionLab_MotionGPT_readme_087cf283bb34.png\">\u003Cimg width=\"600\" alt=\"figure12\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMotionLab_MotionGPT_readme_590081ad6938.png\">\n\n**Answer:** **Unlike** the previous motion generators using the **text encoder of CLIP** for conditions, please note that MotionGPTs leverage language models to learn the motion-language relationship, instead of relying on text features from CLIP. According to our zero-shot results (cf. **Fig. 12**) and performances on multi-tasks (cf. **Fig. 10**), MotionGPTs establish robust connections between simple\u002Fcomplex texts and simple motions in evaluations, but they fall short when it comes to complex-text to **complex motion translation**.\n\n\u003C\u002Fdetails>\n\n### More technical details\n\n\u003Cdetails>\n    \u003Csummary>Why choose T5, an encoder-decoder architecture, as the base model? How about a decoder-only model, like LLaMA?\u003C\u002Fsummary>\n\u003Cimg width=\"866\" alt=\"table15\" src=\"https:\u002F\u002Fgithub.com\u002FOpenMotionLab\u002FMotionGPT\u002Fassets\u002F120085716\u002F8f58ee1e-6a10-4b5c-9939-f79ba2ecccae\">\n\n**Answer:** The **first language model that we used** to build MotionGPTs is **LLaMA-13B**. However, it shows insufficient performance and low training efficiency. 
We assume the reason is the limited dataset size compared to the large parameter count and language data of LLaMA. We tried a smaller decoder-only backbone, **GPT2-Medium**, and provide the results in **Tab. 15**. We then chose **T5-770M**, a small but common language model, as our final backbone, because many previous vision-language multimodal works, like **Unified-IO** and **BLIP**, have chosen the T5 encoder-decoder architecture, which has shown strong ability on multi-modal tasks. In addition, a decoder-only model has an advantage in self-supervised training without paired data, but since we have paired data, this advantage is greatly weakened. We are still working on collecting a large motion dataset for larger motion-language models.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary>How to merge the text vocab and motion vocab in detail? concatenating them together?\u003C\u002Fsummary>\n\n**Answer:** To ensure **a shared distribution between language and motion**, we initialize the motion tokens separately and concatenate them alongside the language tokens. This step ensures a balanced representation that encompasses both modalities. In addition, the token embeddings are actively trained throughout **stages 2 and 3**, ensuring a comprehensive fusion of language and motion knowledge.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary>For tuning on each task, tune the entire model or just part of it?\u003C\u002Fsummary>\n\n**Answer:** To address individual tasks, we adopt a focused approach where the entire model is fine-tuned. Our rationale is that, for each specific task, our emphasis is on optimizing task-specific performance, without retaining an excessive amount of the knowledge learned from other tasks. 
Besides, we only exclusively fine-tune the text-to-motion task, while other tasks are reported without specific tuning.\n\n\u003C\u002Fdetails>\n\n### More experimental details\n\n\u003Cdetails>\n    \u003Csummary>Can MotionGPT perform motion editing or motion composition similar to MotionDiffuse and MDM?\u003C\u002Fsummary>\n\n| Method               | FID $\\downarrow$ | DIV $\\rightarrow$ | ADE $\\downarrow$ | FDE $\\downarrow$ |\n| :------------------- | :--------------- | :---------------- | :--------------- | :--------------- |\n| Real                 | 0.002            | 9.503             | -                | -                |\n| MDM                  | 6.031            | 7.813             | 5.446            | 8.561            |\n| T2M-GPT              | 2.056            | 8.635             | 6.161            | 8.302            |\n| **MotionGPT (Ours)** | **0.905**        | **8.972**         | **4.745**        | **6.040**        |\n\n**Comparison of motion prediction on HumanML3D dataset using motion data only.**\n\n**Answer:** Referring to MDM, motion editing has two categories: **body part editing** and **motion completion** in the temporal domain. MotionGPT is capable of the latter, which includes **motion prediction** and **motion in-between**. It outperforms both **MDM** and **T2M-GPT** in the table above. However, when it comes to body part editing, the vector quantization(VQ)-based methods, like MotionGPT and T2M-GPT, are not as suitable as diffusion-based models that utilize diffusion inpainting on raw motion data. 
Editing body parts with LLM and prompts is a promising direction but still needs exploration.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary>How to implement the MDM on the motion prediction and in-between tasks?\u003C\u002Fsummary>\n\n**Answer:** Please follow the approach outlined in **Appendix B.4** and **Line-296** of our paper, where we highlight that MDM achieves the motion in-between task using a masked motion \"in-painting\" technique. Specifically, this involves fixing the initial and final portions of the motion and allowing the model to generate the central portion. To adapt this concept for motion prediction, we similarly fix a portion of the motion – in our case, **the first 20%** – and generate the subsequent sequence.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary> Motion down-sample, if only given a start frame and an end frame as the in-between input, would the model perform well?\u003C\u002Fsummary>\n\n**Answer:** VQ-based methods, such as MotionGPT and T2M-GPT, employ downsampling tricks to enhance the density of the codebook or tokens and reduce computing costs. This indeed becomes a constraint when the operation granularity is smaller than the down-sample rate. However, when only the start and end frames are provided as in-between inputs, some technical tricks can address this issue, such as repeating the single start or end frame up to the window size as input and removing the redundant parts from the output. This does not significantly impact the effectiveness of the model, as there are often static beginnings or endings in the ground truth (GT) motion data.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary>How is the down-sample rate chosen? 
It is a fundamental hyper-parameter that decides the overall granularity of the model.\u003C\u002Fsummary>\n    \n| Downsampling | MPJPE $\\downarrow$ | MPJPE $\\downarrow$ | ACCL $\\downarrow$ | FID $\\downarrow$ | DIV $\\rightarrow$ |\n| ------------ | ------------------ | ------------------ | ----------------- | ---------------- | ----------------- |\n| $l=1$        | 76.2               | 49.5               | 19.5              | 0.421            | 9.613             |\n| $l=2$        | **52.6**           | **37.7**           | **9.5**           | 0.135            | 9.722             |\n| $l=4$        | 55.8               | 40.1               | 7.5               | **0.067**        | 9.675             |\n| $l=8$        | 62.7               | 45.3               | 8.7               | 0.223            | **9.584**         |\n\n**Answer:** We selected the down-sample rate based on the frames-per-second (FPS) of the HumanML3D and KIT-ML datasets, which is **20 fps**. Down-sampling by a factor of 4, to **5 fps**, therefore ensures distinctiveness between motion frames, prevents redundancy, and accelerates training. This choice was also made to ensure a fair comparison, as we used the same down-sample rate as T2M-GPT. The table above provides an ablation study on this parameter, where a factor of 4 achieves the best Frechet Inception Distance (FID) in motion reconstruction.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary> Failure analysis. Zero-shot ability to handle words that have semantic meaning but could be unseen.\u003C\u002Fsummary>\n\u003Cimg width=\"853\" alt=\"figure12\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMotionLab_MotionGPT_readme_590081ad6938.png\">\n\n**Answer:** As shown in **Fig. 12**, we provide both **zero-shot cases** and **failure cases**. 
Benefitting from strong language models, MotionGPTs can understand words unseen in the text-to-motion training set, like \"**scuttling**\" and \"**barriers**\", and generate correct motions based on the meaning of sentences. However, they still struggle to generate unseen motions, like gymnastics, even if MotionGPTs understand the text inputs.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary> Do TM2T, T2M, and poseGPT capture all human motion in their training dataset's discrete latent code?\u003C\u002Fsummary>\n\n| Method           | MPJPE$\\downarrow$ | MPJPE $\\downarrow$ | ACCL $\\downarrow$ | FID $\\downarrow$ | DIV $\\rightarrow$ |\n| ---------------- | ----------------- | ------------------ | ----------------- | ---------------- | ----------------- |\n| VPoser-t         | 75.6              | 48.6               | 9.3               | 1.430            | 8.336             |\n| ACTOR            | 65.3              | 41.0               | **7.0**           | 0.341            | **9.569**         |\n| MLD-1            | **54.4**          | 41.6               | 8.3               | 0.247            | 9.630             |\n| MotionGPT (Ours) | 55.8              | **40.1**           | 7.5               | **0.067**        | 9.675             |\n\n**Motion reconstruction comparison.**\n\n| Method           | FID $\\downarrow$               |\n| ---------------- | ------------------------------ |\n| MotionGPT (Ours) | $0.510^{\\pm.016}$              |\n| T2M-GPT          | $0.514^{\\pm.029}$              |\n| MLD              | $\\boldsymbol{0.404}^{\\pm.027}$ |\n\n**Comparison of FID in text-to-motion task on KIT-ML dataset.**\n\n**Answer:** Given sufficient training or testing data from the same dataset, motion reconstruction is not a challenging task for either VAE or VQ-VAE. We have provided the evaluation on motion reconstruction in **Tab. 8**. 
However, when dealing with a **limited amount of motion data**, like the KIT dataset, **the VAE model shows better ability in motion interpolation, surpassing VQ-VAE**.\nA relevant evaluation is shown above (also in **Tab. 7**), where MLD (VAE) outperforms MotionGPT and T2M-GPT (VQ-VAEs) on FID.\nThe real challenge lies in reconstructing complex motions, such as diving or gymnastics. Existing motion generators struggle to accurately reconstruct **complex motions** using a codebook extracted from daily motion datasets. Collecting these complex yet valuable motions is still a significant challenge for the motion research community.\n\n\u003C\u002Fdetails>\n\n### About performances\n\n\u003Cdetails>\n    \u003Csummary> Motion quality and performance gain.\u003C\u002Fsummary>\n\n| Method    | FID $\\downarrow$               |\n| :-------- | :----------------------------- |\n| MDM       | $0.544^{\\pm.044}$              |\n| MotionGPT | $0.160^{\\pm.008}$              |\n| T2M-GPT   | $\\boldsymbol{0.116}^{\\pm.004}$ |\n\n**Comparison of FID in text-to-motion task on HumanML3D dataset.**\n\n| Method    | FID $\\downarrow$               |\n| :-------- | :----------------------------- |\n| T2M-GPT   | $0.514^{\\pm.029}$              |\n| MotionGPT | $0.510^{\\pm.016}$              |\n| MDM       | $\\boldsymbol{0.497}^{\\pm.021}$ |\n\n**Comparison of FID in text-to-motion task on KIT-ML dataset.**\n\n**Answer:** The FID metric primarily measures motion quality rather than the correlation between motion and text. While MDM serves as a successful benchmark for motion generation, both MotionGPT and T2M-GPT outperform MDM by a margin of 0.38~0.43 in FID on HumanML3D. **However**, **the difference in motion quality among these three works is not significant in the supplementary videos**. Additionally, MDM outperforms the two vector-quantized methods, MotionGPT and T2M-GPT, in terms of FID on the KIT dataset. 
This can be attributed to the dataset's small size of only 3,911 motion sequences, which makes it **challenging to construct a comprehensive motion codebook**. More importantly, MotionGPT contributes to multiple motion tasks with an LLM, particularly in generating both text and motion within a single model, rather than aiming to improve the FID metric.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary>Limited performance gain with strong language models.\u003C\u002Fsummary>\n\n**Answer:** We expected that MotionGPT, using a **significantly larger language model**, would surpass all existing methods in all tasks. **However**, the evaluation shows MotionGPT achieves SOTA results in 18 out of 23 metrics, and many of the improvements are small. This can be attributed to the limited size of the dataset. Both **HumanML3D (14,616 motions) and KIT (3,911 motions)** are **limited** in vocabulary size and overall dataset size, particularly when compared to billion-level language datasets, which affects the efficacy of large-scale models. With recent dataset efforts such as [Motion-X](https:\u002F\u002Fmotion-x-dataset.github.io\u002F), we will evaluate the performance gain of MotionGPT on larger datasets once they become available.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary> Performance Gain on R-Precision in KIT.\u003C\u002Fsummary>\n\n**Answer:** The evaluation of R-Precision in the KIT dataset relies on the text encoder, which is built using a limited set of 6,353 textual descriptions. In contrast, MotionGPTs benefit from an LLM and large language data, enabling them to **generate longer and more natural language descriptions** for motion. 
However, this leads to **a discrepancy between the generated descriptions and the ground-truth (GT) descriptions**, resulting in a lower R-Precision.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary> MotionGPT seems to sacrifice accuracy in exchange for additional functionalities.\u003C\u002Fsummary> \n\u003Cimg width=\"447\" alt=\"figure10\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMotionLab_MotionGPT_readme_087cf283bb34.png\">\n\n**Answer:** As shown in **Fig. 10**, MotionGPT achieves SOTA on **18 out of 23** metrics across four motion-related tasks. Additionally, both HumanML3D and KIT are limited in overall dataset size, particularly when compared to billion-level language datasets. This affects the efficacy of large-scale models. We plan to evaluate MotionGPT on a larger motion-text dataset. In addition, MotionGPT introduces motion-language pre-training and the zero-shot ability that comes with it, a promising direction that is worth exploring and could stimulate self-training procedures in future research.\n\n\u003C\u002Fdetails>\n\n### About illustrations\n\n\u003Cdetails>\n    \u003Csummary>Visualize some of the tokens in the vocabulary that VQ-VAE learned.\u003C\u002Fsummary>\n\u003Cimg width=\"857\" alt=\"figure13\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMotionLab_MotionGPT_readme_b35d9ac48661.png\">\n\n**Answer:** As shown in **Fig. 13**, we visualize these **motion tokens** in the **motion vocabulary $V_m$** and their corresponding localized spatial-temporal contexts, depicted within **4-frame motion segments**. 
However, MotionGPT falls short in generating descriptions for each individual token, as the training is conducted on token sequences.\n\nYou can run the script below to visualize more tokens:\n\n```\npython -m scripts.get_code_visual --cfg configs\u002Fconfig_h3d_stage2.yaml\n```\n\n\u003C\u002Fdetails>\n\u003C\u002Fdetails>\n\n## 📖 Citation\n\nIf you find our code or paper helpful, please consider citing:\n\n```bibtex\n@article{jiang2024motiongpt,\n  title={Motiongpt: Human motion as a foreign language},\n  author={Jiang, Biao and Chen, Xin and Liu, Wen and Yu, Jingyi and Yu, Gang and Chen, Tao},\n  journal={Advances in Neural Information Processing Systems},\n  volume={36},\n  year={2024}\n}\n\n@inproceedings{chen2023executing,\n  title={Executing your Commands via Motion Diffusion in Latent Space},\n  author={Chen, Xin and Jiang, Biao and Liu, Wen and Huang, Zilong and Fu, Bin and Chen, Tao and Yu, Gang},\n  booktitle={Proceedings of the IEEE\u002FCVF Conference on Computer Vision and Pattern Recognition},\n  pages={18000--18010},\n  year={2023}\n}\n```\n\n## Acknowledgments\n\nThanks to [Motion-latent-diffusion](https:\u002F\u002Fgithub.com\u002FChenFengYe\u002Fmotion-latent-diffusion), [T2m-gpt](https:\u002F\u002Fgithub.com\u002FMael-zys\u002FT2M-GPT), [TEMOS](https:\u002F\u002Fgithub.com\u002FMathux\u002FTEMOS), [ACTOR](https:\u002F\u002Fgithub.com\u002FMathux\u002FACTOR), [HumanML3D](https:\u002F\u002Fgithub.com\u002FEricGuo5513\u002FHumanML3D) and [joints2smpl](https:\u002F\u002Fgithub.com\u002Fwangsen1312\u002Fjoints2smpl), from which our code partially borrows.\n\n## License\n\nThis code is distributed under an [MIT LICENSE](LICENSE).\n\nNote that our code depends on other libraries, including SMPL, SMPL-X, and PyTorch3D, and uses datasets, each of which has its own license that must also be followed.\n","\u003Cdiv align= \"center\">\n    \u003Ch1> MotionGPT 官方仓库 \u003Cimg 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMotionLab_MotionGPT_readme_aeddd3544f46.jpg\" width=\"35px\">\u003C\u002Fh1>\n\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n    \u003Ch2> \u003Ca href=\"https:\u002F\u002Fmotion-gpt.github.io\u002F\">MotionGPT：将人类运动视为一门外语\u003C\u002Fa>\u003C\u002Fh2>\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fmotion-gpt.github.io\u002F\">项目页面\u003C\u002Fa> •\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.14795\">ArXiv 论文\u003C\u002Fa> •\n  \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenMotionLab\u002FMotionGPT\">HuggingFace 演示\u003C\u002Fa> •\n  \u003Ca href=\"#️-faq\">常见问题解答\u003C\u002Fa> •\n  \u003Ca href=\"#-citation\">引用\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\n\u003C!-- \u003Cimg src=\"https:\u002F\u002Fcdn.discordapp.com\u002Fattachments\u002F941582479117127680\u002F1111543600879259749\u002F20230526075532.png\" width=\"350px\"> -->\n\n|                                                   预告视频                                                   |                                                    演示视频                                                    |\n| :--------------------------------------------------------------------------------------------------------------: | :--------------------------------------------------------------------------------------------------------------: |\n| \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002FOpenMotionLab\u002FMotionGPT\u002Fassets\u002F120085716\u002Fa741e162-b2f4-4f65-af8e-aa19c4115a9e\" \u002F> | \u003Cvideo src=\"https:\u002F\u002Fgithub.com\u002FOpenMotionLab\u002FMotionGPT\u002Fassets\u002F120085716\u002Fae966d17-6326-43e6-8d5b-8562cf3ffd52\" \u002F> |\n\n\u003C\u002Fdiv>\n\n\u003C!-- ### [MotionGPT：将人类运动视为一门外语](https:\u002F\u002Fmotion-gpt.github.io\u002F) -->\n\u003C!-- ### [项目页面](https:\u002F\u002Fmotion-gpt.github.io\u002F) | [ArXiv 
论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.14795) | [HuggingFace 演示](xxx) -->\n\n## 🏃 简介 MotionGPT\n\nMotionGPT 是一个**统一**且**用户友好**的运动-语言模型，用于学习两种模态之间的语义关联，并在**多种运动任务**上生成高质量的运动序列及文本描述。\n\n\u003Cdetails>\n    \u003Csummary>\u003Cb>技术细节\u003C\u002Fb>\u003C\u002Fsummary>\n\n尽管预训练大型语言模型取得了显著进展，但构建能够处理语言与其他多模态数据（如运动）的统一模型，至今仍面临挑战且鲜有探索。幸运的是，人类运动表现出与人类语言相似的语义耦合性，常被视为一种肢体语言。通过将语言数据与大规模运动模型相结合，实现运动-语言预训练以提升运动相关任务的性能成为可能。基于这一洞察，我们提出了 MotionGPT，一个统一、多功能且用户友好的运动-语言模型，可应对多种与运动相关的任务。具体而言，我们采用离散向量量化技术对人类运动进行编码，将其转换为运动标记，类似于单词标记的生成过程。在此“运动词汇表”的基础上，我们以统一的方式对运动和文本进行语言建模，将人类运动视为一种特定的语言。此外，受提示学习启发，我们使用混合的运动-语言数据对 MotionGPT 进行预训练，并在基于提示的问答任务上进行微调。大量实验表明，MotionGPT 在多项运动任务上均达到了最先进的性能，包括文本驱动的运动生成、运动字幕生成、运动预测以及运动插值等。\n\n\u003Cimg width=\"1194\" alt=\"pipeline\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMotionLab_MotionGPT_readme_f0c43573ffaf.png\">\n\u003C\u002Fdetails>\n\n## 🚩 最新消息\n\n- [2025\u002F06\u002F30] 发布 🔥\u003Ca href=\"https:\u002F\u002Fmotiongpt3.github.io\u002F\"> MotionGPT3 \u003C\u002Fa>🔥 **一个基于 MoT 架构的双模态运动-语言框架。**\n- [2023\u002F09\u002F22] MotionGPT 被 NeurIPS 2023 接收\n- [2023\u002F09\u002F11] 发布 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FOpenMotionLab\u002FMotionGPT\">HuggingFace 演示\u003C\u002Fa> 🔥🔥🔥\n- [2023\u002F09\u002F09] 开始训练 MotionGPT V1.0 🔥🔥🔥\n- [2023\u002F06\u002F20] 上传论文并启动项目\n\n## ⚡ 快速入门\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>环境搭建与下载\u003C\u002Fb>\u003C\u002Fsummary>\n\n### 1. Conda 环境\n\n```\nconda create python=3.10 --name mgpt\nconda activate mgpt\n```\n\n安装 `requirements.txt` 中的依赖包，并安装 [PyTorch 2.0](https:\u002F\u002Fpytorch.org\u002F)\n\n```\npip install -r requirements.txt\npython -m spacy download en_core_web_sm\n```\n\n我们已在 Python 3.10.6 和 PyTorch 2.0.0 上测试过代码。\n\n### 2. 依赖项\n\n运行脚本下载所需材料：\n\n```\nbash prepare\u002Fdownload_smpl_model.sh\nbash prepare\u002Fprepare_t5.sh\n```\n\n用于文本到运动评估：\n\n```\nbash prepare\u002Fdownload_t2m_evaluators.sh\n```\n\n### 3. 
预训练模型\n\n运行脚本下载预训练模型：\n\n```\nbash prepare\u002Fdownload_pretrained_models.sh\n```\n\n### 4. （可选）手动下载\n\n访问 [Google Drive](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F10s5HXSFqd6UTOkW2OMNc27KGmMLkVc2L)，下载之前的依赖文件。\n\n访问 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FOpenMotionLab)，下载预训练模型。\n\n\u003C\u002Fdetails>\n\n## ▶️ 演示\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>WebUI\u003C\u002Fb>\u003C\u002Fsummary>\n\n运行以下脚本启动 WebUI，然后访问 [0.0.0.0:8888](http:\u002F\u002F0.0.0.0:8888)\n\n```\npython app.py\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>批量演示\u003C\u002Fb>\u003C\u002Fsummary>\n\n我们支持文本文件输入，输出的运动数据为 npy 文件，文本输出为 txt 文件。请检查 `configs\u002Fassets.yaml` 中的路径配置，其中 TEST.FOLDER 为输出文件夹。\n\n然后，运行以下脚本：\n\n```\npython demo.py --cfg .\u002Fconfigs\u002Fconfig_h3d_stage3.yaml --example .\u002Fdemos\u002Ft2m.txt\n```\n\n一些参数：\n\n- `--example=.\u002Fdemo\u002Ft2m.txt`：输入文件为文本提示\n- `--task=t2m`：评估任务包括 t2m、m2t、预测和插值\n\n输出结果：\n\n- `npy 文件`：生成的运动序列，形状为 (nframe, 22, 3)\n- `txt 文件`：输入的文本提示或生成的文本输出\n\u003C\u002Fdetails>\n\n## 💻 训练您自己的模型\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>训练指南\u003C\u002Fb>\u003C\u002Fsummary>\n\n### 1. 准备数据集\n\n1. 请参考 [HumanML3D](https:\u002F\u002Fgithub.com\u002FEricGuo5513\u002FHumanML3D) 了解文本到运动数据集的设置方法。\n\n2. 将说明文档放入 `prepare\u002Finstructions` 目录中，与 HumanML3D 数据集放在同一文件夹内。\n\n### 2.1. 准备训练运动标记化模型\n\n首先，请检查 `configs\u002Fconfig_h3d_stage1.yaml` 中的参数，例如 `NAME` 和 `DEBUG`。\n\n然后，运行以下命令：\n\n```\npython -m train --cfg configs\u002Fconfig_h3d_stage1.yaml --nodebug\n```\n\n### 2.2. 准备预训练 MotionGPT 模型\n\n请更新 `configs\u002Fconfig_h3d_stage2.yaml` 中的参数，例如 `NAME`、`DEBUG` 和 `PRETRAINED_VAE`（替换为您在上一步中获得的最新检查点路径）。\n\n随后，运行以下命令以存储训练集中所有运动标记，方便后续使用：\n\n```\npython -m scripts.get_motion_code --cfg configs\u002Fconfig_h3d_stage2.yaml\n```\n\n之后，再运行以下命令：\n\n```\npython -m train --cfg configs\u002Fconfig_h3d_stage2.yaml --nodebug\n```\n\n### 2.3. 
准备指导微调 MotionGPT 模型\n\n请更新 `configs\u002Fconfig_h3d_stage3.yaml` 中的参数，例如 `NAME`、`DEBUG`、`PRETRAINED`（将其更改为上一步中您的 `最新检查点模型路径`）。\n\n然后运行以下命令：\n\n```\npython -m train --cfg configs\u002Fconfig_h3d_stage3.yaml --nodebug\n```\n\n### 3. 评估模型\n\n请先将训练好的模型检查点路径放入 `configs\u002Fconfig_h3d_stage3.yaml` 中的 `TEST.CHECKPOINT`。\n\n然后运行以下命令：\n\n```\npython -m test --cfg configs\u002Fconfig_h3d_stage3.yaml --task t2m\n```\n\n一些参数说明：\n\n- `--task`：评估任务包括 t2m（文本到运动）、m2t（运动翻译）、pred（运动预测）、inbetween（运动插值）。\n\n由于 Python 包冲突的问题，当前发布的运动翻译任务中的语言学指标实现来自 [nlg-metricverse](https:\u002F\u002Fgithub.com\u002Fdisi-unibo-nlp\u002Fnlg-metricverse)，这可能与 [nlg-eval](https:\u002F\u002Fgithub.com\u002FMaluuba\u002Fnlg-eval) 实现的结果不一致。我们将在未来修复这一问题。\n\n\u003C\u002Fdetails>\n\n## 👀 可视化\n\n\u003Cdetails>\n  \u003Csummary>\u003Cb>渲染 SMPL 模型\u003C\u002Fb>\u003C\u002Fsummary>\n\n### 1. 设置 Blender - 进行中\n\n请参考 [TEMOS-Rendering motions](https:\u002F\u002Fgithub.com\u002FMathux\u002FTEMOS) 获取 Blender 的设置方法，然后安装以下依赖项。\n\n```\nYOUR_BLENDER_PYTHON_PATH\u002Fpython -m pip install -r prepare\u002Frequirements_render.txt\n```\n\n### 2. （可选）渲染绑定的圆柱体\n\n使用 Blender 运行以下命令：\n\n```\nYOUR_BLENDER_PATH\u002Fblender --background --python render.py -- --cfg=.\u002Fconfigs\u002Frender.yaml --dir=YOUR_NPY_FOLDER --mode=video\n```\n\n### 2. 使用以下命令创建 SMPL 网格：\n\n```\npython -m fit --dir YOUR_NPY_FOLDER --save_folder TEMP_PLY_FOLDER --cuda\n```\n\n这将输出：\n\n- `mesh npy 文件`：生成的 SMPL 顶点数据，形状为 (nframe, 6893, 3)\n- `ply 文件`：用于 Blender 或 MeshLab 的网格文件\n\n### 3. 
渲染 SMPL 网格\n\n运行以下命令以使用 Blender 渲染 SMPL：\n\n```\nYOUR_BLENDER_PATH\u002Fblender --background --python render.py -- --cfg=.\u002Fconfigs\u002Frender.yaml --dir=YOUR_NPY_FOLDER --mode=video\n```\n\n可选参数：\n\n- `--mode=video`：渲染 mp4 视频\n- `--mode=sequence`：将整个动作序列渲染成一张 PNG 图像。\n\u003C\u002Fdetails>\n\n## ⚠️ 常见问题解答\n\n\u003Cdetails> \u003Csummary>\u003Cb>问答\u003C\u002Fb>\u003C\u002Fsummary>\n    \n### MotionGPT 的目的与能力\n\u003Cdetails>\n    \u003Csummary>MotionGPT 的研究动机。\u003C\u002Fsummary>\n\n**答：** 我们提出 MotionGPT **旨在通过统一的语言词汇表，将运动建模与语言结合，从而在一个统一的模型中解决各种人类运动相关任务**。为了训练这一统一模型，我们提出了 **基于多模态运动-语言协议的指令式训练方案**，进一步揭示了大型语言模型在运动任务中的潜力，而不仅仅局限于语言生成领域。然而，这种结合并非易事，因为它需要从头开始对两种截然不同的模态进行建模和生成。与之前利用 CLIP 提取文本嵌入作为运动生成条件的工作不同，例如 T2M-GPT，MotionGPT 引入了 **基于 LLM 的运动-语言预训练**，这样它就可以充分利用预训练语言模型强大的语言生成能力和零样本迁移能力，并在统一的模型中同时生成人类语言和运动。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary>指令式微调与零样本学习。\u003C\u002Fsummary>\n\u003Cimg width=\"853\" alt=\"figure12\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMotionLab_MotionGPT_readme_590081ad6938.png\">\n\n**答：** 我们提出指令式微调是为了 **让单一的 MotionGPT 模型能够跨所有运动相关任务进行训练**，而针对特定任务的微调则是分别在单个任务上训练和评估 MotionGPT。我们采用这两种训练方式来研究 MotionGPT 在多任务上的表现。如图所示，我们提供了 **零样本案例**。得益于强大的语言模型，MotionGPT 能够理解文本到运动训练集中未见过的词汇，例如“scuttling”和“barriers”，并根据句子含义生成正确的动作。然而，即使 MotionGPT 能够理解输入文本，它仍然难以生成 **未见过的动作**，比如体操动作。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary>鉴于近期 LLM 的成功，MotionGPT 应当关注如何整合现有的数据集，以便在处理大规模数据时充分发挥语言模型的可扩展性，而不仅仅是单纯地增加模型规模。\u003C\u002Fsummary>\n\n**答：** 在实施 MotionGPT 以及后续研究过程中，我们确实遇到了 **数据集有限的问题**。整合并收集更大规模的运动数据集是一项艰巨但极具价值的工作。幸运的是，目前已有研究人员致力于解决这一问题，例如最近推出的 [Motion-X](https:\u002F\u002Fmotion-x-dataset.github.io\u002F) 等数据集，这些数据集有望推动大规模运动模型的发展。一旦这些更大的数据集可用，我们将进一步在这些数据集中评估 MotionGPT。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary>MotionGPT 对运动与语言之间关系的学习效果如何？\u003C\u002Fsummary>\n\u003Cimg width=\"300\" alt=\"figure10\" 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMotionLab_MotionGPT_readme_087cf283bb34.png\">\u003Cimg width=\"600\" alt=\"figure12\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMotionLab_MotionGPT_readme_590081ad6938.png\">\n\n**答：** **不同于**以往那些使用 **CLIP 的文本编码器**作为条件的运动生成模型，需要注意的是，MotionGPT 是通过语言模型来学习运动与语言之间的关系，而不是依赖 CLIP 提供的文本特征。根据我们的零样本实验结果（参见 **图 12**）以及在多任务上的表现（参见 **图 10**），MotionGPT 在简单或复杂文本与简单动作之间的关联上建立了较为稳固的联系，但在处理复杂文本到 **复杂运动翻译**的任务时仍显不足。\n\n\u003C\u002Fdetails>\n\n### 更多技术细节\n\n\u003Cdetails>\n    \u003Csummary>为什么选择T5这种编码器-解码器架构作为基础模型？为什么不选择像LLaMA这样的纯解码器模型呢？\u003C\u002Fsummary>\n\u003Cimg width=\"866\" alt=\"table15\" src=\"https:\u002F\u002Fgithub.com\u002FOpenMotionLab\u002FMotionGPT\u002Fassets\u002F120085716\u002F8f58ee1e-6a10-4b5c-9939-f79ba2ecccae\">\n\n**回答：** 我们用来构建MotionGPTs的**第一个语言模型**是**LLaMA-13B**。然而，它表现出性能不足和训练效率低的问题。我们认为原因在于，与LLaMA庞大的参数量和语言数据相比，我们的数据集规模有限。随后，我们尝试了更小规模的纯解码器骨干模型**GPT2-Medium**，结果见**表15**。因此，我们最终选择了小型但通用的语言模型**T5-770M**作为骨干模型，因为许多先前的视觉-语言多模态工作，如**Unified-IO**和**BLIP**，都选择了T5这种编码器-解码器架构。它在处理多模态任务方面表现出强大的能力。此外，纯解码器模型的优势在于无需配对数据即可进行自监督学习，而我们拥有配对数据，这一优势因此被大大削弱。目前，我们仍在努力收集大规模的动作数据，以支持更大规模的动作-语言模型。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary>文本词汇表和动作词汇表是如何详细合并的？是简单地将它们拼接在一起吗？\u003C\u002Fsummary>\n\n**回答：** 为了确保**语言和动作之间共享统一的分布**，我们分别初始化动作标记，并将其与语言标记拼接在一起。这一步骤确保了两种模态的均衡表示。此外，这些标记嵌入在整个**阶段2和阶段3**中都会被主动训练，从而确保语言和动作知识的全面融合。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary>针对每个任务的微调，是微调整个模型，还是只微调其中一部分？\u003C\u002Fsummary>\n\n**回答：** 为了应对不同的具体任务，我们采用的是对整个模型进行微调的方法。我们的理由在于，对于每个特定的任务，我们的重点是优化该任务的性能，而不保留从其他任务中学到的过多知识。另外，我们仅对文本到动作的任务进行了专门的微调，而其他任务则未进行专门的微调。\n\n\u003C\u002Fdetails>\n\n### 更多实验细节\n\n\u003Cdetails>\n    \u003Csummary>MotionGPT能否像MotionDiffuse和MDM一样进行动作编辑或动作组合？\u003C\u002Fsummary>\n\n| 方法               | FID $\\downarrow$ | DIV $\\rightarrow$ | ADE $\\downarrow$ | FDE $\\downarrow$ |\n| :------------------- | :--------------- | :---------------- | :--------------- 
| :--------------- |\n| 真实                 | 0.002            | 9.503             | -                | -                |\n| MDM                  | 6.031            | 7.813             | 5.446            | 8.561            |\n| T2M-GPT              | 2.056            | 8.635             | 6.161            | 8.302            |\n| **MotionGPT (我们的)** | **0.905**        | **8.972**         | **4.745**        | **6.040**        |\n\n**仅使用动作数据在HumanML3D数据集上进行动作预测的比较。**\n\n**回答：** 参照MDM，动作编辑可分为两类：**身体部位编辑**和时间域上的**动作补全**。MotionGPT能够完成后者，包括**动作预测**和**动作插值**。在上表中，它在这些任务上均优于**MDM**和**T2M-GPT**。然而，在身体部位编辑方面，基于向量量化（VQ）的方法，如MotionGPT和T2M-GPT，不如那些利用扩散去噪技术直接作用于原始动作数据的扩散模型。通过大语言模型和提示词来编辑身体部位是一个很有前景的方向，但仍需进一步探索。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary>如何在动作预测和插值任务上实现MDM的技术方案？\u003C\u002Fsummary>\n\n**回答：** 请参考我们论文中的**附录B.4**和**第296行**，其中我们指出，MDM通过一种掩码式“修复”技术来实现动作插值任务。具体来说，就是固定动作的起始和结束部分，让模型生成中间部分。为了将这一概念应用于动作预测，我们同样固定动作的一部分——在我们的例子中是**前20%**——并生成后续序列。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary>如果仅提供起始帧和结束帧作为插值输入，动作下采样是否会影响模型的表现？\u003C\u002Fsummary>\n\n**回答：** 基于VQ的方法，如MotionGPT和T2M-GPT，会采用下采样的技巧来提高代码本或标记的密度，并降低计算成本。当操作的粒度小于下采样率时，这确实会成为一个限制。不过，为了解决这个问题，我们可以只提供起始和结束帧作为插值输入。一些技术手段可以应用，比如将单一的起始或结束帧重复到窗口大小作为输入，并在输出中去除冗余部分。这样做并不会显著影响模型的效果，因为在真实动作数据中，常常存在静止的开始或结束部分。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary>下采样率是如何选择的？这是一个决定模型整体粒度的基础超参数。\u003C\u002Fsummary>\n    \n| 下采样 | MPJPE $\\downarrow$ | MPJPE $\\downarrow$ | ACCL $\\downarrow$ | FID $\\downarrow$ | DIV $\\rightarrow$ |\n| -------- | ------------------ | ------------------ | ----------------- | ---------------- | ----------------- |\n| $l=1$    | 76.2               | 49.5               | 19.5              | 0.421            | 9.613             |\n| $l=2$    | **52.6**           | **37.7**           | **9.5**           | 0.135            | 9.722             |\n| $l=4$    | 55.8               | 40.1               | 7.5               | **0.067**        | 9.675             |\n| 
$l=8$    | 62.7               | 45.3               | 8.7               | 0.223            | **9.584**         |\n\n**回答：** 我们根据HumanML3D和KIT-ML数据集的帧率（FPS）来选择下采样率，这两个数据集的帧率均为**20 fps**。因此，采用4倍下采样达到**5 fps**，既能保证动作帧之间的清晰区分，避免冗余，又能加快训练速度。同时，这一选择也是为了确保公平比较，因为我们采用了与T2M-GPT相同的下采样率。如上表所示，我们对这些参数进行了消融实验研究，结果表明，4倍下采样在动作重建的弗雷歇初始距离（FID）指标上表现最佳。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary>失败分析。模型在零样本情况下处理具有语义意义但可能未曾见过的词语的能力。\u003C\u002Fsummary>\n\u003Cimg width=\"853\" alt=\"figure12\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMotionLab_MotionGPT_readme_590081ad6938.png\">\n\n**回答：** 如 **图12** 所示，我们同时展示了**零样本案例**和**失败案例**。得益于强大的语言模型，MotionGPT能够理解文本到运动训练集中未见过的动作描述，例如“scuttling”和“barriers”，并根据句子的语义生成正确的动作。然而，即使MotionGPT能够理解输入文本，它仍然难以生成未见过的动作，比如体操动作。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary> TM2T、T2M 和 poseGPT 是否在其训练数据集的离散潜在代码中捕捉到了所有的人体动作？\u003C\u002Fsummary>\n\n| 方法           | MPJPE$\\downarrow$ | MPJPE $\\downarrow$ | ACCL $\\downarrow$ | FID $\\downarrow$ | DIV $\\rightarrow$ |\n| ---------------- | ----------------- | ------------------ | ----------------- | ---------------- | ----------------- |\n| VPoser-t         | 75.6              | 48.6               | 9.3               | 1.430            | 8.336             |\n| ACTOR            | 65.3              | 41.0               | **7.0**           | 0.341            | **9.569**         |\n| MLD-1            | **54.4**          | 41.6               | 8.3               | 0.247            | 9.630             |\n| MotionGPT (我们的) | 55.8              | **40.1**           | 7.5               | **0.067**        | 9.675             |\n\n**运动重建比较。**\n\n| 方法           | FID $\\downarrow$               |\n| ---------------- | ------------------------------ |\n| MotionGPT (我们的) | $0.510^{\\pm.016}$              |\n| T2M-GPT          | $0.514^{\\pm.029}$              |\n| MLD              | $\\boldsymbol{0.404}^{\\pm.027}$ |\n\n**KIT-ML 数据集上文本到运动任务中的 FID 比较。**\n\n**回答：** 
在拥有足够来自同一数据集的训练或测试数据的情况下，无论是 VAE 还是 VQ-VAE，运动重建都不是一项具有挑战性的任务。我们在 **表8** 中提供了运动重建的评估结果。然而，当处理**有限数量的运动数据**时，例如 KIT 数据集，**VAE 模型在运动插值方面表现出更强的能力，优于 VQ-VAE**。\n相关的评估如上所示（也在 **表7** 中），其中 MLD（VAE）在 FID 方面优于 MotionGPT 和 T2M-GPT（VQ-VAEs）。\n真正的挑战在于重建复杂的动作，例如跳水或体操运动。现有的运动生成器很难仅使用从日常运动数据集中提取的代码本准确地重建**复杂动作**。收集这些复杂但有价值的运动仍然是运动研究领域的一个重大挑战。\n\n\u003C\u002Fdetails>\n\n\n\n### 关于性能\n\n\u003Cdetails>\n    \u003Csummary> 运动质量和性能提升。\u003C\u002Fsummary>\n\n| 方法    | FID $\\downarrow$               |\n| :-------- | :----------------------------- |\n| MDM       | $0.544^{\\pm.044}$              |\n| MotionGPT | $0.160^{\\pm.008}$              |\n| T2M-GPT   | $\\boldsymbol{0.116}^{\\pm.004}$ |\n\n**HumanML3D 数据集上文本到运动任务中的 FID 比较。**\n\n| 方法    | FID $\\downarrow$               |\n| :-------- | :----------------------------- |\n| T2M-GPT   | $0.514^{\\pm.029}$              |\n| MotionGPT | $0.510^{\\pm.016}$              |\n| MDM       | $\\boldsymbol{0.497}^{\\pm.021}$ |\n\n**KIT-ML 数据集上文本到运动任务中的 FID 比较。**\n\n**回答：** FID 指标主要关注运动质量，而非运动与文本之间的相关性。尽管 MDM 是运动生成领域的成功基准，但 MotionGPT 和 T2M-GPT 在 FID 指标上仍比 MDM 高出 0.38 至 0.43。**然而**，**这三者在视频呈现上的运动质量差异并不显著**。此外，在 KIT 数据集上，MDM 的 FID 表现优于两种向量量化方法 MotionGPT 和 T2M-GPT。这可以归因于该数据集仅有 3,911 条运动序列，使得**构建全面的运动代码本非常困难**。更重要的是，MotionGPT 通过结合 LLM 实现了多种运动任务，尤其是在单个模型中同时生成文本和运动，而不是单纯追求 FID 指标的提升。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary> 强大语言模型带来的性能提升有限。\u003C\u002Fsummary>\n\n**回答：** 我们原本认为，使用**规模显著更大的语言模型**的 MotionGPT 将在所有任务中超越现有方法。**然而**，评估结果显示，MotionGPT 在 23 项指标中取得了 18 项的 SOTA 成绩，但许多改进幅度较小。这主要是由于数据集规模有限所致。无论是 **HumanML3D（14,616 条运动）还是 KIT（3,911 条运动）**，其词汇量和整体数据集规模都相对**有限**，尤其与数十亿级别的语言数据集相比，这影响了大规模模型的效果。借助近期发布的数据集工作，例如 [Motion-X](https:\u002F\u002Fmotion-x-dataset.github.io\u002F)，一旦这些大型数据集可用，我们将进一步评估 MotionGPT 在更大规模数据集上的性能提升。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary> KIT 数据集中 R-Precision 的性能提升。\u003C\u002Fsummary>\n\n**回答：** KIT 数据集中 R-Precision 的评估依赖于基于 6,353 条文本描述构建的文本编码器。相比之下，MotionGPT 则受益于 LLM 
和海量语言数据，能够为运动生成**更长且更自然的语言描述**。然而，这导致**生成的描述与真实标注描述之间存在差异**，从而降低了 R-Precision 的得分。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n    \u003Csummary> MotionGPT 似乎以牺牲准确性为代价来实现额外的功能。\u003C\u002Fsummary> \n\u003Cimg width=\"447\" alt=\"figure10\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMotionLab_MotionGPT_readme_087cf283bb34.png\">\n\n**回答：** 如 **图10** 所示，MotionGPT 在四项运动相关任务中，共23项指标中有18项达到了 SOTA 水平。此外，HumanML3D 和 KIT 数据集的整体规模都较为有限，尤其是与数十亿级别的语言数据集相比，这会影响大规模模型的效果。未来我们将使用更大的运动-文本数据集来进一步评估 MotionGPT。另外，MotionGPT 引入了运动-语言预训练以及零样本能力，这是一个值得探索的有前景方向，可能激发进一步的研究和自训练机制。\n\n\u003C\u002Fdetails>\n\n### 关于插图\n\n\u003Cdetails>\n    \u003Csummary>可视化 VQ-VAE 学习到的词汇表中的一些标记。\u003C\u002Fsummary>\n\u003Cimg width=\"857\" alt=\"figure13\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMotionLab_MotionGPT_readme_b35d9ac48661.png\">\n\n**回答:** 如 **图13** 所示，我们可视化了**运动词汇表 $V_m$** 中的这些**运动标记**及其对应的局部时空上下文，这些内容被描绘在**4帧的运动片段**中。然而，由于训练是在标记序列上进行的，MotionGPT 在为每个单独的标记生成描述方面存在不足。\n\n你可以运行下面的脚本以可视化更多标记：\n\n```\npython -m scripts.get_code_visual --cfg configs\u002Fconfig_h3d_stage2.yaml\n```\n\n\u003C\u002Fdetails>\n\u003C\u002Fdetails>\n\n## 📖 引用\n\n如果你觉得我们的代码或论文对你有所帮助，请考虑引用以下文献：\n\n```bibtex\n@article{jiang2024motiongpt,\n  title={Motiongpt: Human motion as a foreign language},\n  author={Jiang, Biao and Chen, Xin and Liu, Wen and Yu, Jingyi and Yu, Gang and Chen, Tao},\n  journal={Advances in Neural Information Processing Systems},\n  volume={36},\n  year={2024}\n}\n\n@inproceedings{chen2023executing,\n  title={Executing your Commands via Motion Diffusion in Latent Space},\n  author={Chen, Xin and Jiang, Biao and Liu, Wen and Huang, Zilong and Fu, Bin and Chen, Tao and Yu, Gang},\n  booktitle={Proceedings of the IEEE\u002FCVF Conference on Computer Vision and Pattern Recognition},\n  pages={18000--18010},\n  year={2023}\n}\n```\n\n## 致谢\n\n感谢 
[Motion-latent-diffusion](https:\u002F\u002Fgithub.com\u002FChenFengYe\u002Fmotion-latent-diffusion)、[T2m-gpt](https:\u002F\u002Fgithub.com\u002FMael-zys\u002FT2M-GPT)、[TEMOS](https:\u002F\u002Fgithub.com\u002FMathux\u002FTEMOS)、[ACTOR](https:\u002F\u002Fgithub.com\u002FMathux\u002FACTOR)、[HumanML3D](https:\u002F\u002Fgithub.com\u002FEricGuo5513\u002FHumanML3D) 和 [joints2smpl](https:\u002F\u002Fgithub.com\u002Fwangsen1312\u002Fjoints2smpl)，我们的代码部分借鉴了它们的工作。\n\n## 许可证\n\n本代码采用 [MIT 许可证](LICENSE) 发布。\n\n请注意，我们的代码依赖于其他库，包括 SMPL、SMPL-X、PyTorch3D，并使用了一些数据集，而这些库和数据集各自有其相应的许可证，你也需要遵守这些许可证的规定。","# MotionGPT 快速上手指南\n\nMotionGPT 是一个统一且用户友好的动作 - 语言模型，能够将人体动作视为一种“外语”，实现文本生成动作、动作描述、动作预测等多种任务。\n\n## 1. 环境准备\n\n**系统要求：**\n- Python 3.10.6 (推荐)\n- PyTorch 2.0.0+\n- Linux \u002F macOS \u002F Windows (需配置相应编译环境)\n\n**前置依赖：**\n- Conda 包管理器\n- Git\n- Blender (仅可视化渲染需要，可选)\n\n## 2. 安装步骤\n\n### 2.1 创建虚拟环境\n```bash\nconda create python=3.10 --name mgpt\nconda activate mgpt\n```\n\n### 2.2 安装依赖包\n```bash\npip install -r requirements.txt\npython -m spacy download en_core_web_sm\n```\n> **提示**：国内用户建议使用清华或阿里镜像源加速安装：\n> `pip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`\n\n### 2.3 下载必要资源\n运行以下脚本自动下载 SMPL 模型、T5 权重及评估器：\n```bash\nbash prepare\u002Fdownload_smpl_model.sh\nbash prepare\u002Fprepare_t5.sh\nbash prepare\u002Fdownload_t2m_evaluators.sh\n```\n\n### 2.4 下载预训练模型\n```bash\nbash prepare\u002Fdownload_pretrained_models.sh\n```\n> **备选方案**：若脚本下载失败，可手动从 [Hugging Face](https:\u002F\u002Fhuggingface.co\u002FOpenMotionLab) 或 [Google Drive](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F10s5HXSFqd6UTOkW2OMNc27KGmMLkVc2L) 下载并放入对应目录。\n\n## 3. 
基本使用\n\n### 方式一：启动 Web 界面（推荐新手）\n最简单的交互方式，启动后可在浏览器中通过文本提示生成动作。\n```bash\npython app.py\n```\n启动后访问：`http:\u002F\u002F0.0.0.0:8888`\n\n### 方式二：命令行批量推理\n支持读取 txt 文件中的文本提示，生成动作数据（.npy）和文本描述。\n\n**示例命令（文本生成动作）：**\n```bash\npython demo.py --cfg .\u002Fconfigs\u002Fconfig_h3d_stage3.yaml --example .\u002Fdemos\u002Ft2m.txt\n```\n\n**参数说明：**\n- `--example`: 输入文本文件路径（每行一个动作描述）。\n- `--task`: 指定任务类型，可选值包括：\n  - `t2m`: 文本生成动作 (Text-to-Motion)\n  - `m2t`: 动作生成文本描述 (Motion Captioning)\n  - `pred`: 动作预测 (Motion Prediction)\n  - `inbetween`: 动作补全 (Motion In-between)\n\n**输出结果：**\n- `.npy` 文件：生成的动作数据，形状为 `(帧数，22, 3)`。\n- `.txt` 文件：输入的文本提示或生成的文本描述。","某独立游戏开发团队正在为一款叙事驱动的动作 RPG 制作过场动画，需要让角色根据剧本对话自然地完成行走、战斗和互动动作。\n\n### 没有 MotionGPT 时\n- 动画师必须手动逐帧调整关键姿势来匹配每一句台词，耗时且难以保证动作与语义的精准对齐。\n- 想要修改剧本中的动作描述（如将“愤怒地挥拳”改为“沮丧地垂手”），需重新制作整套动画序列，迭代成本极高。\n- 缺乏统一工具处理多种任务，生成动作、编写动作描述文案、预测下一帧动作需分别使用不同软件或插件，工作流割裂。\n- 非专业动画出身的策划人员无法直接通过文字快速预览动作效果，沟通依赖大量口头描述和草图，效率低下。\n\n### 使用 MotionGPT 后\n- 策划人员直接输入剧本台词或动作指令（如“角色悲伤地转身离开”），MotionGPT 即刻生成流畅且语义匹配的 3D 动作序列。\n- 修改剧情时只需调整文本提示词，MotionGPT 能实时重新生成对应的新动作，将迭代时间从数天缩短至几分钟。\n- 利用其统一模型架构，同一套系统即可胜任动作生成、自动撰写动作说明文案、以及补全缺失的中间帧动作，工作流高度整合。\n- 团队成员无需具备深厚动画知识，通过自然语言问答即可让 MotionGPT 理解意图并输出结果，大幅降低协作门槛。\n\nMotionGPT 通过将人体运动转化为大语言模型可理解的“外语”，彻底打破了文本创意与 3D 动作执行之间的壁垒，实现了“所言即所得”的高效内容生产。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FOpenMotionLab_MotionGPT_f0c43573.png","OpenMotionLab","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FOpenMotionLab_6159bb1d.jpg","",null,"https:\u002F\u002Fgithub.com\u002FOpenMotionLab",[82,86,90],{"name":83,"color":84,"percentage":85},"Python","#3572A5",98.5,{"name":87,"color":88,"percentage":89},"CSS","#663399",1,{"name":91,"color":92,"percentage":93},"Shell","#89e051",0.5,1898,139,"2026-04-03T05:43:25","MIT","Linux","需要 NVIDIA GPU (训练和渲染步骤明确使用 --cuda 参数)，具体型号和显存未说明，需支持 PyTorch 2.0 的 CUDA 版本","未说明",{"notes":102,"python":103,"dependencies":104},"1. 官方测试环境为 Python 3.10.6 和 PyTorch 2.0.0。2. 
首次运行需执行脚本下载 SMPL 模型、T5 权重、预训练模型及评估器（可通过 Google Drive 或 HuggingFace 手动下载）。3. 可视化渲染部分依赖 Blender，需单独配置 Blender 的 Python 环境并安装特定依赖。4. 运动翻译任务的评估指标依赖于 nlg-metricverse 包，可能与原论文使用的 nlg-eval 结果存在细微差异。","3.10.6",[105,106,107,108],"torch>=2.0.0","spacy","en_core_web_sm","nlg-metricverse",[54,26,15,52],[111,112,113,114,115,116,117,118,119,120],"3d-generation","chatgpt","gpt","language-model","motion","motion-generation","text-driven","text-to-motion","motiongpt","multi-modal","2026-03-27T02:49:30.150509","2026-04-06T06:54:21.144539",[124,129,134,139,144,149],{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},10337,"Stage1 训练在评估阶段崩溃，报错 'linalg.svd: input matrix contained non-finite values' 怎么办？","该问题通常由测试集（test split）中的数据损坏或非有限值引起。解决方案是将评估配置中的 SPLIT 从 'test' 改为 'val'。具体修改如下：\n在评估配置部分：\nEVAL:\n  BATCH_SIZE: 32\n  SPLIT: val  # 原为 test\n修改后训练即可正常运行。","https:\u002F\u002Fgithub.com\u002FOpenMotionLab\u002FMotionGPT\u002Fissues\u002F22",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},10338,"运行 WebUI 或生成动作时提示找不到 't2m\u002FVQVAEV3_CB1024_CMT_H1024_NRES3' 文件怎么办？","这是因为提供的数据包中缺少该特定模型文件。解决方法是将配置中的模型名称替换为 HumanML3D 仓库中提供的标准模型 'Decomp_SP001_SM001_H512'。此外，还需检查以下配置项：\n1. configs\u002Fwebui.yaml: 将 CHECKPOINTS 后缀从 .ckpt 改为 .tar。\n2. configs\u002Flm\u002Fdefault.yaml: 将 model_path 改为 'google\u002Fflan-t5-base'。\n3. 
configs\u002Fassets.yaml: 将 whisper_path 改为 'openai\u002Fwhisper-large-v2'。\n维护者表示该问题已在最新版本中修复，建议更新代码。","https:\u002F\u002Fgithub.com\u002FOpenMotionLab\u002FMotionGPT\u002Fissues\u002F14",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},10339,"为了复现论文结果，配置文件中的 END_EPOCH 参数应该设置为多少？默认值 999999 训练时间太长了。","不需要将 END_EPOCH 设置为 999999。为了复现论文中的实验并优化训练时间，建议按以下阶段设置 epoch 数量：\n- Stage 1: 3000 epochs\n- Stage 2: 300 epochs\n- Stage 3: 100 epochs\n这将与论文中提到的训练迭代次数（如 Stage2 的 300K）保持一致。","https:\u002F\u002Fgithub.com\u002FOpenMotionLab\u002FMotionGPT\u002Fissues\u002F89",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},10340,"项目中提到的 'raw motion data'（原始运动数据）具体指什么？是网格还是关键点？","'Raw motion data' 指的是原始的 3D 运动数据，包含所有关节的参数（包括旋转和位置）。这些数据通常是从动作捕捉（MoCap）系统优化得到的，可能仍包含一些抖动和噪声。在数学表示上，它可以是 SMPL 的姿态参数（每个关节的旋转向量 R^{1x3}）或旋转矩阵（R^{3x3}）。两者可视为等同。更多细节可参考 SOMA 项目相关论文。","https:\u002F\u002Fgithub.com\u002FOpenMotionLab\u002FMotionGPT\u002Fissues\u002F8",{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},10341,"MotionGPT 的代码和微调教程什么时候发布？","根据维护者的回复，MotionGPT 的第一个版本（v1.0）计划于 9 月 9 日发布。未来还将引入更多的模型和插件。请关注官方仓库以获取最新的发布信息和微调指南。","https:\u002F\u002Fgithub.com\u002FOpenMotionLab\u002FMotionGPT\u002Fissues\u002F5",{"id":150,"question_zh":151,"answer_zh":152,"source_url":153},10342,"这篇论文投稿到了哪个会议或期刊？Arxiv 链接指向的标题不同。","由于会议评审规则的限制，作者在评审期间不能公开透露具体的投稿会议或期刊信息。Arxiv 上的标题可能是预印本标题。作者建议对计算机视觉会议感兴趣的用户自行搜索相关信息。通常在论文被正式接收后，相关信息会在主页更新。","https:\u002F\u002Fgithub.com\u002FOpenMotionLab\u002FMotionGPT\u002Fissues\u002F1",[155],{"id":156,"version":157,"summary_zh":158,"released_at":159},107560,"MotionGPT-V1.0","Release training and inference of MotionGPT V1.0 🔥🔥🔥\r\nThe hugging face demo will arrive in two days.","2023-09-09T15:59:03"]