[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Eurus-Holmes--Awesome-Multimodal-Research":3,"tool-Eurus-Holmes--Awesome-Multimodal-Research":65},[4,23,32,40,49,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85013,2,"2026-04-06T11:09:19",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[17,13,20,19,18],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":29,"last_commit_at":38,"category_tags":39,"status":22},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 
解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74963,"2026-04-06T11:16:39",[19,13,20,18],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":46,"last_commit_at":47,"category_tags":48,"status":22},3215,"awesome-machine-learning","josephmisiti\u002Fawesome-machine-learning","awesome-machine-learning 是一份精心整理的机器学习资源清单，汇集了全球优秀的机器学习框架、库和软件工具。面对机器学习领域技术迭代快、资源分散且难以甄选的痛点，这份清单按编程语言（如 Python、C++、Go 等）和应用场景（如计算机视觉、自然语言处理、深度学习等）进行了系统化分类，帮助使用者快速定位高质量项目。\n\n它特别适合开发者、数据科学家及研究人员使用。无论是初学者寻找入门库，还是资深工程师对比不同语言的技术选型，都能从中获得极具价值的参考。此外，清单还延伸提供了免费书籍、在线课程、行业会议、技术博客及线下聚会等丰富资源，构建了从学习到实践的全链路支持体系。\n\n其独特亮点在于严格的维护标准：明确标记已停止维护或长期未更新的项目，确保推荐内容的时效性与可靠性。作为机器学习领域的“导航图”，awesome-machine-learning 以开源协作的方式持续更新，旨在降低技术探索门槛，让每一位从业者都能高效地站在巨人的肩膀上创新。",72149,1,"2026-04-03T21:50:24",[20,18],{"id":50,"name":51,"github_repo":52,"description_zh":53,"stars":54,"difficulty_score":46,"last_commit_at":55,"category_tags":56,"status":22},2234,"scikit-learn","scikit-learn\u002Fscikit-learn","scikit-learn 是一个基于 Python 构建的开源机器学习库，依托于 SciPy、NumPy 等科学计算生态，旨在让机器学习变得简单高效。它提供了一套统一且简洁的接口，涵盖了从数据预处理、特征工程到模型训练、评估及选择的全流程工具，内置了包括线性回归、支持向量机、随机森林、聚类等在内的丰富经典算法。\n\n对于希望快速验证想法或构建原型的数据科学家、研究人员以及 Python 开发者而言，scikit-learn 是不可或缺的基础设施。它有效解决了机器学习入门门槛高、算法实现复杂以及不同模型间调用方式不统一的痛点，让用户无需重复造轮子，只需几行代码即可调用成熟的算法解决分类、回归、聚类等实际问题。\n\n其核心技术亮点在于高度一致的 API 设计风格，所有估算器（Estimator）均遵循相同的调用逻辑，极大地降低了学习成本并提升了代码的可读性与可维护性。此外，它还提供了强大的模型选择与评估工具，如交叉验证和网格搜索，帮助用户系统地优化模型性能。作为一个由全球志愿者共同维护的成熟项目，scikit-learn 以其稳定性、详尽的文档和活跃的社区支持，成为连接理论学习与工业级应用的最",65644,"2026-04-06T10:25:08",[20,18,14],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":10,"last_commit_at":63,"category_tags":64,"status":22},3364,"keras","keras-team\u002Fkeras","Keras 
是一个专为人类设计的深度学习框架，旨在让构建和训练神经网络变得简单直观。它解决了开发者在不同深度学习后端之间切换困难、模型开发效率低以及难以兼顾调试便捷性与运行性能的痛点。\n\n无论是刚入门的学生、专注算法的研究人员，还是需要快速落地产品的工程师，都能通过 Keras 轻松上手。它支持计算机视觉、自然语言处理、音频分析及时间序列预测等多种任务。\n\nKeras 3 的核心亮点在于其独特的“多后端”架构。用户只需编写一套代码，即可灵活选择 TensorFlow、JAX、PyTorch 或 OpenVINO 作为底层运行引擎。这一特性不仅保留了 Keras 一贯的高层易用性，还允许开发者根据需求自由选择：利用 JAX 或 PyTorch 的即时执行模式进行高效调试，或切换至速度最快的后端以获得最高 350% 的性能提升。此外，Keras 具备强大的扩展能力，能无缝从本地笔记本电脑扩展至大规模 GPU 或 TPU 集群，是连接原型开发与生产部署的理想桥梁。",63927,"2026-04-04T15:24:37",[20,14,18],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":82,"owner_twitter":83,"owner_website":84,"owner_url":85,"languages":86,"stars":91,"forks":92,"last_commit_at":93,"license":94,"difficulty_score":46,"env_os":95,"env_gpu":96,"env_ram":96,"env_deps":97,"category_tags":100,"github_topics":101,"view_count":10,"oss_zip_url":106,"oss_zip_packed_at":106,"status":22,"created_at":107,"updated_at":108,"faqs":109,"releases":110},4423,"Eurus-Holmes\u002FAwesome-Multimodal-Research","Awesome-Multimodal-Research","A curated list of Multimodal Related Research.","Awesome-Multimodal-Research 是一个精心整理的多模态研究资源清单，旨在为关注人工智能前沿的社区提供一站式导航。随着 AI 技术从单一文本或图像处理向“多模态”融合演进，海量论文与技术动态让追踪变得困难。这份清单有效解决了信息分散、难以系统获取最新成果的痛点，将零散的研究论文、开源项目及行业大事件（如 GPT-4、PaLM-E、Kosmos-1 等模型的发布）汇聚成条理清晰的知识库。\n\n它特别适合 AI 研究人员、算法工程师、开发者以及对多模态机器学习感兴趣的学生使用。无论是需要快速调研领域现状、寻找灵感，还是希望跟进 Google、OpenAI、微软等顶尖机构的最新突破，这里都能提供高效指引。其独特亮点在于不仅罗列文献，还持续更新包含技术解读与官方链接的\"News\"板块，帮助用户直观理解多模态大模型如何结合视觉与语言能力，实现更复杂的指令遵循与跨域任务。作为由社区共同维护的开源项目，它鼓励协作贡献，是探索多模态智能未来不可或缺的实用指南。","# Awesome Multimodal Research 
[![Awesome](https:\u002F\u002Fcdn.rawgit.com\u002Fsindresorhus\u002Fawesome\u002Fd7305f38d29fed78fa85652e3a63e154dd8e8829\u002Fmedia\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fsindresorhus\u002Fawesome)\n\n![build](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fbuild-passing-brightgreen.svg)\n![license](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-brightgreen.svg)\n![prs](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPRs-welcome-brightgreen.svg)\n\n> This repo is reorganized from [Paul Liang's repo: Reading List for Topics in Multimodal Machine Learning](https:\u002F\u002Fgithub.com\u002Fpliang279\u002Fawesome-multimodal-ml), feel free to raise pull requests! \n\n\n## News\n\n**[03\u002F2023] OpenAI:** *[ChatGPT plugins](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fplugins\u002Fintroduction) are tools designed specifically for language models with safety as a core principle, and help ChatGPT access up-to-date information, run computations, or use third-party services. https:\u002F\u002Fopenai.com\u002Fblog\u002Fchatgpt-plugins*\n> *\"We’re also hosting two plugins ourselves, a [web browser](https:\u002F\u002Fopenai.com\u002Fblog\u002Fchatgpt-plugins#browsing) and [code interpreter](https:\u002F\u002Fopenai.com\u002Fblog\u002Fchatgpt-plugins#code-interpreter). We’ve also open-sourced the code for a knowledge base [retrieval plugin](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fchatgpt-retrieval-plugin), to be self-hosted by any developer with information with which they’d like to augment ChatGPT.\"*\n\n**[03\u002F2023] Google Research:** *[Bard](https:\u002F\u002Fblog.google\u002Ftechnology\u002Fai\u002Ftry-bard\u002F) is an early experiment that lets you collaborate with generative AI, powered by a research large language model (LLM), specifically a lightweight and optimized version of LaMDA. 
https:\u002F\u002Fbard.google.com\u002F*\n\n**[03\u002F2023] OpenAI:** *[GPT-4](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002Fgpt-4.pdf) is a large multimodal model (accepting image and text inputs, emitting text outputs) that, while less capable than humans in many real-world scenarios, exhibits human-level performance on various professional and academic benchmarks. https:\u002F\u002Fopenai.com\u002Fresearch\u002Fgpt-4*\n\n**[03\u002F2023] Google Research:** *[PaLM-E](https:\u002F\u002Fpalm-e.github.io\u002Fassets\u002Fpalm-e.pdf) is a new generalist robotics model that overcomes these issues by transferring knowledge from varied visual and language domains to a robotics system. https:\u002F\u002Fai.googleblog.com\u002F2023\u002F03\u002Fpalm-e-embodied-multimodal-language.html*\n\n**[03\u002F2023] OpenAI:** *[ChatGPT and Whisper APIs](https:\u002F\u002Fopenai.com\u002Fblog\u002Fintroducing-chatgpt-and-whisper-apis), developers can now integrate ChatGPT and Whisper models into their apps and products through API. https:\u002F\u002Fopenai.com\u002Fblog\u002Fintroducing-chatgpt-and-whisper-apis*\n\n**[02\u002F2023] MSR:** *[Kosmos-1](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.14045.pdf) is a multimodal large language model (MLLM) that is capable of perceiving multimodal input, following instructions, and performing in-context learning for not only language tasks but also multimodal tasks. https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm#llm--mllm-multimodal-llm*\n\n**[01\u002F2023] Google Research:** *[2022 & beyond: Language, vision and generative models](https:\u002F\u002Fai.googleblog.com\u002F2023\u002F01\u002Fgoogle-research-2022-beyond-language.html), a post of a series in which researchers across Google will highlight some exciting progress in 2022 and present the vision for 2023 and beyond. 
https:\u002F\u002Fai.googleblog.com\u002F2023\u002F01\u002Fgoogle-research-2022-beyond-language.html*\n\n**[11\u002F2022] OpenAI:** *[ChatGPT](https:\u002F\u002Fchat.openai.com\u002Fchat) is a sibling model to [InstructGPT](https:\u002F\u002Fopenai.com\u002Fresearch\u002Finstruction-following), which is trained to follow an instruction in a prompt and provide a detailed response. https:\u002F\u002Fopenai.com\u002Fblog\u002Fchatgpt*\n\n**[08\u002F2022] MSR:** *[Multimodal Pretraining](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm#multimodal-x--language): [BEiT-3](https:\u002F\u002Farxiv.org\u002Fabs\u002F2208.10442) is a general-purpose multimodal foundation model, which achieves state-of-the-art transfer performance on both vision and vision-language tasks. https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002Fbeit*\n\n**[04\u002F2022] OpenAI:** *[DALL·E 2](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.06125.pdf) is a new AI system that can create realistic images and art from a description in natural language. https:\u002F\u002Fopenai.com\u002Fdall-e-2\u002F*\n\n**[05\u002F2021] Google:** *[MUM](https:\u002F\u002Fblog.google\u002Fproducts\u002Fsearch\u002Fintroducing-mum\u002F), a new AI milestone for understanding information. https:\u002F\u002Fblog.google\u002Fproducts\u002Fsearch\u002Fintroducing-mum\u002F*\n\n**[03\u002F2021] OpenAI:** *[Multimodal Neurons](https:\u002F\u002Fdistill.pub\u002F2021\u002Fmultimodal-neurons\u002F) in Artificial Neural Networks, which may explain CLIP’s accuracy in classifying surprising visual renditions of concepts, and is also an important step toward understanding the associations and biases that CLIP and similar models learn. 
https:\u002F\u002Fopenai.com\u002Fblog\u002Fmultimodal-neurons\u002F*\n\n**[01\u002F2021] OpenAI:** *[CLIP](https:\u002F\u002Fopenai.com\u002Fblog\u002Fclip\u002F) maps images into categories described in text, and [DALL-E](https:\u002F\u002Fopenai.com\u002Fblog\u002Fdall-e\u002F) creates new images from text. A step toward systems with deeper understanding of the world. https:\u002F\u002Fopenai.com\u002Fmultimodal\u002F*\n\n\n## Research Papers\n\n* [Survey Papers](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FSurvey-Papers)\n* [Core Areas](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas)\n  * [Representation Learning](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FRepresentation-Learning)\n  * [Multimodal Fusion](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FMultimodal-Fusion)\n  * [Multimodal Alignment](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FMultimodal-Alignment)\n  * [Multimodal Translation](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FMultimodal-Translation)\n  * [Missing or Imperfect Modalities](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FMissing-or-Imperfect-Modalities)\n  * [Knowledge Graphs and Knowledge Bases](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FKnowledge-Graphs-and-Knowledge-Bases)\n  * [Intepretable 
Learning](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FIntepretable-Learning)\n  * [Generative Learning](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FGenerative-Learning)\n  * [Semi-supervised Learning](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FSemi-supervised-Learning)\n  * [Self-supervised Learning](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FSelf-supervised-Learning)\n  * [Language Models](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FLanguage-Models)\n  * [Adversarial Attacks](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FAdversarial-Attacks)\n  * [Few-Shot Learning](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FFew-Shot-Learning)\n  * [Bias and Fairness](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FBias-and-Fairness)\n* [Applications](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FApplications)\n  * [Language and Visual QA](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FApplications\u002FLanguage-and-Visual-QA)\n  * [Language Grounding in Vision](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FApplications\u002FLanguage-Grounding-in-Vision)\n 
 * [Language Grounding in Navigation](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FApplications\u002FLanguage-Grouding-in-Navigation)\n  * [Multimodal Machine Translation](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FApplications\u002FMultimodal-Machine-Translation)\n  * [Multi-agent Communication](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FApplications\u002FMulti-agent-Communication)\n  * [Commonsense Reasoning](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FApplications\u002FCommonsense-Reasoning)\n  * [Multimodal Reinforcement Learning](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FApplications\u002FMultimodal-Reinforcement-Learning)\n  * [Multimodal Dialog](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FApplications\u002FMultimodal-Dialog)\n  * [Language and Audio](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FApplications\u002FLanguage-and-Audio)\n  * [Audio and Visual](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FApplications\u002FAudio-and-Visual)\n  * [Media Description](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FApplications\u002FMedia-Description)\n  * [Video Generation from Text](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FApplications\u002FVideo-Generation-from-Text)\n  * [Affect Recognition and Multimodal 
Language](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FApplications\u002FAffect-Recognition-and-Multimodal-Language)\n  * [Healthcare](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FApplications\u002FHealthcare)\n  * [Robotics](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FApplications\u002FRobotics)\n  * [Autonomous Driving](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FApplications\u002FAutonomous-Driving)\n* [Workshops](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fworkshops)\n* [Tutorials](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Ftutorials)\n* [Courses](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fcourses)\n\n\n## Recent Workshop\n\n[Social Intelligence in Humans and Robots](https:\u002F\u002Fsocial-intelligence-human-ai.github.io\u002F), ICRA 2021\n\n[LANTERN 2021](https:\u002F\u002Fwww.lantern.uni-saarland.de\u002F2021\u002F): The Third Workshop Beyond Vision and LANguage: inTEgrating Real-world kNowledge, EACL 2021\n\nMultimodal workshops: [Multimodal Learning and Applications](https:\u002F\u002Fmula-workshop.github.io\u002F), [Sight and Sound](http:\u002F\u002Fsightsound.org\u002F), [Visual Question Answering](https:\u002F\u002Fvisualqa.org\u002Fworkshop), [Embodied AI](https:\u002F\u002Fembodied-ai.org\u002F), [Language for 3D Scenes](http:\u002F\u002Flanguage3dscenes.github.io\u002F), CVPR 2021\n\n[Advances in Language and Vision Research (ALVR)](https:\u002F\u002Falvr-workshop.github.io\u002F), NAACL 2021\n\n[Visually Grounded Interaction and Language 
(ViGIL)](https:\u002F\u002Fvigilworkshop.github.io\u002F), NAACL 2021\n\n[Wordplay: When Language Meets Games](https:\u002F\u002Fwordplay-workshop.github.io\u002F), NeurIPS 2020\n\n[NLP Beyond Text](https:\u002F\u002Fsites.google.com\u002Fview\u002Fnlpbt-2020), EMNLP 2020\n\n[International Challenge on Compositional and Multimodal Perception](https:\u002F\u002Fcamp-workshop.stanford.edu\u002F), ECCV 2020\n\n[Multimodal Video Analysis Workshop and Moments in Time Challenge](https:\u002F\u002Fsites.google.com\u002Fview\u002Fmultimodalvideo-v2), ECCV 2020\n\n[Video Turing Test: Toward Human-Level Video Story Understanding](https:\u002F\u002Fdramaqa.snu.ac.kr\u002FWorkshop\u002F2020), ECCV 2020\n\n[Grand Challenge and Workshop on Human Multimodal Language](http:\u002F\u002Fmulticomp.cs.cmu.edu\u002Facl2020multimodalworkshop\u002F), ACL 2020\n\n[Workshop on Multimodal Learning](https:\u002F\u002Fmul-workshop.github.io\u002F), CVPR 2020\n\n[Language & Vision with applications to Video Understanding](https:\u002F\u002Flanguageandvision.github.io\u002F), CVPR 2020\n\n[International Challenge on Activity Recognition (ActivityNet)](http:\u002F\u002Factivity-net.org\u002Fchallenges\u002F2020\u002F), CVPR 2020\n\n[The End-of-End-to-End A Video Understanding Pentathlon](https:\u002F\u002Fwww.robots.ox.ac.uk\u002F~vgg\u002Fchallenges\u002Fvideo-pentathlon\u002F), CVPR 2020\n\n[Towards Human-Centric Image\u002FVideo Synthesis, and the 4th Look Into Person (LIP) Challenge](https:\u002F\u002Fvuhcs.github.io\u002F), CVPR 2020\n\n[Visual Question Answering and Dialog](https:\u002F\u002Fvisualqa.org\u002Fworkshop), CVPR 2020\n\n\n## Recent Tutorial\n\n[Tutorials on Multimodal Machine Learning](https:\u002F\u002Fcmu-multicomp-lab.github.io\u002Fmmml-tutorial\u002Fcvpr2022\u002F), CVPR 2022 && NAACL 2022\n\n[Multi-modal Information Extraction from Text, Semi-structured, and Tabular Data on the Web 
(Cutting-edge)](https:\u002F\u002Fsites.google.com\u002Fview\u002Facl-2020-multi-modal-ie), ACL 2020\n\n[Achieving Common Ground in Multi-modal Dialogue (Cutting-edge)](https:\u002F\u002Fgithub.com\u002Fmalihealikhani\u002FGrounding_in_Dialogue), ACL 2020\n\n[Recent Advances in Vision-and-Language Research](https:\u002F\u002Frohit497.github.io\u002FRecent-Advances-in-Vision-and-Language-Research\u002F), CVPR 2020\n\n[Neuro-Symbolic Visual Reasoning and Program Synthesis](http:\u002F\u002Fnscv.csail.mit.edu\u002F), CVPR 2020\n\n[Large Scale Holistic Video Understanding](https:\u002F\u002Fholistic-video-understanding.github.io\u002Ftutorials\u002Fcvpr2020.html), CVPR 2020\n\n[A Comprehensive Tutorial on Video Modeling](https:\u002F\u002Fcvpr20-video.mxnet.io\u002F), CVPR 2020\n\n\n\n","# 令人惊叹的多模态研究 [![Awesome](https:\u002F\u002Fcdn.rawgit.com\u002Fsindresorhus\u002Fawesome\u002Fd7305f38d29fed78fa85652e3a63e154dd8e8829\u002Fmedia\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fsindresorhus\u002Fawesome)\n\n![build](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fbuild-passing-brightgreen.svg)\n![license](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-brightgreen.svg)\n![prs](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPRs-welcome-brightgreen.svg)\n\n> 本仓库由 [Paul Liang 的仓库：多模态机器学习主题阅读清单](https:\u002F\u002Fgithub.com\u002Fpliang279\u002Fawesome-multimodal-ml) 重新整理而成，欢迎提交 Pull Request！\n\n\n## 最新动态\n\n**[2023年3月] OpenAI:** *[ChatGPT 插件](https:\u002F\u002Fplatform.openai.com\u002Fdocs\u002Fplugins\u002Fintroduction) 是专为语言模型设计的工具，以安全性为核心原则，帮助 ChatGPT 获取最新信息、执行计算或使用第三方服务。https:\u002F\u002Fopenai.com\u002Fblog\u002Fchatgpt-plugins*\n> *\"我们还自建了两个插件，分别是 [网页浏览器](https:\u002F\u002Fopenai.com\u002Fblog\u002Fchatgpt-plugins#browsing) 和 [代码解释器](https:\u002F\u002Fopenai.com\u002Fblog\u002Fchatgpt-plugins#code-interpreter)。此外，我们还开源了一个知识库 [检索插件](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fchatgpt-retrieval-plugin) 的代码，供任何开发者自行部署，并用他们希望增强 ChatGPT 
的信息来扩展其功能。\"*\n\n**[2023年3月] Google Research:** *[Bard](https:\u002F\u002Fblog.google\u002Ftechnology\u002Fai\u002Ftry-bard\u002F) 是一项早期实验，允许你与生成式 AI 合作，其背后是研究型大型语言模型（LLM），具体来说是 LaMDA 的轻量级优化版本。https:\u002F\u002Fbard.google.com\u002F*\n\n**[2023年3月] OpenAI:** *[GPT-4](https:\u002F\u002Fcdn.openai.com\u002Fpapers\u002Fgpt-4.pdf) 是一种大型多模态模型（可接受图像和文本输入，输出文本），尽管在许多现实场景中能力不及人类，但在各类专业和学术基准测试上表现出接近人类的水平。https:\u002F\u002Fopenai.com\u002Fresearch\u002Fgpt-4*\n\n**[2023年3月] Google Research:** *[PaLM-E](https:\u002F\u002Fpalm-e.github.io\u002Fassets\u002Fpalm-e.pdf) 是一种新型通用机器人模型，通过将来自不同视觉和语言领域的知识迁移到机器人系统中，从而克服现有问题。https:\u002F\u002Fai.googleblog.com\u002F2023\u002F03\u002Fpalm-e-embodied-multimodal-language.html*\n\n**[2023年3月] OpenAI:** *[ChatGPT 和 Whisper API](https:\u002F\u002Fopenai.com\u002Fblog\u002Fintroducing-chatgpt-and-whisper-apis)，开发者现在可以通过 API 将 ChatGPT 和 Whisper 模型集成到自己的应用和产品中。https:\u002F\u002Fopenai.com\u002Fblog\u002Fintroducing-chatgpt-and-whisper-apis*\n\n**[2023年2月] MSR:** *[Kosmos-1](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2302.14045.pdf) 是一种多模态大型语言模型（MLLM），能够感知多模态输入、遵循指令，并不仅限于语言任务，还能进行多模态任务的上下文学习。https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm#llm--mllm-multimodal-llm*\n\n**[2023年1月] Google Research:** *[2022 年及以后：语言、视觉与生成模型](https:\u002F\u002Fai.googleblog.com\u002F2023\u002F01\u002Fgoogle-research-2022-beyond-language.html)，这是一系列文章的一部分，谷歌的研究人员将在其中重点介绍 2022 年的一些令人振奋的进展，并展望 2023 年及未来的愿景。https:\u002F\u002Fai.googleblog.com\u002F2023\u002F01\u002Fgoogle-research-2022-beyond-language.html*\n\n**[2022年11月] OpenAI:** *[ChatGPT](https:\u002F\u002Fchat.openai.com\u002Fchat) 是 [InstructGPT](https:\u002F\u002Fopenai.com\u002Fresearch\u002Finstruction-following) 的姊妹模型，经过训练可以遵循提示中的指令并给出详细的回应。https:\u002F\u002Fopenai.com\u002Fblog\u002Fchatgpt*\n\n**[2022年8月] MSR:** *[多模态预训练](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm#multimodal-x--language): [BEiT-3](https:\u002F\u002Farxiv.org\u002Fabs\u002F2208.10442) 
是一种通用的多模态基础模型，在视觉任务以及视觉-语言任务上均达到了最先进的迁移性能。https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Funilm\u002Ftree\u002Fmaster\u002Fbeit*\n\n**[2022年4月] OpenAI:** *[DALL·E 2](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2204.06125.pdf) 是一种新的 AI 系统，可以根据自然语言描述生成逼真的图像和艺术作品。https:\u002F\u002Fopenai.com\u002Fdall-e-2\u002F*\n\n**[2021年5月] Google:** *[MUM](https:\u002F\u002Fblog.google\u002Fproducts\u002Fsearch\u002Fintroducing-mum\u002F) 是理解信息方面的一个全新 AI 里程碑。https:\u002F\u002Fblog.google\u002Fproducts\u002Fsearch\u002Fintroducing-mum\u002F*\n\n**[2021年3月] OpenAI:** *[人工神经网络中的多模态神经元](https:\u002F\u002Fdistill.pub\u002F2021\u002Fmultimodal-neurons\u002F) 可能解释了 CLIP 在分类概念的意外视觉呈现时所展现出的高准确性，同时也是理解 CLIP 及其类似模型所学习到的关联性和偏见的重要一步。https:\u002F\u002Fopenai.com\u002Fblog\u002Fmultimodal-neurons\u002F*\n\n**[2021年1月] OpenAI:** *[CLIP](https:\u002F\u002Fopenai.com\u002Fblog\u002Fclip\u002F) 可以将图像映射到文本描述的类别中，而 [DALL-E](https:\u002F\u002Fopenai.com\u002Fblog\u002Fdall-e\u002F) 则可以根据文本生成新图像。这是迈向更深入理解世界系统的一步。https:\u002F\u002Fopenai.com\u002Fmultimodal\u002F*\n\n## 研究论文\n\n* [综述论文](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FSurvey-Papers)\n* [核心领域](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas)\n  * [表示学习](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FRepresentation-Learning)\n  * [多模态融合](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FMultimodal-Fusion)\n  * [多模态对齐](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FMultimodal-Alignment)\n  * 
[多模态翻译](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FMultimodal-Translation)\n  * [缺失或不完善的模态](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FMissing-or-Imperfect-Modalities)\n  * [知识图谱与知识库](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FKnowledge-Graphs-and-Knowledge-Bases)\n  * [可解释学习](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FIntepretable-Learning)\n  * [生成式学习](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FGenerative-Learning)\n  * [半监督学习](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FSemi-supervised-Learning)\n  * [自监督学习](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FSelf-supervised-Learning)\n  * [语言模型](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FLanguage-Models)\n  * [对抗攻击](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FAdversarial-Attacks)\n  * [少样本学习](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FFew-Shot-Learning)\n  * [偏见与公平性](https:\u002F\u002Fgithub.com\u002FEurus-Holmes\u002FAwesome-Multimodal-Research\u002Ftree\u002Fmaster\u002Fpapers\u002FCore-Areas\u002FBias-and-Fairness)\n* 
* [Applications](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/papers/Applications)
  * [Language and Visual QA](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/papers/Applications/Language-and-Visual-QA)
  * [Language Grounding in Vision](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/papers/Applications/Language-Grounding-in-Vision)
  * [Language Grounding in Navigation](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/papers/Applications/Language-Grouding-in-Navigation)
  * [Multimodal Machine Translation](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/papers/Applications/Multimodal-Machine-Translation)
  * [Multi-agent Communication](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/papers/Applications/Multi-agent-Communication)
  * [Commonsense Reasoning](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/papers/Applications/Commonsense-Reasoning)
  * [Multimodal Reinforcement Learning](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/papers/Applications/Multimodal-Reinforcement-Learning)
  * [Multimodal Dialog](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/papers/Applications/Multimodal-Dialog)
  * [Language and Audio](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/papers/Applications/Language-and-Audio)
  * [Audio and Visual](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/papers/Applications/Audio-and-Visual)
  * [Media Description](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/papers/Applications/Media-Description)
  * [Video Generation from Text](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/papers/Applications/Video-Generation-from-Text)
  * [Affect Recognition and Multimodal Language](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/papers/Applications/Affect-Recognition-and-Multimodal-Language)
  * [Healthcare](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/papers/Applications/Healthcare)
  * [Robotics](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/papers/Applications/Robotics)
  * [Autonomous Driving](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/papers/Applications/Autonomous-Driving)
* [Workshops](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/workshops)
* [Tutorials](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/tutorials)
* [Courses](https://github.com/Eurus-Holmes/Awesome-Multimodal-Research/tree/master/courses)

## Recent Workshops

[Social Intelligence in Humans and Robots](https://social-intelligence-human-ai.github.io/), ICRA 2021

[LANTERN 2021](https://www.lantern.uni-saarland.de/2021/): The Third Workshop Beyond Vision and Language: Integrating Real-World Knowledge, EACL 2021

Multimodal workshops: [Multimodal Learning and Applications](https://mula-workshop.github.io/), [Sight and Sound](http://sightsound.org/), [Visual Question Answering](https://visualqa.org/workshop), [Embodied AI](https://embodied-ai.org/), [Language for 3D Scenes](http://language3dscenes.github.io/), CVPR 2021

[Advances in Language and Vision Research (ALVR)](https://alvr-workshop.github.io/), NAACL 2021

[Visually Grounded Interaction and Language (ViGIL)](https://vigilworkshop.github.io/), NAACL 2021

[Wordplay: When Language Meets Games](https://wordplay-workshop.github.io/), NeurIPS 2020

[NLP Beyond Text](https://sites.google.com/view/nlpbt-2020), EMNLP 2020

[International Challenge on Compositional and Multimodal Perception](https://camp-workshop.stanford.edu/), ECCV 2020

[Multimodal Video Analysis Workshop and Moments in Time Challenge](https://sites.google.com/view/multimodalvideo-v2), ECCV 2020

[Video Turing Test: Toward Human-Level Video Story Understanding](https://dramaqa.snu.ac.kr/Workshop/2020), ECCV 2020

[Grand Challenge and Workshop on Human Multimodal Language](http://multicomp.cs.cmu.edu/acl2020multimodalworkshop/), ACL 2020

[Workshop on Multimodal Learning](https://mul-workshop.github.io/), CVPR 2020

[Language and Vision with Applications to Video Understanding](https://languageandvision.github.io/), CVPR 2020

[International Challenge on Activity Recognition (ActivityNet)](http://activity-net.org/challenges/2020/), CVPR 2020

[A Video Understanding Pentathlon](https://www.robots.ox.ac.uk/~vgg/challenges/video-pentathlon/), CVPR 2020

[Toward Human-Centric Image/Video Synthesis, and the 4th Look Into Person (LIP) Challenge](https://vuhcs.github.io/), CVPR 2020

[Visual Question Answering and Dialog](https://visualqa.org/workshop), CVPR 2020

## Recent Tutorials

[Tutorial on Multimodal Machine Learning](https://cmu-multicomp-lab.github.io/mmml-tutorial/cvpr2022/), CVPR 2022 and NAACL 2022

[Multi-modal Information Extraction from Text, Semi-structured, and Tabular Data on the Web (Cutting-edge)](https://sites.google.com/view/acl-2020-multi-modal-ie), ACL 2020

[Achieving Common Ground in Multi-modal Dialogue (Cutting-edge)](https://github.com/malihealikhani/Grounding_in_Dialogue), ACL 2020

[Recent Advances in Vision-and-Language Research](https://rohit497.github.io/Recent-Advances-in-Vision-and-Language-Research/), CVPR 2020

[Neuro-Symbolic Visual Reasoning and Program Synthesis](http://nscv.csail.mit.edu/), CVPR 2020

[Tutorial on Large Scale Holistic Video Understanding](https://holistic-video-understanding.github.io/tutorials/cvpr2020.html), CVPR 2020

[A Comprehensive Tutorial on Video Modeling](https://cvpr20-video.mxnet.io/), CVPR 2020
# Awesome-Multimodal-Research Quick-Start Guide

**Project overview**:
`Awesome-Multimodal-Research` is not a runnable software library or framework but a **curated list of multimodal machine learning research resources**. It gathers recent survey papers, core research directions (such as representation learning and multimodal fusion), application areas (such as visual question answering and robotics), and links to related workshops and tutorials. Its goal is to give researchers and developers one-stop literature navigation.

Accordingly, this quick-start guide shows how to efficiently obtain, browse, and make use of this resource.

## Prerequisites

The project is a collection of documents and resource links, so **no deep learning environment or dependency installation is required**. You only need:

*   **Operating system**: Windows, macOS, or Linux.
*   **Tools**:
    *   A modern web browser (Chrome, Edge, or Firefox recommended).
    *   Git (optional, for cloning the repository for offline reading or contributing).
*   **Network**:
    *   Some links point to GitHub, arXiv, or institutional blogs, so make sure your connection can reach them.
    *   **Mirrors for mainland China**: if access to github.com is slow, try cloning from a domestic mirror:
        ```bash
        # Use a Gitee mirror (if one exists) or a GitHub acceleration proxy
        git clone https://gitee.com/mirrors/Awesome-Multimodal-Research.git
        # Otherwise, use a standard git clone with a proxy configured
        ```
        *(Note: if no Gitee mirror is available, a standard clone through a proxy or a GitHub acceleration service works just as well.)*

## Installation (Getting the Resources)

You can obtain the list in either of two ways:

### Option 1: Browse online (recommended)
Visit the GitHub repository directly and use the directory tree to jump to the sections you care about.
*   **URL**: https://github.com/Eurus-Holmes/Awesome-Multimodal-Research

### Option 2: Clone locally
If you prefer offline reading or note-taking, clone the repository:

1.  Open a terminal (Terminal or CMD).
2.  Run:
    ```bash
    git clone https://github.com/Eurus-Holmes/Awesome-Multimodal-Research.git
    ```
3.  Enter the project directory:
    ```bash
    cd Awesome-Multimodal-Research
    ```
4.  Open `README.md` in a Markdown reader (such as VS Code or Typora), or browse the category files under `papers/`.

## Basic Usage

The core workflow is to **navigate by topic**: find the paper list that matches your research interest.

### 1. Find survey papers (Survey Papers)
If you are new to the multimodal field, start with the surveys.
*   **Path**: open the `papers/Survey-Papers` folder in the repository.
*   **Contents**: surveys covering all aspects of multimodal learning, useful for building a big-picture view of the field.

### 2. Explore core research areas (Core Areas)
For deep dives into specific techniques:
*   **Path**: `papers/Core-Areas`.
*   **Examples**:
    *   Researching image-text matching? See `Multimodal-Alignment`.
    *   Researching generative models? See `Generative-Learning`.
    *   Researching large language models? See `Language-Models`.
*   **Usage**: open the relevant subfolder to find paper lists, ordered by date or importance, with arXiv/PDF links.

### 3. Find application areas (Applications)
To see how the techniques land in concrete scenarios:
*   **Path**: `papers/Applications`.
*   **Popular directions**:
    *   `Robotics`
    *   `Autonomous-Driving`
    *   `Healthcare`
    *   `Language-and-Visual-QA`

### 4. Track the latest developments (News)
The **News** section at the top of `README.md` records recent multimodal large-model progress such as GPT-4, PaLM-E, and ChatGPT plugins. Check it regularly to stay current with the frontier.

### 5. Contribute
If you find a new high-quality paper or resource, contribute it to the community via a pull request (PR):
```bash
# 1. Fork the repository
# 2. Clone your fork
git clone https://github.com/<your-username>/Awesome-Multimodal-Research.git

# 3. Create a new branch
git checkout -b add-new-paper

# 4. Add the paper link and a short description to the relevant .md file
# 5. Stage, commit, and push
git add .
git commit -m "Add new paper on [Topic]"
git push origin add-new-paper
# 6. Open a pull request on GitHub
```

## A Typical Use Case

A PhD student in a university AI lab is writing a survey on embodied intelligence and multimodal large models, and urgently needs to organize the wave of results published between 2022 and 2023.

### Without Awesome-Multimodal-Research
- **Fragmented search**: switching back and forth between arXiv, Google Scholar, and company blogs, with no systematic way to capture cross-domain models such as PaLM-E or Kosmos-1.
- **Easy-to-miss milestones**: in the flood of papers it is easy to overlook details with real impact on a research direction, such as GPT-4's multimodal capabilities or the ChatGPT plugin mechanism.
- **Costly verification**: repositories found by searching are often unmaintained or unclearly licensed, so significant time goes into checking whether a project is actually open source and actively supported.
- **Hard-to-trace lineage**: it is difficult to quickly reconstruct the technical evolution from BEiT-3 to DALL·E 2, leaving the related-work section loosely structured and short on authoritative citations.

### With Awesome-Multimodal-Research
- **One-stop resources**: consult the curated, chronologically organized list and immediately locate the latest embodied intelligence and pretrained-model papers from Google Research and Microsoft.
- **Up-to-the-minute news**: the "News" section surfaces key breakthroughs such as GPT-4's visual input capability and Whisper API integration, keeping the research perspective current.
- **Engineering-ready entries**: entries carry official code links and clear license labels (e.g. MIT), so they can be cloned and reproduced right away, reducing setup and compliance risk.
- **Structured knowledge**: the clean category layout makes it quick to build a technology map from basic pretraining through to robotics applications, strengthening the logical depth of the survey.

Awesome-Multimodal-Research compresses weeks of literature review into hours, letting researchers focus on their core contribution instead of getting lost in a sea of information.

Maintainer: [Eurus-Holmes](https://github.com/Eurus-Holmes) (Feiyang (Vance) Chen)
License: MIT
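Once you have a local clone, the category files are plain Markdown, so ordinary text tools are enough to mine them. Below is a minimal sketch (not part of the project itself) that pulls arXiv links out of a category file with `grep`; it uses a made-up `sample_category.md` as a stand-in for a real file under `papers/`, and assumes GNU or BSD grep:

```shell
# Create a stand-in for a category file (real files live under papers/).
cat > sample_category.md <<'EOF'
* [Paper A](https://arxiv.org/abs/2303.00001) - CVPR 2023
* [Paper B](https://example.com/paper.pdf) - blog post
* [Paper C](https://arxiv.org/abs/2304.12345) - NeurIPS 2023
EOF

# Extract just the arXiv URLs, one per line, e.g. for batch download or BibTeX lookup.
grep -oE 'https://arxiv\.org/abs/[0-9]{4}\.[0-9]{4,5}' sample_category.md
# prints:
#   https://arxiv.org/abs/2303.00001
#   https://arxiv.org/abs/2304.12345
```

The same one-liner applied to `papers/Core-Areas/*/*.md` across the whole clone gives a quick corpus of identifiers for a reading list.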