[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Cadene--vqa.pytorch":3,"tool-Cadene--vqa.pytorch":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":80,"owner_email":78,"owner_twitter":81,"owner_website":82,"owner_url":83,"languages":84,"stars":105,"forks":106,"last_commit_at":107,"license":78,"difficulty_score":108,"env_os":109,"env_gpu":110,"env_ram":111,"env_deps":112,"category_tags":122,"github_topics":123,"view_count":23,"oss_zip_url":78,"oss_zip_packed_at":78,"status":16,"created_at":132,"updated_at":133,"faqs":134,"releases":165},3859,"Cadene\u002Fvqa.pytorch","vqa.pytorch","Visual Question Answering in Pytorch","vqa.pytorch 是一个基于 PyTorch 框架开发的开源项目，专注于解决“视觉问答”（VQA）这一前沿人工智能任务。简单来说，它能让计算机像人一样“看懂”图片并回答相关问题：输入一张图片和一个自然语言问题（例如“图里有几只猫？”），模型便能输出简短准确的文字答案。\n\n该项目旨在降低复现顶尖研究成果的门槛，并为社区提供一个高效、模块化的代码库，以推动多模态数据集上的进一步研究。其核心亮点在于实现了名为 MUTAN（多模态 Tucker 融合）的先进算法，该方法在 VQA 1.0 数据集上曾达到业界领先的性能水平。代码架构灵活，支持替换不同的图像编码器（如 ResNet）、问题处理模型（如 LSTM）以及多种融合策略，方便研究者进行定制化实验。\n\nvqa.pytorch 主要面向 AI 研究人员、深度学习开发者以及相关领域的学生。如果你希望深入探索计算机视觉与自然语言处理的交叉领域，或者需要在一个成熟的基准上训练和评估自己的多模态模型，这个项目将提供从特征提取、模型训练到结果评估的全流程支持，是进入该研究领域的理想起点。","# Visual Question Answering in pytorch\n\n**\u002F!\\ New version of pytorch for VQA available here:** https:\u002F\u002Fgithub.com\u002FCadene\u002Fblock.bootstrap.pytorch\n\nThis repo was made by [Remi 
Cadene](http:\u002F\u002Fremicadene.com) (LIP6) and [Hedi Ben-Younes](https:\u002F\u002Ftwitter.com\u002Flabegne) (LIP6-Heuritech), two PhD students working on VQA at [UPMC-LIP6](http:\u002F\u002Flip6.fr) and their professors [Matthieu Cord](http:\u002F\u002Fwebia.lip6.fr\u002F~cord) (LIP6) and [Nicolas Thome](http:\u002F\u002Fwebia.lip6.fr\u002F~thomen) (LIP6-CNAM). We developed this code as part of a research paper called [MUTAN: Multimodal Tucker Fusion for VQA](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.06676) which is (as far as we know) the current state-of-the-art on the [VQA 1.0 dataset](http:\u002F\u002Fvisualqa.org).\n\nThe goal of this repo is twofold:\n- to make it easier to reproduce our results,\n- to provide an efficient and modular code base to the community for further research on other VQA datasets.\n\nIf you have any questions about our code or model, don't hesitate to contact us or to submit an issue. Pull requests are welcome!\n\n#### News:\n\n- 16th January 2018: a pretrained vqa2 model and web demo\n- 18th July 2017: VQA2, VisualGenome, FBResnet152 (for pytorch) added [v2.0 commit msg](https:\u002F\u002Fgithub.com\u002FCadene\u002Fvqa.pytorch\u002Fcommit\u002F42391fd4a39c31e539eb6cb73ecd370bac0f010a)\n- 16th July 2017: paper accepted at ICCV2017\n- 30th May 2017: poster accepted at CVPR2017 (VQA Workshop)\n\n#### Summary:\n\n* [Introduction](#introduction)\n    * [What is the task about?](#what-is-the-task-about)\n    * [Quick insight about our method](#quick-insight-about-our-method)\n* [Installation](#installation)\n    * [Requirements](#requirements)\n    * [Submodules](#submodules)\n    * [Data](#data)\n* [Reproducing results on VQA 1.0](#reproducing-results-on-vqa-10)\n    * [Features](#features)\n    * [Pretrained models](#pretrained-models)\n* [Reproducing results on VQA 2.0](#reproducing-results-on-vqa-20)\n    * [Features](#features-20)\n    * [Pretrained models](#pretrained-models-20)\n* [Documentation](#documentation)\n 
   * [Architecture](#architecture)\n    * [Options](#options)\n    * [Datasets](#datasets)\n    * [Models](#models)\n* [Quick examples](#quick-examples)\n    * [Extract features from COCO](#extract-features-from-coco)\n    * [Extract features from VisualGenome](#extract-features-from-visualgenome)\n    * [Train models on VQA 1.0](#train-models-on-vqa-10)\n    * [Train models on VQA 2.0](#train-models-on-vqa-20)\n    * [Train models on VQA + VisualGenome](#train-models-on-vqa-10-or-20--visualgenome)\n    * [Monitor training](#monitor-training)\n    * [Restart training](#restart-training)\n    * [Evaluate models on VQA](#evaluate-models-on-vqa)\n    * [Web demo](#web-demo)\n* [Citation](#citation)\n* [Acknowledgment](#acknowledgment)\n\n## Introduction\n\n### What is the task about?\n\nThe task is about training models in an end-to-end fashion on a multimodal dataset made of triplets:\n\n- an **image** with no other information than the raw pixels,\n- a **question** about the visual content of the associated image,\n- a short **answer** to the question (one or a few words). \n\nThe illustration below shows two different triplets of the VQA dataset that share the same image. The models need to learn rich multimodal representations to be able to give the right answers.\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCadene_vqa.pytorch_readme_722eb5f1e2e0.png\" width=\"600\"\u002F>\n\u003C\u002Fp>\n\nThe VQA task is still an active research topic. 
However, once it is solved, it could be very useful for improving human-to-machine interfaces (especially for blind users).\n\n### Quick insight about our method\n\nThe VQA community developed an approach based on four learnable components:\n\n- a question model which can be an LSTM, GRU, or pretrained Skipthoughts,\n- an image model which can be a pretrained VGG16 or ResNet-152,\n- a fusion scheme which can be an element-wise sum, concatenation, [MCB](https:\u002F\u002Farxiv.org\u002Fabs\u002F1606.01847), [MLB](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.04325), or [Mutan](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.06676),\n- optionally, an attention scheme which may have several \"glimpses\".\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCadene_vqa.pytorch_readme_f875c5c25b6a.png\" width=\"400\"\u002F>\n\u003C\u002Fp>\n\nOne of our claims is that the multimodal fusion between the image and the question representations is a critical component. Thus, our proposed model uses a Tucker Decomposition of the correlation Tensor to model richer multimodal interactions in order to provide proper answers. Our best model is based on:\n\n- a pretrained Skipthoughts for the question model,\n- features from a pretrained Resnet-152 (with images of size 3x448x448) for the image model,\n- our proposed Mutan (based on a Tucker Decomposition) for the fusion scheme,\n- an attention scheme with two \"glimpses\".\n\n## Installation\n\n### Requirements\n\nFirst install python 3 (we don't provide support for python 2). 
We advise you to install python 3 and pytorch with Anaconda:\n\n- [python with anaconda](https:\u002F\u002Fwww.continuum.io\u002Fdownloads)\n- [pytorch with CUDA](http:\u002F\u002Fpytorch.org)\n\n```\nconda create --name vqa python=3\nsource activate vqa\nconda install pytorch torchvision cuda80 -c soumith\n```\n\nThen clone the repo (with the `--recursive` flag for submodules) and install the complementary requirements:\n\n```\ncd $HOME\ngit clone --recursive https:\u002F\u002Fgithub.com\u002FCadene\u002Fvqa.pytorch.git \ncd vqa.pytorch\npip install -r requirements.txt\n```\n\n### Submodules\n\nOur code has three external dependencies:\n\n- [VQA](https:\u002F\u002Fgithub.com\u002FCadene\u002FVQA) is used to evaluate results files on the valset with the OpenEnded accuracy,\n- [skip-thoughts.torch](https:\u002F\u002Fgithub.com\u002FCadene\u002Fskip-thoughts.torch) is used to import pretrained GRUs and embeddings,\n- [pretrained-models.pytorch](https:\u002F\u002Fgithub.com\u002FCadene\u002Fpretrained-models.pytorch) is used to load pretrained convnets.\n\n### Data\n\nData will be automatically downloaded and preprocessed when needed. Links to data are stored in `vqa\u002Fdatasets\u002Fvqa.py`, `vqa\u002Fdatasets\u002Fcoco.py` and `vqa\u002Fdatasets\u002Fvgenome.py`.\n\n\n## Reproducing results on VQA 1.0\n\n### Features\n\nAs we first developed on Lua\u002FTorch7, we used the features of [ResNet-152 pretrained with Torch7](https:\u002F\u002Fgithub.com\u002Ffacebook\u002Ffb.resnet.torch). We ported the pretrained resnet152 trained with Torch7 to pytorch in the v2.0 release. We will provide all the extracted features soon. 
Meanwhile, you can download the coco features as follows:\n\n```\nmkdir -p data\u002Fcoco\u002Fextract\u002Farch,fbresnet152torch\ncd data\u002Fcoco\u002Fextract\u002Farch,fbresnet152torch\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Ftrainset.hdf5\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Ftrainset.txt\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Fvalset.hdf5\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Fvalset.txt\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Ftestset.hdf5\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Ftestset.txt\n```\n\n\u002F!\\ There are currently 3 versions of ResNet152:\n\n- fbresnet152torch which is the torch7 model,\n- fbresnet152 which is the pytorch port of the torch7 model,\n- [resnet152](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fvision\u002Fblob\u002Fmaster\u002Ftorchvision\u002Fmodels\u002Fresnet.py) which is the pretrained model from torchvision (we obtained lower results with it).\n\n### Pretrained VQA models\n\nWe currently provide three models trained with our old Torch7 code and ported to Pytorch:\n\n- MutanNoAtt trained on the VQA 1.0 trainset,\n- MLBAtt trained on the VQA 1.0 trainvalset and VisualGenome,\n- MutanAtt trained on the VQA 1.0 trainvalset and VisualGenome.\n\n```\nmkdir -p logs\u002Fvqa\ncd logs\u002Fvqa\nwget http:\u002F\u002Fwebia.lip6.fr\u002F~cadene\u002FDownloads\u002Fvqa.pytorch\u002Flogs\u002Fvqa\u002Fmutan_noatt_train.zip \nwget http:\u002F\u002Fwebia.lip6.fr\u002F~cadene\u002FDownloads\u002Fvqa.pytorch\u002Flogs\u002Fvqa\u002Fmlb_att_trainval.zip \nwget http:\u002F\u002Fwebia.lip6.fr\u002F~cadene\u002FDownloads\u002Fvqa.pytorch\u002Flogs\u002Fvqa\u002Fmutan_att_trainval.zip \n```\n\nAlthough we provide the results files associated with our pretrained models, you can evaluate them once again on the valset, testset and testdevset with one command per model:\n\n```\npython train.py -e 
--path_opt options\u002Fvqa\u002Fmlb_noatt_trainval.yaml --resume ckpt\npython train.py -e --path_opt options\u002Fvqa\u002Fmutan_att_trainval.yaml --resume ckpt\n```\n\nTo obtain test and testdev results on VQA 1.0, you will need to zip your result json file (name it `results.zip`) and submit it to the [evaluation server](https:\u002F\u002Fcompetitions.codalab.org\u002Fcompetitions\u002F6961).\n\n\n## Reproducing results on VQA 2.0\n\n### Features 2.0\n\nYou must download the coco dataset (and visual genome if needed) and then extract the features with a convolutional neural network.\n\n### Pretrained VQA models 2.0\n\nWe currently provide two models trained with our current pytorch code on VQA 2.0:\n\n- MutanAtt trained on the trainset with the fbresnet152 features,\n- MutanAtt trained on the trainvalset with the fbresnet152 features.\n\n```\ncd $VQAPYTORCH\nmkdir -p logs\u002Fvqa2\ncd logs\u002Fvqa2\nwget http:\u002F\u002Fdata.lip6.fr\u002Fcadene\u002Fvqa.pytorch\u002Fvqa2\u002Fmutan_att_train.zip \nwget http:\u002F\u002Fdata.lip6.fr\u002Fcadene\u002Fvqa.pytorch\u002Fvqa2\u002Fmutan_att_trainval.zip \n```\n\n## Documentation\n\n### Architecture\n\n```\n.\n├── options        # default options dir containing yaml files\n├── logs           # experiments dir containing directories of logs (one per experiment)\n├── data           # datasets directories\n|   ├── coco       # images and features\n|   ├── vqa        # raw, interim and processed data\n|   ├── vgenome    # raw, interim, processed data + images and features\n|   └── ...\n├── vqa            # vqa package dir\n|   ├── datasets   # datasets classes & functions dir (vqa, coco, vgenome, images, features, etc.)\n|   ├── external   # submodules dir (VQA, skip-thoughts.torch, pretrained-models.pytorch)\n|   ├── lib        # misc classes & func dir (engine, logger, dataloader, etc.)\n|   └── models     # models classes & func dir (att, fusion, notatt, seq2vec, convnets)\n|\n├── train.py       # train & eval 
models\n├── eval_res.py    # eval results files with OpenEnded metric\n├── extract.py     # extract features from coco with CNNs\n└── visu.py        # visualize logs and monitor training\n```\n\n### Options\n\nThere are three kinds of options:\n\n- options from the yaml options files stored in the `options` directory which are used as default (path to directory, logs, model, features, etc.)\n- options from the ArgumentParser in the `train.py` file which are set to None and can overwrite default options (learning rate, batch size, etc.)\n- options from the ArgumentParser in the `train.py` file which are set to default values (print frequency, number of threads, resume model, evaluate model, etc.)\n\nYou can easily add new options in your custom yaml file if needed. Also, if you want to grid search a parameter, you can add an ArgumentParser option and modify the dictionary in `train.py:L80`.\n\n### Datasets\n\nWe currently provide four datasets:\n\n- [COCOImages](http:\u002F\u002Fmscoco.org\u002F) currently used to extract features, it comes with three datasets: trainset, valset and testset\n- [VisualGenomeImages]() currently used to extract features, it comes with one split: trainset\n- [VQA 1.0](http:\u002F\u002Fwww.visualqa.org\u002Fvqa_v1_download.html) comes with four datasets: trainset, valset, testset (including test-std and test-dev) and \"trainvalset\" (concatenation of trainset and valset)\n- [VQA 2.0](http:\u002F\u002Fwww.visualqa.org) same splits but twice as large (with the same images as VQA 1.0)\n\nWe plan to add:\n\n- [CLEVR](http:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fjcjohns\u002Fclevr\u002F)\n\n### Models\n\nWe currently provide four models:\n\n- MLBNoAtt: a strong baseline (BayesianGRU + Element-wise product)\n- [MLBAtt](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.04325): the previous state-of-the-art which adds an attention strategy\n- MutanNoAtt: our proof of concept (BayesianGRU + Mutan Fusion)\n- MutanAtt: the current state-of-the-art\n\nWe 
plan to add several other strategies in the future.\n\n## Quick examples\n\n### Extract features from COCO\n\nThe needed images will be automatically downloaded to `dir_data` and the features will be extracted with a resnet152 by default.\n\nThere are three options for `mode`:\n\n- `att`: features will be of size 2048x14x14,\n- `noatt`: features will be of size 2048,\n- `both`: default option.\n\nBeware, you will need some space on your SSD:\n\n- 32GB for the images,\n- 125GB for the train features,\n- 123GB for the test features,\n- 61GB for the val features.\n\n```\npython extract.py -h\npython extract.py --dir_data data\u002Fcoco --data_split train\npython extract.py --dir_data data\u002Fcoco --data_split val\npython extract.py --dir_data data\u002Fcoco --data_split test\n```\n\nNote: By default our code will share computations over all available GPUs. If you want to select only one or a few, use the following prefix:\n\n```\nCUDA_VISIBLE_DEVICES=0 python extract.py\nCUDA_VISIBLE_DEVICES=1,2 python extract.py\n```\n\n### Extract features from VisualGenome\n\nSame here, but only train is available:\n\n```\npython extract.py --dataset vgenome --dir_data data\u002Fvgenome --data_split train\n```\n\n\n### Train models on VQA 1.0\n\nDisplay the help message, display the selected options, and run with the default options. 
The needed data will be automatically downloaded and processed using the options in `options\u002Fvqa\u002Fdefault.yaml`.\n\n```\npython train.py -h\npython train.py --help_opt\npython train.py\n``` \n\nRun a MutanNoAtt model with default options.\n\n```\npython train.py --path_opt options\u002Fvqa\u002Fmutan_noatt_train.yaml --dir_logs logs\u002Fvqa\u002Fmutan_noatt_train\n```\n\nRun a MutanAtt model on the trainset and evaluate on the valset after each epoch.\n\n```\npython train.py --vqa_trainsplit train --path_opt options\u002Fvqa\u002Fmutan_att_trainval.yaml \n``` \n\nRun a MutanAtt model on the trainset and valset (by default) and run through the testset after each epoch (producing a results file that you can submit to the evaluation server).\n\n```\npython train.py --vqa_trainsplit trainval --path_opt options\u002Fvqa\u002Fmutan_att_trainval.yaml\n``` \n\n### Train models on VQA 2.0\n\nSee options of [vqa2\u002Fmutan_att_trainval](https:\u002F\u002Fgithub.com\u002FCadene\u002Fvqa.pytorch\u002Fblob\u002Fmaster\u002Foptions\u002Fvqa2\u002Fmutan_att_trainval.yaml):\n\n```\npython train.py --path_opt options\u002Fvqa2\u002Fmutan_att_trainval.yaml\n``` \n\n### Train models on VQA (1.0 or 2.0) + VisualGenome\n\nSee options of [vqa2\u002Fmutan_att_trainval_vg](https:\u002F\u002Fgithub.com\u002FCadene\u002Fvqa.pytorch\u002Fblob\u002Fmaster\u002Foptions\u002Fvqa2\u002Fmutan_att_trainval_vg.yaml):\n\n```\npython train.py --path_opt options\u002Fvqa2\u002Fmutan_att_trainval_vg.yaml\n``` \n\n### Monitor training\n\nCreate a visualization of an experiment using `plotly` to monitor the training, just like the picture below (**click the image to access the html\u002Fjs file**):\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Frawgit.com\u002FCadene\u002Fvqa.pytorch\u002Fmaster\u002Fdoc\u002Fmutan_noatt.html\">\n        \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCadene_vqa.pytorch_readme_321cb83d4bc1.png\" width=\"600\"\u002F>\n    
\u003C\u002Fa>\n\u003C\u002Fp>\n\nNote that you have to wait until the first OpenEnded accuracy has been computed; the html file is then created and opens in your default browser. The html file is regenerated every 60 seconds. However, you currently need to press F5 in your browser to see the changes.\n\n```\npython visu.py --dir_logs logs\u002Fvqa\u002Fmutan_noatt\n```\n\nCreate a visualization of multiple experiments to compare or monitor them, like the picture below (**click the image to access the html\u002Fjs file**):\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Frawgit.com\u002FCadene\u002Fvqa.pytorch\u002Fmaster\u002Fdoc\u002Fmutan_noatt_vs_att.html\">\n        \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCadene_vqa.pytorch_readme_1a1e11858e98.png\" width=\"600\"\u002F>\n    \u003C\u002Fa>\n\u003C\u002Fp>\n\n```\npython visu.py --dir_logs logs\u002Fvqa\u002Fmutan_noatt,logs\u002Fvqa\u002Fmutan_att\n```\n\n\n\n### Restart training\n\nRestart the model from the last checkpoint.\n\n```\npython train.py --path_opt options\u002Fvqa\u002Fmutan_noatt.yaml --dir_logs logs\u002Fvqa\u002Fmutan_noatt --resume ckpt\n```\n\nRestart the model from the best checkpoint.\n\n```\npython train.py --path_opt options\u002Fvqa\u002Fmutan_noatt.yaml --dir_logs logs\u002Fvqa\u002Fmutan_noatt --resume best\n```\n\n### Evaluate models on VQA\n\nEvaluate the model from the best checkpoint. If your model has been trained on the training set only (`vqa_trainsplit=train`), the model will be evaluated on the valset and will run through the testset. 
If it was trained on the trainset + valset (`vqa_trainsplit=trainval`), it will not be evaluate on the valset.\n\n```\npython train.py --vqa_trainsplit train --path_opt options\u002Fvqa\u002Fmutan_att.yaml --dir_logs logs\u002Fvqa\u002Fmutan_att --resume best -e\n```\n\n### Web demo\n\nYou must set your local ip address and port in `demo_server.py`  line 169 and your global ip address and port in `demo_web\u002Fjs\u002Fcustom.js` line 51.\nThe port associated to the global ip address must redirect to your local ip address.\n\nLaunch your API:\n```\nCUDA_VISIBLE_DEVICES=0 python demo_server.py\n```\n\nOpen `demo_web\u002Findex.html` on your browser to access the API with a human interface.\n\n## Citation\n\nPlease cite the arXiv paper if you use Mutan in your work:\n\n```\n@article{benyounescadene2017mutan,\n  author = {Hedi Ben-Younes and \n    R{\\'{e}}mi Cad{\\`{e}}ne and\n    Nicolas Thome and\n    Matthieu Cord},\n  title = {MUTAN: Multimodal Tucker Fusion for Visual Question Answering},\n  journal = {ICCV},\n  year = {2017},\n  url = {http:\u002F\u002Farxiv.org\u002Fabs\u002F1705.06676}\n}\n```\n\n## Acknowledgment\n\nSpecial thanks to the authors of [MLB](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.04325) for providing some [Torch7 code](https:\u002F\u002Fgithub.com\u002Fjnhwkim\u002FMulLowBiVQA), [MCB](https:\u002F\u002Farxiv.org\u002Fabs\u002F1606.01847) for providing some [Caffe code](https:\u002F\u002Fgithub.com\u002Fakirafukui\u002Fvqa-mcb), and our professors and friends from LIP6 for the perfect working atmosphere.\n","# PyTorch 中的视觉问答\n\n**\u002F!\\ VQA 的新版本 PyTorch 代码在此：** https:\u002F\u002Fgithub.com\u002FCadene\u002Fblock.bootstrap.pytorch\n\n本仓库由 [Remi Cadene](http:\u002F\u002Fremicadene.com)（LIP6）和 [Hedi Ben-Younes](https:\u002F\u002Ftwitter.com\u002Flabegne)（LIP6-Heuritech）两位在 [UPMC-LIP6](http:\u002F\u002Flip6.fr) 从事 VQA 研究的博士生，以及他们的导师 [Matthieu Cord](http:\u002F\u002Fwebia.lip6.fr\u002F~cord)（LIP6）和 [Nicolas 
Thome](http:\u002F\u002Fwebia.lip6.fr\u002F~thomen)（LIP6-CNAM）共同开发。我们是在一篇名为 [MUTAN: Multimodal Tucker Fusion for VQA](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.06676) 的研究论文框架下开发了这段代码，该论文目前（据我们所知）是 [VQA 1.0 数据集](http:\u002F\u002Fvisualqa.org) 上的最新最先进方法。\n\n本仓库的目标有两个：\n- 方便他人复现我们的实验结果；\n- 为社区提供一个高效且模块化的代码库，以支持在其他 VQA 数据集上的进一步研究。\n\n如果您对我们的代码或模型有任何疑问，请随时与我们联系或提交问题。欢迎提出 Pull 请求！\n\n#### 最新动态：\n\n- 2018年1月16日：预训练的 vqa2 模型及在线演示\n- 2017年7月18日：新增 VQA2、VisualGenome 和 FBResnet152（适用于 PyTorch）[v2.0 提交信息](https:\u002F\u002Fgithub.com\u002FCadene\u002Fvqa.pytorch\u002Fcommit\u002F42391fd4a39c31e539eb6cb73ecd370bac0f010a)\n- 2017年7月16日：论文被 ICCV2017 接受\n- 2017年5月30日：海报被 CVPR2017（VQA Workshop）接受\n\n#### 目录：\n\n* [简介](#introduction)\n    * [任务是什么？](#what-is-the-task-about)\n    * [我们方法的简要介绍](#quick-insight-about-our-method)\n* [安装](#installation)\n    * [依赖项](#requirements)\n    * [子模块](#submodules)\n    * [数据](#data)\n* [复现 VQA 1.0 上的结果](#reproducing-results-on-vqa-10)\n    * [特征提取](#features)\n    * [预训练模型](#pretrained-models)\n* [复现 VQA 2.0 上的结果](#reproducing-results-on-vqa-20)\n    * [特征提取](#features-20)\n    * [预训练模型](#pretrained-models-20)\n* [文档](#documentation)\n    * [架构](#architecture)\n    * [选项](#options)\n    * [数据集](#datasets)\n    * [模型](#models)\n* [快速示例](#quick-examples)\n    * [从 COCO 数据集中提取特征](#extract-features-from-coco)\n    * [从 VisualGenome 数据集中提取特征](#extract-features-from-visualgenome)\n    * [在 VQA 1.0 上训练模型](#train-models-on-vqa-10)\n    * [在 VQA 2.0 上训练模型](#train-models-on-vqa-20)\n    * [在 VQA + VisualGenome 上训练模型](#train-models-on-vqa-10-or-20--visualgenome)\n    * [监控训练过程](#monitor-training)\n    * [重启训练](#restart-training)\n    * [在 VQA 上评估模型](#evaluate-models-on-vqa)\n    * [在线演示](#web-demo)\n* [引用](#citation)\n* [致谢](#acknowledgment)\n\n## 简介\n\n### 任务是什么？\n\n该任务是在一个多模态数据集上进行端到端的模型训练，数据集由三元组组成：\n\n- 一张仅包含原始像素信息的 **图像**，\n- 一张关于该图像中视觉内容的 **问题**，\n- 一个简短的 **答案**（一两个词）。\n\n如下图所示，展示了 VQA 数据集中两个不同的三元组（但使用同一张图像）。模型需要学习丰富的多模态表示，才能给出正确的答案。\n\n\u003Cp 
align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCadene_vqa.pytorch_readme_722eb5f1e2e0.png\" width=\"600\"\u002F>\n\u003C\u002Fp>\n\nVQA 任务目前仍在积极研究中。然而，一旦这一任务得以解决，它将极大地改善人机交互界面，尤其是对视障人士而言。\n\n### 我们方法的简要介绍\n\nVQA 社区已经发展出一种基于四个可学习组件的方法：\n\n- 问题模型，可以是 LSTM、GRU 或预训练的 Skipthoughts；\n- 图像模型，可以是预训练的 VGG16 或 ResNet-152；\n- 融合方案，可以是逐元素相加、拼接、[MCB](https:\u002F\u002Farxiv.org\u002Fabs\u002F1606.01847)、[MLB](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.04325) 或 [Mutan](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.06676)；\n- 可选的注意力机制，可能包含多个“视野”。\n\n\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCadene_vqa.pytorch_readme_f875c5c25b6a.png\" width=\"400\"\u002F>\n\u003C\u002Fp>\n\n我们认为，图像和问题表示之间的多模态融合是关键组件之一。因此，我们提出的模型利用相关性张量的 Tucker 分解来建模更丰富的多模态交互，从而生成准确的答案。我们最好的模型基于以下配置：\n\n- 预训练的 Skipthoughts 作为问题模型；\n- 使用预训练 ResNet-152 提取特征（输入图像尺寸为 3x448x448）作为图像模型；\n- 我们的 Mutan 融合方案（基于 Tucker 分解）；\n- 带有两个“视野”的注意力机制。\n\n## 安装\n\n### 依赖项\n\n首先安装 Python 3（我们不支持 Python 2）。建议您使用 Anaconda 安装 Python 3 和 PyTorch：\n\n- [Anaconda 版本的 Python](https:\u002F\u002Fwww.continuum.io\u002Fdownloads)\n- [带有 CUDA 的 PyTorch](http:\u002F\u002Fpytorch.org)\n\n```\nconda create --name vqa python=3\nsource activate vqa\nconda install pytorch torchvision cuda80 -c soumith\n```\n\n然后克隆本仓库（使用 `--recursive` 标志以获取子模块），并安装其他依赖项：\n\n```\ncd $HOME\ngit clone --recursive https:\u002F\u002Fgithub.com\u002FCadene\u002Fvqa.pytorch.git \ncd vqa.pytorch\npip install -r requirements.txt\n```\n\n### 子模块\n\n我们的代码有两个外部依赖：\n\n- [VQA](https:\u002F\u002Fgithub.com\u002FCadene\u002FVQA) 用于在验证集上使用 OpendEnded 准确率评估结果文件；\n- [skip-thoughts.torch](https:\u002F\u002Fgithub.com\u002FCadene\u002Fskip-thoughts.torch) 用于导入预训练的 GRU 和嵌入；\n- [pretrained-models.pytorch](https:\u002F\u002Fgithub.com\u002FCadene\u002Fpretrained-models.pytorch) 用于加载预训练的卷积神经网络。\n\n### 数据\n\n数据将在需要时自动下载并预处理。数据链接存储在 
`vqa\u002Fdatasets\u002Fvqa.py`、`vqa\u002Fdatasets\u002Fcoco.py` 和 `vqa\u002Fdatasets\u002Fvgenome.py` 文件中。\n\n\n## 复现 VQA 1.0 上的结果\n\n### 特征\n\n最初我们在 Lua\u002FTorch7 上进行开发时，使用了 [Torch7 预训练的 ResNet-152](https:\u002F\u002Fgithub.com\u002Ffacebook\u002Ffb.resnet.torch) 的特征。在 v2.0 版本中，我们将用 Torch7 训练的预训练 ResNet152 模型移植到了 PyTorch。我们很快会提供所有提取的特征。与此同时，您可以按照以下方式下载 COCO 数据集的特征：\n\n```\nmkdir -p data\u002Fcoco\u002Fextract\u002Farch,fbresnet152torch\ncd data\u002Fcoco\u002Fextract\u002Farch,fbresnet152torch\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Ftrainset.hdf5\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Ftrainset.txt\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Fvalset.hdf5\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Fvalset.txt\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Ftestset.hdf5\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Ftestset.txt\n```\n\n\u002F!\\ 目前有 3 种版本的 ResNet152：\n\n- fbresnet152torch 是 Torch7 模型；\n- fbresnet152 是 Torch7 模型在 PyTorch 中的移植版本；\n- [resnet152](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fvision\u002Fblob\u002Fmaster\u002Ftorchvision\u002Fmodels\u002Fresnet.py) 是 torchvision 中的预训练模型（我们使用该模型得到的结果较低）。\n\n### 预训练 VQA 模型\n\n目前我们提供了三个使用旧的 Torch7 代码训练并移植到 PyTorch 的模型：\n\n- MutanNoAtt：在 VQA 1.0 训练集上训练；\n- MLBAtt：在 VQA 1.0 训练验证集和 VisualGenome 数据集上训练；\n- MutanAtt：在 VQA 1.0 训练验证集和 VisualGenome 数据集上训练。\n\n```\nmkdir -p logs\u002Fvqa\ncd logs\u002Fvqa\nwget http:\u002F\u002Fwebia.lip6.fr\u002F~cadene\u002FDownloads\u002Fvqa.pytorch\u002Flogs\u002Fvqa\u002Fmutan_noatt_train.zip \nwget http:\u002F\u002Fwebia.lip6.fr\u002F~cadene\u002FDownloads\u002Fvqa.pytorch\u002Flogs\u002Fvqa\u002Fmlb_att_trainval.zip \nwget http:\u002F\u002Fwebia.lip6.fr\u002F~cadene\u002FDownloads\u002Fvqa.pytorch\u002Flogs\u002Fvqa\u002Fmutan_att_trainval.zip \n```\n\n尽管我们提供了与预训练模型相关的结果文件，您仍然可以使用一条命令在验证集、测试集和测试开发集上再次评估这些模型：\n\n```\npython train.py -e --path_opt options\u002Fvqa\u002Fmutan_noatt_train.yaml --resume ckpt\npython train.py -e 
--path_opt options\u002Fvqa\u002Fmlb_noatt_trainval.yaml --resume ckpt\npython train.py -e --path_opt options\u002Fvqa\u002Fmutan_att_trainval.yaml --resume ckpt\n```\n\n要获得 VQA 1.0 的测试和测试开发集结果，您需要将结果 JSON 文件打包成 `results.zip`，并提交到 [评估服务器](https:\u002F\u002Fcompetitions.codalab.org\u002Fcompetitions\u002F6961)。\n\n## 在 VQA 2.0 上复现结果\n\n### 特征 2.0\n\n您必须下载 COCO 数据集（如果需要，还要下载 Visual Genome 数据集），然后使用卷积神经网络提取特征。\n\n### 预训练 VQA 模型 2.0\n\n目前我们提供了三个使用当前 PyTorch 代码在 VQA 2.0 上训练的模型：\n\n- MutanAtt：使用 fbresnet152 特征在训练集上训练；\n- MutanAtt：使用 fbresnet152 特征在训练验证集上训练。\n\n```\ncd $VQAPYTORCH\nmkdir -p logs\u002Fvqa2\ncd logs\u002Fvqa2\nwget http:\u002F\u002Fdata.lip6.fr\u002Fcadene\u002Fvqa.pytorch\u002Fvqa2\u002Fmutan_att_train.zip \nwget http:\u002F\u002Fdata.lip6.fr\u002Fcadene\u002Fvqa.pytorch\u002Fvqa2\u002Fmutan_att_trainval.zip \n```\n\n## 文档\n\n### 架构\n\n```\n.\n├── options        # 默认选项目录，包含 YAML 文件\n├── logs           # 实验目录，每个实验对应一个日志目录\n├── data           # 数据集目录\n|   ├── coco       # 图像和特征\n|   ├── vqa        # 原始、中间和处理后的数据\n|   ├── vgenome    # 原始、中间、处理后的数据 + 图像和特征\n|   └── ...\n├── vqa            # VQA 包目录\n|   ├── datasets   # 数据集类和函数目录（VQA、COCO、Visual Genome、图像、特征等）\n|   ├── external   # 子模块目录（VQA、skip-thoughts.torch、pretrained-models.pytorch）\n|   ├── lib        # 杂项类和函数目录（引擎、日志记录器、数据加载器等）\n|   └── models     # 模型类和函数目录（注意力、融合、无注意力、序列到向量、卷积网络等）\n|\n├── train.py       # 模型训练和评估\n├── eval_res.py    # 使用 OpenEnded 指标评估结果文件\n├── extract.py     # 使用 CNN 从 COCO 提取特征\n└── visu.py        # 可视化日志并监控训练过程\n```\n\n### 选项\n\n有三种类型的选项：\n\n- `options` 目录中存储的 YAML 选项文件中的默认选项（路径、日志、模型、特征等）；\n- `train.py` 文件中 ArgumentParser 设置为 None 的选项，可覆盖默认选项（学习率、批量大小等）；\n- `train.py` 文件中 ArgumentParser 设置为默认值的选项（打印频率、线程数、恢复模型、评估模型等）。\n\n如有需要，您可以轻松地在自定义 YAML 文件中添加新选项。此外，如果您想对某个参数进行网格搜索，可以添加一个 ArgumentParser 选项，并修改 `train.py:L80` 中的字典。\n\n### 数据集\n\n目前我们提供了四个数据集：\n\n- [COCOImages](http:\u002F\u002Fmscoco.org\u002F) 目前用于提取特征，包含三个数据集：训练集、验证集和测试集；\n- [VisualGenomeImages]() 目前用于提取特征，仅有一个划分：训练集；\n- [VQA 
1.0](http:\u002F\u002Fwww.visualqa.org\u002Fvqa_v1_download.html) 包含四个数据集：训练集、验证集、测试集（包括标准测试和开发测试）以及“训练验证集”（训练集和验证集的合并）；\n- [VQA 2.0](http:\u002F\u002Fwww.visualqa.org) 与之类似，但规模是其两倍（不过图像与 VQA 1.0 相同）；\n\n我们计划增加：\n\n- [CLEVR](http:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fjcjohns\u002Fclevr\u002F)\n\n### 模型\n\n目前我们提供了四个模型：\n\n- MLBNoAtt：一个强大的基线模型（BayesianGRU + 元素级乘法）；\n- [MLBAtt](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.04325)：之前的最先进模型，增加了注意力机制；\n- MutanNoAtt：我们的概念验证模型（BayesianGRU + Mutan 融合）；\n- MutanAtt：当前最先进的模型；\n\n我们计划在未来添加更多策略。\n\n## 快速示例\n\n### 从 COCO 提取特征\n\n所需的图像将自动下载到 `dir_data` 目录，特征将默认使用 ResNet152 提取。\n\n`mode` 有三种选项：\n\n- `att`：特征尺寸为 2048x14x14；\n- `noatt`：特征尺寸为 2048；\n- `both`：默认选项。\n\n请注意，您的 SSD 需要足够的空间：\n\n- 图像占用 32GB；\n- 训练集特征占用 125GB；\n- 测试集特征占用 123GB；\n- 验证集特征占用 61GB。\n\n```\npython extract.py -h\npython extract.py --dir_data data\u002Fcoco --data_split train\npython extract.py --dir_data data\u002Fcoco --data_split val\npython extract.py --dir_data data\u002Fcoco --data_split test\n```\n\n注意：默认情况下，我们的代码会在所有可用的 GPU 上共享计算。如果您只想选择一个或几个 GPU，请使用以下前缀：\n\n```\nCUDA_VISIBLE_DEVICES=0 python extract.py\nCUDA_VISIBLE_DEVICES=1,2 python extract.py\n```\n\n### 从 VisualGenome 提取特征\n\n同样，但只有训练集可用：\n\n```\npython extract.py --dataset vgenome --dir_data data\u002Fvgenome --data_split train\n```\n\n### 在 VQA 1.0 上训练模型\n\n显示帮助信息、所选选项并运行默认设置。所需数据将根据 `options\u002Fvqa\u002Fdefault.yaml` 中的选项自动下载并处理。\n\n```\npython train.py -h\npython train.py --help_opt\npython train.py\n```\n\n使用默认选项运行 MutanNoAtt 模型。\n\n```\npython train.py --path_opt options\u002Fvqa\u002Fmutan_noatt_train.yaml --dir_logs logs\u002Fvqa\u002Fmutan_noatt_train\n```\n\n在训练集上运行 MutanAtt 模型，并在每个 epoch 结束后在验证集上进行评估。\n\n```\npython train.py --vqa_trainsplit train --path_opt options\u002Fvqa\u002Fmutan_att_trainval.yaml\n```\n\n在训练集和验证集（默认）上运行 MutanAtt 模型，并在每个 epoch 结束后对测试集进行推理（生成可提交到评估服务器的结果文件）。\n\n```\npython train.py --vqa_trainsplit trainval --path_opt 
options\u002Fvqa\u002Fmutan_att_trainval.yaml\n```\n\n### 在 VQA 2.0 上训练模型\n\n参阅 [vqa2\u002Fmutan_att_trainval](https:\u002F\u002Fgithub.com\u002FCadene\u002Fvqa.pytorch\u002Fblob\u002Fmaster\u002Foptions\u002Fvqa2\u002Fmutan_att_trainval.yaml) 的选项：\n\n```\npython train.py --path_opt options\u002Fvqa2\u002Fmutan_att_trainval.yaml\n```\n\n### 在 VQA（1.0 或 2.0）+ VisualGenome 上训练模型\n\n参阅 [vqa2\u002Fmutan_att_trainval_vg](https:\u002F\u002Fgithub.com\u002FCadene\u002Fvqa.pytorch\u002Fblob\u002Fmaster\u002Foptions\u002Fvqa2\u002Fmutan_att_trainval_vg.yaml) 的选项：\n\n```\npython train.py --path_opt options\u002Fvqa2\u002Fmutan_att_trainval_vg.yaml\n```\n\n### 监控训练过程\n\n使用 `plotly` 创建实验可视化图，以监控训练过程，效果如图所示（**点击图片即可访问 HTML\u002FJS 文件**）：\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Frawgit.com\u002FCadene\u002Fvqa.pytorch\u002Fmaster\u002Fdoc\u002Fmutan_noatt.html\">\n        \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCadene_vqa.pytorch_readme_321cb83d4bc1.png\" width=\"600\"\u002F>\n    \u003C\u002Fa>\n\u003C\u002Fp>\n\n请注意，需等待首次开放性问题准确率计算完成，HTML 文件才会生成并在默认浏览器中打开。该页面每 60 秒会自动刷新，但您仍需手动按 F5 刷新浏览器才能看到更新。\n\n```\npython visu.py --dir_logs logs\u002Fvqa\u002Fmutan_noatt\n```\n\n创建多个实验的可视化图，以便比较或监控它们，效果如图所示（**点击图片即可访问 HTML\u002FJS 文件**）：\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Frawgit.com\u002FCadene\u002Fvqa.pytorch\u002Fmaster\u002Fdoc\u002Fmutan_noatt_vs_att.html\">\n        \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCadene_vqa.pytorch_readme_1a1e11858e98.png\" width=\"600\"\u002F>\n    \u003C\u002Fa>\n\u003C\u002Fp>\n\n```\npython visu.py --dir_logs logs\u002Fvqa\u002Fmutan_noatt,logs\u002Fvqa\u002Fmutan_att\n```\n\n### 继续训练\n\n从最后一个检查点恢复模型训练。\n\n```\npython train.py --path_opt options\u002Fvqa\u002Fmutan_noatt.yaml --dir_logs logs\u002Fvqa\u002Fmutan_noatt --resume ckpt\n```\n\n从最佳检查点恢复模型训练。\n\n```\npython train.py --path_opt options\u002Fvqa\u002Fmutan_noatt.yaml --dir_logs 
logs\u002Fvqa\u002Fmutan_noatt --resume best\n```\n\n### 在 VQA 上评估模型\n\n从最佳检查点评估模型。如果您的模型仅在训练集上训练过（`vqa_trainsplit=train`），则会在验证集上评估，并继续对测试集进行推理；若是在训练集和验证集上都训练过（`vqa_trainsplit=trainval`），则不会在验证集上再次评估。\n\n```\npython train.py --vqa_trainsplit train --path_opt options\u002Fvqa\u002Fmutan_att.yaml --dir_logs logs\u002Fvqa\u002Fmutan_att --resume best -e\n```\n\n### Web 演示\n\n您需要在 `demo_server.py` 第 169 行设置本地 IP 地址和端口，在 `demo_web\u002Fjs\u002Fcustom.js` 第 51 行设置全局 IP 地址和端口。全局 IP 地址对应的端口必须重定向到您的本地 IP 地址。\n\n启动您的 API：\n\n```\nCUDA_VISIBLE_DEVICES=0 python demo_server.py\n```\n\n在浏览器中打开 `demo_web\u002Findex.html` 即可通过人机界面访问该 API。\n\n## 引用\n\n如果您在工作中使用了 Mutan，请引用以下 arXiv 论文：\n\n```\n@article{benyounescadene2017mutan,\n  author = {Hedi Ben-Younes and \n    R{\\'{e}}mi Cad{\\`{e}}ne and\n    Nicolas Thome and\n    Matthieu Cord},\n  title = {MUTAN: Multimodal Tucker Fusion for Visual Question Answering},\n  journal = {ICCV},\n  year = {2017},\n  url = {http:\u002F\u002Farxiv.org\u002Fabs\u002F1705.06676}\n}\n```\n\n## 致谢\n\n特别感谢 [MLB](https:\u002F\u002Farxiv.org\u002Fabs\u002F1610.04325) 的作者提供了部分 [Torch7 代码](https:\u002F\u002Fgithub.com\u002Fjnhwkim\u002FMulLowBiVQA)，感谢 [MCB](https:\u002F\u002Farxiv.org\u002Fabs\u002F1606.01847) 的作者提供了部分 [Caffe 代码](https:\u002F\u002Fgithub.com\u002Fakirafukui\u002Fvqa-mcb)，同时也感谢 LIP6 的各位老师和朋友营造了良好的工作氛围。","# vqa.pytorch 快速上手指南\n\n`vqa.pytorch` 是一个基于 PyTorch 的视觉问答（VQA）开源框架，由 LIP6 实验室开发。该库实现了包括 MUTAN（当前 SOTA 方法之一）、MLB 等多种多模态融合模型，支持在 VQA 1.0\u002F2.0 和 VisualGenome 数据集上进行训练与评估。\n\n## 环境准备\n\n本项目仅支持 **Python 3**，推荐使用 Anaconda 进行环境管理。\n\n*   **操作系统**: Linux \u002F macOS (Windows 需自行配置 CUDA 环境)\n*   **核心依赖**:\n    *   Python 3.x\n    *   PyTorch (需支持 CUDA)\n    *   torchvision\n*   **硬件要求**: 建议使用配备 NVIDIA GPU 的机器以加速特征提取和模型训练。\n\n## 安装步骤\n\n### 1. 
创建虚拟环境并安装 PyTorch\n使用 Conda 创建名为 `vqa` 的环境，并安装 PyTorch 及相关依赖。\n\n```bash\nconda create --name vqa python=3\nsource activate vqa\n# 安装 pytorch, torchvision 和 cuda80 (根据你的 CUDA 版本调整，如 cuda90, cuda100 等)\nconda install pytorch torchvision cuda80 -c soumith\n```\n\n> **提示**：国内用户若下载缓慢，可尝试使用清华或中科大镜像源配置 conda，或在 PyTorch 官网选择对应的国内镜像命令。\n\n### 2. 克隆仓库并安装依赖\n克隆代码时务必加上 `--recursive` 参数以获取必要的子模块（如 VQA 评估工具、Skip-thoughts 模型等）。\n\n```bash\ncd $HOME\ngit clone --recursive https:\u002F\u002Fgithub.com\u002FCadene\u002Fvqa.pytorch.git \ncd vqa.pytorch\npip install -r requirements.txt\n```\n\n### 3. 数据准备\n代码会在首次运行时自动下载并预处理所需数据（COCO 图像、VQA 标注等）。数据链接配置在 `vqa\u002Fdatasets\u002F` 目录下的 Python 文件中。如需手动下载预提取的特征（针对 VQA 1.0），可参考以下命令：\n\n```bash\nmkdir -p data\u002Fcoco\u002Fextract\u002Farch,fbresnet152torch\ncd data\u002Fcoco\u002Fextract\u002Farch,fbresnet152torch\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Ftrainset.hdf5\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Ftrainset.txt\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Fvalset.hdf5\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Fvalset.txt\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Ftestset.hdf5\nwget https:\u002F\u002Fdata.lip6.fr\u002Fcoco\u002Ftestset.txt\n```\n\n## 基本使用\n\n### 1. 下载预训练模型\n你可以直接下载作者在 VQA 1.0 或 2.0 上训练好的模型（如 MutanAtt）进行测试或微调。\n\n**VQA 1.0 模型示例：**\n```bash\nmkdir -p logs\u002Fvqa\ncd logs\u002Fvqa\nwget http:\u002F\u002Fwebia.lip6.fr\u002F~cadene\u002FDownloads\u002Fvqa.pytorch\u002Flogs\u002Fvqa\u002Fmutan_att_trainval.zip\nunzip mutan_att_trainval.zip\n```\n\n**VQA 2.0 模型示例：**\n```bash\nmkdir -p logs\u002Fvqa2\ncd logs\u002Fvqa2\nwget http:\u002F\u002Fdata.lip6.fr\u002Fcadene\u002Fvqa.pytorch\u002Fvqa2\u002Fmutan_att_trainval.zip\nunzip mutan_att_trainval.zip\n```\n\n### 2. 
评估模型\n下载完成后，使用 `train.py` 脚本配合 `-e` 参数即可评估模型：仅在训练集上训练的模型会在验证集上评估，而在训练验证集上训练的模型（如下面的 `trainval` 配置）只会对测试集进行推理。\n\n```bash\n# 评估 VQA 1.0 模型\npython train.py -e --path_opt options\u002Fvqa\u002Fmutan_att_trainval.yaml --resume ckpt\n\n# 评估 VQA 2.0 模型 (需进入对应目录或调整路径)\npython train.py -e --path_opt options\u002Fvqa2\u002Fmutan_att_trainval.yaml --resume ckpt\n```\n\n### 3. 训练新模型\n若要从头开始训练，需先提取图像特征（可选，视配置而定），然后运行训练命令。以下是在 VQA 2.0 上训练 MutanAtt 模型的示例：\n\n```bash\npython train.py --path_opt options\u002Fvqa2\u002Fmutan_att_train.yaml\n```\n\n### 4. 监控训练过程\n项目提供了可视化工具来监控训练日志和损失曲线：\n\n```bash\npython visu.py\n```\n\n### 5. 提取图像特征 (可选)\n如果需要使用自定义的 CNN backbone 提取 COCO 数据集特征：\n\n```bash\npython extract.py --dataset coco --mode att\n```\n*注：`mode` 可选 `att`（输出 2048x14x14）、`noatt`（输出 2048）或 `both`（默认），具体参见 `extract.py` 帮助信息。*","某计算机视觉实验室的研究团队正致力于开发一款能辅助视障人士理解周围环境的智能应用，需要训练模型准确回答关于图像内容的自然语言提问。\n\n### 没有 vqa.pytorch 时\n- 研究人员需从零搭建多模态融合架构，复现论文中先进的 MUTAN 模型耗时数月且极易出错。\n- 缺乏统一的模块化代码库，处理 VQA 1.0\u002F2.0 及 VisualGenome 等不同数据集的格式转换工作繁琐重复。\n- 难以高效提取和整合 ResNet-152 等预训练图像特征与 LSTM 问题特征，实验迭代周期漫长。\n- 缺少成熟的训练监控与评估脚本，调试模型收敛情况和对比基准结果十分困难。\n\n### 使用 vqa.pytorch 后\n- 直接调用内置的 MUTAN 最先进（SOTA）模型架构，几天内即可完成基线复现并在此基础上进行改进。\n- 利用其高度模块化的设计，轻松切换并支持多种主流数据集，大幅减少了数据预处理的人力成本。\n- 一键执行脚本即可从 COCO 或 VisualGenome 中提取高质量特征，显著加速了端到端的模型训练流程。\n- 依托完善的文档和监控工具，实时追踪训练指标，快速定位问题并优化模型性能。\n\nvqa.pytorch 通过提供高效、可复现的科研级代码底座，将原本数月的算法验证周期缩短至数周，极大推动了多模态问答技术的落地应用。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FCadene_vqa.pytorch_62eca644.png","Cadene","Remi","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FCadene_79dfaf95.png",null,"@huggingface ","Paris","RemiCadene","http:\u002F\u002Fremicadene.com","https:\u002F\u002Fgithub.com\u002FCadene",[85,89,93,97,101],{"name":86,"color":87,"percentage":88},"Python","#3572A5",84.4,{"name":90,"color":91,"percentage":92},"Jupyter 
Notebook","#DA5B0B",8.8,{"name":94,"color":95,"percentage":96},"HTML","#e34c26",4.9,{"name":98,"color":99,"percentage":100},"JavaScript","#f1e05a",1.7,{"name":102,"color":103,"percentage":104},"CSS","#663399",0.2,735,178,"2026-03-03T03:01:57",4,"Linux, macOS","需要 NVIDIA GPU (通过 conda 安装 pytorch torchvision cuda80)，显存需求未说明，CUDA 8.0+","未说明",{"notes":113,"python":114,"dependencies":115},"建议使用 Anaconda 创建虚拟环境。代码包含外部子模块，克隆时需使用 --recursive 参数。数据（COCO, VisualGenome, VQA）会在需要时自动下载和预处理。提供了基于 Torch7 移植的 ResNet-152 特征和预训练模型。注意区分不同版本的 ResNet-152 以获得最佳结果。","3.x (不支持 Python 2)",[116,117,118,119,120,121],"pytorch","torchvision","cuda80","VQA (submodule)","skip-thoughts.torch (submodule)","pretrained-models.pytorch (submodule)",[13],[124,125,126,127,116,128,129,130,131],"vqa","deep-learning","resnet","skipthoughts","clevr","coco","torch","vgenome","2026-03-27T02:49:30.150509","2026-04-06T09:46:13.340114",[135,140,145,150,155,160],{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},17665,"如何在多个 GPU 上同时训练多个模型？遇到进程卡死或死锁怎么办？","在同一台机器上同时运行多个训练进程（即使是在不同的 GPU 上）可能会导致 CUDA 或 PyTorch 层面的已知问题，表现为进程卡死或数据加载器异常。维护者建议最稳妥的方法是“每个 GPU 运行一个实验”，避免同时启动多个训练进程。此外，使用压缩的 numpy 特征代替 HDF5 格式可以显著减少磁盘 I\u002FO 瓶颈（读取数据量减少约 10 倍），从而缓解部分资源竞争问题。如果必须强制终止卡死的进程，使用 `kill -9 PID` 可能会产生僵尸进程，需谨慎操作。","https:\u002F\u002Fgithub.com\u002FCadene\u002Fvqa.pytorch\u002Fissues\u002F6",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},17666,"数据加载速度很慢，有什么优化建议？HDF5 格式是否是瓶颈？","是的，h5py\u002FHDF5 格式可能不适合这种读密集型任务，会导致较高的数据加载时间。维护者建议使用预训练的 Caffe 模型并将张量存储为压缩的 numpy 数组（.npy），这样可以将整个训练集缓存到 RAM 中（例如仅需 19GB），从而大幅提升速度。对于包含注意力机制的模型，数据维度较大 (batch_size x 2048 x 14 x 14)，加载压力更大；而不含注意力的模型数据维度较小 (batch_size x 2048)，影响较小。如果遇到高加载延迟，建议使用 `atop` 或 `htop` 等监控工具定位瓶颈，也可以考虑尝试 LMDB 格式替代 HDF5。","https:\u002F\u002Fgithub.com\u002FCadene\u002Fvqa.pytorch\u002Fissues\u002F4",{"id":146,"question_zh":147,"answer_zh":148,"source_url":149},17667,"为什么仅使用 LSTM 和分类器训练问答模型时损失下降缓慢且准确率低？","这通常是因为 PyTorch 使用的 cuDNN 后端不包含序列化的 Dropout 层，导致自定义的 LSTM + 
Dropout 结构效果不佳。维护者指出，如果想要获得强大的基线结果，应该使用预训练的 Skip-Thoughts 模型作为纯语言模型，而不是从头训练 embedding 矩阵加 LSTM。用户反馈证实，切换到 Skip-Thoughts 后结果有显著提升。","https:\u002F\u002Fgithub.com\u002FCadene\u002Fvqa.pytorch\u002Fissues\u002F20",{"id":151,"question_zh":152,"answer_zh":153,"source_url":154},17668,"VQA 评估中的'val accuracy'和'open ended val accuracy'有什么区别？","根据论文公式 (13)，'Open Ended'准确度的计算逻辑是：如果预测的答案在真实答案列表中至少出现 3 次，则该样本的准确度计为 1。这种指标考虑了标注者之间的共识。而 'MC' (Multiple Choices) 指的是多选题模式，即答案作为输入的一部分提供，这与 'OpenEnded'（开放式问答）是两个不同的问题设定。","https:\u002F\u002Fgithub.com\u002FCadene\u002Fvqa.pytorch\u002Fissues\u002F8",{"id":156,"question_zh":157,"answer_zh":158,"source_url":159},17669,"如何从代码中生成的注意力列表 (list_att) 创建可视化的热力图？","维护者表示会对列表中的每个元素创建一个热力图。具体的注意力权重提取可以参考代码中的 `vqa\u002Fmodels\u002Fatt.py` 第 99 行附近。需要注意的是，注意力图有时难以进行定性解释，文档或 README 中展示的示例通常是经过挑选的（cherry-picked），以展示最佳效果，实际运行中生成的图可能因样本不同而有差异。","https:\u002F\u002Fgithub.com\u002FCadene\u002Fvqa.pytorch\u002Fissues\u002F51",{"id":161,"question_zh":162,"answer_zh":163,"source_url":164},17670,"运行 Demo 时遇到 'iteration over a 0-d tensor' 错误如何解决？","该错误通常发生在处理长度计算时，张量维度不符合预期（特别是在 PyTorch 0.4 版本中）。错误出现在 `skipthoughts.py` 的 `_process_lengths` 函数中，原代码尝试对一个 0 维张量进行迭代。虽然用户尝试通过添加列表括号修复报错，但这可能导致结果无意义。这通常意味着输入数据的维度处理逻辑与当前 PyTorch 版本不兼容，需要检查输入张量的 `.size()` 并确保在调用 `.sum()` 或 `.squeeze()` 后保持正确的维度，或者升级\u002F降级 PyTorch 版本以匹配代码库的要求。","https:\u002F\u002Fgithub.com\u002FCadene\u002Fvqa.pytorch\u002Fissues\u002F35",[166],{"id":167,"version":168,"summary_zh":169,"released_at":170},107979,"v2.0","#### 工厂\n\n- 可以通过工厂模式创建 VQA 模型、卷积神经网络和 VQA 数据集。\n\n#### VQA 2.0\n\n- 添加了 VQA2（AbstractVQA）。\n\n#### VisualGenome\n\n- 添加了 VisualGenome（AbstractVQADataset），用于与 VQA 数据集合并。\n- 添加了 VisualGenomeImages（AbstractImagesDataset），用于提取特征。\n- `extract.py` 现在支持提取 VisualGenome 特征。\n\n#### 变长特征尺寸\n\n- `extract.py` 现在可以通过命令行参数 `--size` 从尺寸不等于 448 的图像中提取特征。\n- FeaturesDataset 现在增加了一个可选的 `opt['size']` 参数。\n\n#### FBResNet152\n\n- `convnets.py` 不仅支持 torchvision 中的 ResNet，还支持外部预训练模型。\n- 特别是 
FBResNet152，它是对之前一直使用的 torch7 版本 fbresnet152torch 的移植。","2017-07-18T13:44:43"]