[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-airsplay--lxmert":3,"tool-airsplay--lxmert":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",151314,2,"2026-04-11T23:32:58",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":77,"owner_email":78,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":81,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":94,"env_os":95,"env_gpu":96,"env_ram":97,"env_deps":98,"category_tags":104,"github_topics":78,"view_count":32,"oss_zip_url":78,"oss_zip_packed_at":78,"status":17,"created_at":106,"updated_at":107,"faqs":108,"releases":109},6799,"airsplay\u002Flxmert","lxmert","PyTorch code for EMNLP 2019 paper \"LXMERT: Learning Cross-Modality Encoder Representations from Transformers\".","LXMERT 是一个基于 PyTorch 开发的开源深度学习框架，源自 EMNLP 2019 的学术论文。它致力于解决计算机视觉与自然语言处理之间的跨模态理解难题，让 AI 能够像人类一样同时“看”图并“读”懂文字描述，从而精准回答相关问题或进行逻辑推理。\n\n该工具的核心亮点在于其独特的 Transformer 架构设计，通过专门的交叉模态编码器，高效地融合图像特征与文本语义。在 VQA（视觉问答）、GQA 及 NLVR2 等权威基准测试中，LXMERT 曾取得顶尖成绩，证明了其强大的泛化能力与准确性。项目不仅提供了经过大规模预训练的模型权重，方便用户直接调用或微调，还完整复现了从预训练到下游任务适配的全流程代码。\n\nLXMERT 非常适合人工智能研究人员、算法工程师以及高校师生使用。如果你正在探索多模态学习领域，或者需要构建能够理解图文内容的智能应用，LXMERT 提供了一个坚实且透明的技术基线。尽管部署和训练过程需要一定的深度学习基础及 GPU 资源支持，但其详尽的文档和成熟的代码库能显著降低研究门槛，助力开发者快速验证想法并推进前沿探索。","# LXMERT: Learning Cross-Modality Encoder Representations from Transformers\n\n**Our servers break again :(. I have updated the links so that they should work fine now. Sorry for the inconvenience. Please let me for any further issues. Thanks! --Hao, Dec 03**\n\n## Introduction\nPyTorch code for the EMNLP 2019 paper [\"LXMERT: Learning Cross-Modality Encoder Representations from Transformers\"](https:\u002F\u002Farxiv.org\u002Fabs\u002F1908.07490). Slides of our EMNLP 2019 talk are avialable [here](http:\u002F\u002Fwww.cs.unc.edu\u002F~airsplay\u002FEMNLP_2019_LXMERT_slides.pdf). \n\n- To analyze the output of pre-trained model (instead of fine-tuning on downstreaming tasks), please load the weight `https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Fgithub_pretrain\u002Flxmert20\u002FEpoch20_LXRT.pth`, which is trained as in section [pre-training](#pre-training). The default weight [here](#pre-trained-models) is trained with a slightly different protocal as this code.\n\n\n## Results (with this Github version)\n\n| Split            | [VQA](https:\u002F\u002Fvisualqa.org\u002F)     | [GQA](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fdorarad\u002Fgqa\u002F)     | [NLVR2](http:\u002F\u002Flil.nlp.cornell.edu\u002Fnlvr\u002F)  |\n|-----------       |:----:   |:---:    |:------:|\n| Local Validation | 69.90%  | 59.80%  | 74.95% |\n| Test-Dev         | 72.42%  | 60.00%  | 74.45% (Test-P) |\n| Test-Standard    | 72.54%  | 60.33%  | 76.18% (Test-U) |\n\nAll the results in the table are produced exactly with this code base.\nSince [VQA](https:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F163\u002Foverview) and [GQA](https:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F225\u002Foverview) test servers only allow limited number of 'Test-Standard' submissions,\nwe use our remaining submission entry from the [VQA](https:\u002F\u002Fvisualqa.org\u002Fchallenge.html)\u002F[GQA](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fdorarad\u002Fgqa\u002Fchallenge.html) challenges 2019 to get these results.\nFor [NLVR2](http:\u002F\u002Flil.nlp.cornell.edu\u002Fnlvr\u002F), we only test once on the unpublished test set (test-U).\n\nWe use this code (with model ensemble) to participate in [VQA 2019](https:\u002F\u002Fvisualqa.org\u002Froe.html) and [GQA 2019](https:\u002F\u002Fdrive.google.com\u002Fopen?id=1CtFk0ldbN5w2qhwvfKrNzAFEj-I9Tjgy) challenge in May 2019.\nWe are the **only** team ranking **top-3** in both challenges.\n\n\n## Pre-trained models\nThe pre-trained model (870 MB) is available at http:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Fmodel_LXRT.pth, and can be downloaded with:\n```bash\nmkdir -p snap\u002Fpretrained \nwget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Fmodel_LXRT.pth -P snap\u002Fpretrained\n```\n\n\nIf download speed is slower than expected, the pre-trained model could also be downloaded from [other sources](#alternative-dataset-and-features-download-links).\nPlease help put the downloaded file at `snap\u002Fpretrained\u002Fmodel_LXRT.pth`.\n\nWe also provide data and commands to pre-train the model in [pre-training](#pre-training). The default setup needs 4 GPUs and takes around a week to finish. The pre-trained weights with this code base could be downloaded from `https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Fgithub_pretrain\u002Flxmert\u002FEpochXX_LXRT.pth`, `XX` from 01 to 12. It is pre-trained for 12 epochs (instead of 20 in EMNLP paper) thus the fine-tuned reuslts are about 0.3% lower on each datasets. \n\n\n\n## Fine-tune on Vision-and-Language Tasks\nWe fine-tune our LXMERT pre-trained model on each task with following hyper-parameters:\n\n|Dataset      | Batch Size   | Learning Rate   | Epochs  | Load Answers  |\n|---   |:---:|:---:   |:---:|:---:|\n|VQA   | 32  | 5e-5   | 4   | Yes |\n|GQA   | 32  | 1e-5   | 4   | Yes |\n|NLVR2 | 32  | 5e-5   | 4   | No  |\n\nAlthough the fine-tuning processes are almost the same except for different hyper-parameters,\nwe provide descriptions for each dataset to take care of all details.\n\n### General \nThe code requires **Python 3** and please install the Python dependencies with the command:\n```bash\npip install -r requirements.txt\n```\n\nBy the way, a Python 3 virtual environment could be set up and run with:\n```bash\nvirtualenv name_of_environment -p python3\nsource name_of_environment\u002Fbin\u002Factivate\n```\n### VQA\n#### Fine-tuning\n1. Please make sure the LXMERT pre-trained model is either [downloaded](#pre-trained-models) or [pre-trained](#pre-training).\n\n2. Download the re-distributed json files for VQA 2.0 dataset. The raw VQA 2.0 dataset could be downloaded from the [official website](https:\u002F\u002Fvisualqa.org\u002Fdownload.html).\n    ```bash\n    mkdir -p data\u002Fvqa\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fvqa\u002Ftrain.json -P data\u002Fvqa\u002F\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fvqa\u002Fnominival.json -P  data\u002Fvqa\u002F\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fvqa\u002Fminival.json -P data\u002Fvqa\u002F\n    ```\n3. Download faster-rcnn features for MS COCO train2014 (17 GB) and val2014 (8 GB) images (VQA 2.0 is collected on MS COCO dataset).\nThe image features are\nalso available on Google Drive and Baidu Drive (see [Alternative Download](#alternative-dataset-and-features-download-links) for details).\n    ```bash\n    mkdir -p data\u002Fmscoco_imgfeat\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fmscoco_imgfeat\u002Ftrain2014_obj36.zip -P data\u002Fmscoco_imgfeat\n    unzip data\u002Fmscoco_imgfeat\u002Ftrain2014_obj36.zip -d data\u002Fmscoco_imgfeat && rm data\u002Fmscoco_imgfeat\u002Ftrain2014_obj36.zip\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fmscoco_imgfeat\u002Fval2014_obj36.zip -P data\u002Fmscoco_imgfeat\n    unzip data\u002Fmscoco_imgfeat\u002Fval2014_obj36.zip -d data && rm data\u002Fmscoco_imgfeat\u002Fval2014_obj36.zip\n    ```\n\n4. Before fine-tuning on whole VQA 2.0 training set, verifying the script and model on a small training set (512 images) is recommended. \nThe first argument `0` is GPU id. The second argument `vqa_lxr955_tiny` is the name of this experiment.\n    ```bash\n    bash run\u002Fvqa_finetune.bash 0 vqa_lxr955_tiny --tiny\n    ```\n5. If no bug came out, then the model is ready to be trained on the whole VQA corpus:\n    ```bash\n    bash run\u002Fvqa_finetune.bash 0 vqa_lxr955\n    ```\nIt takes around 8 hours (2 hours per epoch * 4 epochs) to converge. \nThe **logs** and **model snapshots** will be saved under folder `snap\u002Fvqa\u002Fvqa_lxr955`. \nThe validation result after training will be around **69.7%** to **70.2%**. \n\n#### Local Validation\nThe results on the validation set (our minival set) are printed while training.\nThe validation result is also saved to `snap\u002Fvqa\u002F[experiment-name]\u002Flog.log`.\nIf the log file was accidentally deleted, the validation result in training is also reproducible from the model snapshot:\n```bash\nbash run\u002Fvqa_test.bash 0 vqa_lxr955_results --test minival --load snap\u002Fvqa\u002Fvqa_lxr955\u002FBEST\n```\n#### Submitted to VQA test server\n1. Download our re-distributed json file containing VQA 2.0 test data.\n    ```bash\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fvqa\u002Ftest.json -P data\u002Fvqa\u002F\n    ```\n2. Download the faster rcnn features for MS COCO test2015 split (16 GB).\n    ```bash\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fmscoco_imgfeat\u002Ftest2015_obj36.zip -P data\u002Fmscoco_imgfeat\n    unzip data\u002Fmscoco_imgfeat\u002Ftest2015_obj36.zip -d data && rm data\u002Fmscoco_imgfeat\u002Ftest2015_obj36.zip\n    ```\n3. Since VQA submission system requires submitting whole test data, we need to run inference over all test splits \n(i.e., test dev, test standard, test challenge, and test held-out). \nIt takes around 10~15 mins to run test inference (448K instances to run).\n    ```bash\n    bash run\u002Fvqa_test.bash 0 vqa_lxr955_results --test test --load snap\u002Fvqa\u002Fvqa_lxr955\u002FBEST\n    ```\n The test results will be saved in `snap\u002Fvqa_lxr955_results\u002Ftest_predict.json`. \nThe VQA 2.0 challenge for this year is host on [EvalAI](https:\u002F\u002Fevalai.cloudcv.org\u002F) at [https:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F163\u002Foverview](https:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F163\u002Foverview)\nIt still allows submission after the challenge ended.\nPlease check the official website of [VQA Challenge](https:\u002F\u002Fvisualqa.org\u002Fchallenge.html) for detailed information and \nfollow the instructions in [EvalAI](https:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F163\u002Foverview) to submit.\nIn general, after registration, the only thing remaining is to upload the `test_predict.json` file and wait for the result back.\n\nThe testing accuracy with exact this code is **72.42%** for test-dev and **72.54%**  for test-standard.\nThe results with the code base are also publicly shown on the [VQA 2.0 leaderboard](\nhttps:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F163\u002Fleaderboard\u002F498\n) with entry `LXMERT github version`.\n\n\n### GQA\n\n#### Fine-tuning\n1. Please make sure the LXMERT pre-trained model is either [downloaded](#pre-trained-models) or [pre-trained](#pre-training).\n\n2. Download the re-distributed json files for GQA balanced version dataset.\nThe original GQA dataset is available [in the Download section of its website](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fdorarad\u002Fgqa\u002Fdownload.html)\nand the script to preprocess these datasets is under `data\u002Fgqa\u002Fprocess_raw_data_scripts`.\n    ```bash\n    mkdir -p data\u002Fgqa\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fgqa\u002Ftrain.json -P data\u002Fgqa\u002F\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fgqa\u002Fvalid.json -P data\u002Fgqa\u002F\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fgqa\u002Ftestdev.json -P data\u002Fgqa\u002F\n    ```\n3. Download Faster R-CNN features for Visual Genome and GQA testing images (30 GB).\nGQA's training and validation data are collected from Visual Genome.\nIts testing images come from MS COCO test set (I have verified this with one of GQA authors [Drew A. Hudson](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fdrew-a-hudson\u002F)).\nThe image features are\nalso available on Google Drive and Baidu Drive (see [Alternative Download](#alternative-dataset-and-features-download-links) for details).\n    ```bash\n    mkdir -p data\u002Fvg_gqa_imgfeat\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fvg_gqa_imgfeat\u002Fvg_gqa_obj36.zip -P data\u002Fvg_gqa_imgfeat\n    unzip data\u002Fvg_gqa_imgfeat\u002Fvg_gqa_obj36.zip -d data && rm data\u002Fvg_gqa_imgfeat\u002Fvg_gqa_obj36.zip\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fvg_gqa_imgfeat\u002Fgqa_testdev_obj36.zip -P data\u002Fvg_gqa_imgfeat\n    unzip data\u002Fvg_gqa_imgfeat\u002Fgqa_testdev_obj36.zip -d data && rm data\u002Fvg_gqa_imgfeat\u002Fgqa_testdev_obj36.zip\n    ```\n\n4. Before fine-tuning on whole GQA training+validation set, verifying the script and model on a small training set (512 images) is recommended. \nThe first argument `0` is GPU id. The second argument `gqa_lxr955_tiny` is the name of this experiment.\n    ```bash\n    bash run\u002Fgqa_finetune.bash 0 gqa_lxr955_tiny --tiny\n    ```\n\n5. If no bug came out, then the model is ready to be trained on the whole GQA corpus (train + validation), and validate on \nthe testdev set:\n    ```bash\n    bash run\u002Fgqa_finetune.bash 0 gqa_lxr955\n    ```\nIt takes around 16 hours (4 hours per epoch * 4 epochs) to converge. \nThe **logs** and **model snapshots** will be saved under folder `snap\u002Fgqa\u002Fgqa_lxr955`. \nThe validation result after training will be around **59.8%** to **60.1%**. \n\n#### Local Validation\nThe results on testdev is printed out while training and saved in `snap\u002Fgqa\u002Fgqa_lxr955\u002Flog.log`.\nIt could be also re-calculated with command:\n```bash\nbash run\u002Fgqa_test.bash 0 gqa_lxr955_results --load snap\u002Fgqa\u002Fgqa_lxr955\u002FBEST --test testdev --batchSize 1024\n```\n\n> Note: Our local testdev result is usually 0.1% to 0.5% lower than the \nsubmitted testdev result. \nThe reason is that the test server takes an [advanced \nevaluation system](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fdorarad\u002Fgqa\u002Fevaluate.html) while our local evaluator only \ncalculates the exact matching.\nPlease use [this official evaluator](https:\u002F\u002Fnlp.stanford.edu\u002Fdata\u002Fgqa\u002Feval.zip) (784 MB) if you \nwant to have the exact number without submitting.\n\n\n#### Submitted to GQA test server\n1. Download our re-distributed json file containing GQA test data.\n    ```bash\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fgqa\u002Fsubmit.json -P data\u002Fgqa\u002F\n    ```\n\n2. Since GQA submission system requires submitting the whole test data, \nwe need to run inference over all test splits.\nIt takes around 30~60 mins to run test inference (4.2M instances to run).\n    ```bash\n    bash run\u002Fgqa_test.bash 0 gqa_lxr955_results --load snap\u002Fgqa\u002Fgqa_lxr955\u002FBEST --test submit --batchSize 1024\n    ```\n\n3. After running test script, a json file `submit_predict.json` under `snap\u002Fgqa\u002Fgqa_lxr955_results` will contain \nall the prediction results and is ready to be submitted.\nThe GQA challenge 2019 is hosted by [EvalAI](https:\u002F\u002Fevalai.cloudcv.org\u002F) at [https:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F225\u002Foverview](https:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F225\u002Foverview).\nAfter registering the account, uploading the `submit_predict.json` and waiting for the results are the only thing remained.\nPlease also check [GQA official website](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fdorarad\u002Fgqa\u002F) \nin case the test server is changed.\n\nThe testing accuracy with exactly this code is **60.00%** for test-dev and **60.33%**  for test-standard.\nThe results with the code base are also publicly shown on the [GQA leaderboard](\nhttps:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F225\u002Fleaderboard\n) with entry `LXMERT github version`.\n\n### NLVR2\n\n#### Fine-tuning\n\n1. Download the NLVR2 data from the official [GitHub repo](https:\u002F\u002Fgithub.com\u002Flil-lab\u002Fnlvr).\n    ```bash\n    git submodule update --init\n    ```\n\n\n2. Process the NLVR2 data to json files.\n    ```bash\n    bash -c \"cd data\u002Fnlvr2\u002Fprocess_raw_data_scripts && python process_dataset.py\"\n    ```\n\n3. Download the NLVR2 image features for train (21 GB) & valid (1.6 GB) splits. \nThe image features are\nalso available on Google Drive and Baidu Drive (see [Alternative Download](#alternative-dataset-and-features-download-links) for details).\nTo access to the original images, please follow the instructions on [NLVR2 official Github](https:\u002F\u002Fgithub.com\u002Flil-lab\u002Fnlvr\u002Ftree\u002Fmaster\u002Fnlvr2).\nThe images could either be downloaded with the urls or by signing an agreement form for data usage. And the feature could be extracted as described in [feature extraction](#faster-r-cnn-feature-extraction)\n    ```bash\n    mkdir -p data\u002Fnlvr2_imgfeat\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fnlvr2_imgfeat\u002Ftrain_obj36.zip -P data\u002Fnlvr2_imgfeat\n    unzip data\u002Fnlvr2_imgfeat\u002Ftrain_obj36.zip -d data && rm data\u002Fnlvr2_imgfeat\u002Ftrain_obj36.zip\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fnlvr2_imgfeat\u002Fvalid_obj36.zip -P data\u002Fnlvr2_imgfeat\n    unzip data\u002Fnlvr2_imgfeat\u002Fvalid_obj36.zip -d data && rm data\u002Fnlvr2_imgfeat\u002Fvalid_obj36.zip\n    ```\n\n4. Before fine-tuning on whole NLVR2 training set, verifying the script and model on a small training set (512 images) is recommended. \nThe first argument `0` is GPU id. The second argument `nlvr2_lxr955_tiny` is the name of this experiment.\nDo not worry if the result is low (50~55) on this tiny split, \nthe whole training data would bring the performance back.\n    ```bash\n    bash run\u002Fnlvr2_finetune.bash 0 nlvr2_lxr955_tiny --tiny\n    ```\n\n5. If no bugs are popping up from the previous step, \nit means that the code, the data, and image features are ready.\nPlease use this command to train on the full training set. \nThe result on NLVR2 validation (dev) set would be around **74.0** to **74.5**.\n    ```bash\n    bash run\u002Fnlvr2_finetune.bash 0 nlvr2_lxr955\n    ```\n\n#### Inference on Public Test Split\n1. Download NLVR2 image features for the public test split (1.6 GB).\n    ```bash\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fnlvr2_imgfeat\u002Ftest_obj36.zip -P data\u002Fnlvr2_imgfeat\n    unzip data\u002Fnlvr2_imgfeat\u002Ftest_obj36.zip -d data\u002Fnlvr2_imgfeat && rm data\u002Fnlvr2_imgfeat\u002Ftest_obj36.zip\n    ```\n\n2. Test on the public test set (corresponding to 'test-P' on [NLVR2 leaderboard](http:\u002F\u002Flil.nlp.cornell.edu\u002Fnlvr\u002F)) with:\n    ```bash\n    bash run\u002Fnlvr2_test.bash 0 nlvr2_lxr955_results --load snap\u002Fnlvr2\u002Fnlvr2_lxr955\u002FBEST --test test --batchSize 1024\n    ```\n\n3. The test accuracy would be shown on the screen after around 5~10 minutes.\nIt also saves the predictions in the file `test_predict.csv` \nunder `snap\u002Fnlvr2_lxr955_reuslts`, which is compatible to NLVR2 [official evaluation script](https:\u002F\u002Fgithub.com\u002Flil-lab\u002Fnlvr\u002Ftree\u002Fmaster\u002Fnlvr2\u002Feval).\nThe official eval script also calculates consistency ('Cons') besides the accuracy.\nWe could use this official script to verify the results by running:\n    ```bash\n    python data\u002Fnlvr2\u002Fnlvr\u002Fnlvr2\u002Feval\u002Fmetrics.py snap\u002Fnlvr2\u002Fnlvr2_lxr955_results\u002Ftest_predict.csv data\u002Fnlvr2\u002Fnlvr\u002Fnlvr2\u002Fdata\u002Ftest1.json\n    ```\n\nThe accuracy of public test ('test-P') set should be almost same to the validation set ('dev'),\nwhich is around 74.0% to 74.5%.\n\n\n#### Unreleased Test Sets\nTo be tested on the unreleased held-out test set (test-U on the \n[leaderboard](http:\u002F\u002Flil.nlp.cornell.edu\u002Fnlvr\u002F)\n),\nthe code needs to be sent.\nPlease check the [NLVR2 official github](https:\u002F\u002Fgithub.com\u002Flil-lab\u002Fnlvr\u002Ftree\u002Fmaster\u002Fnlvr2) \nand [NLVR project website](http:\u002F\u002Flil.nlp.cornell.edu\u002Fnlvr\u002F) for details.\n\n\n### General Debugging Options\nSince it takes a few minutes to load the features, the code has an option to prototype with a small amount of\ntraining data. \n```bash\n# Training with 512 images:\nbash run\u002Fvqa_finetune.bash 0 --tiny \n# Training with 4096 images:\nbash run\u002Fvqa_finetune.bash 0 --fast\n```\n\n## Pre-training\n\n1. Download our aggregated LXMERT dataset from MS COCO, Visual Genome, VQA, and GQA (around 700MB in total). The joint answer labels are saved in `data\u002Flxmert\u002Fall_ans.json`.\n    ```bash\n    mkdir -p data\u002Flxmert\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Flxmert\u002Fmscoco_train.json -P data\u002Flxmert\u002F\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Flxmert\u002Fmscoco_nominival.json -P data\u002Flxmert\u002F\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Flxmert\u002Fvgnococo.json -P data\u002Flxmert\u002F\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Flxmert\u002Fmscoco_minival.json -P data\u002Flxmert\u002F\n    ```\n\n2. [*Skip this if you have run [VQA fine-tuning](#vqa).*] Download the detection features for MS COCO images.\n    ```bash\n    mkdir -p data\u002Fmscoco_imgfeat\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fmscoco_imgfeat\u002Ftrain2014_obj36.zip -P data\u002Fmscoco_imgfeat\n    unzip data\u002Fmscoco_imgfeat\u002Ftrain2014_obj36.zip -d data\u002Fmscoco_imgfeat && rm data\u002Fmscoco_imgfeat\u002Ftrain2014_obj36.zip\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fmscoco_imgfeat\u002Fval2014_obj36.zip -P data\u002Fmscoco_imgfeat\n    unzip data\u002Fmscoco_imgfeat\u002Fval2014_obj36.zip -d data && rm data\u002Fmscoco_imgfeat\u002Fval2014_obj36.zip\n    ```\n\n3. [*Skip this if you have run [GQA fine-tuning](#gqa).*] Download the detection features for Visual Genome images.\n    ```bash\n    mkdir -p data\u002Fvg_gqa_imgfeat\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fvg_gqa_imgfeat\u002Fvg_gqa_obj36.zip -P data\u002Fvg_gqa_imgfeat\n    unzip data\u002Fvg_gqa_imgfeat\u002Fvg_gqa_obj36.zip -d data && rm data\u002Fvg_gqa_imgfeat\u002Fvg_gqa_obj36.zip\n    ```\n\n4. Test on a small split of the MS COCO + Visual Genome datasets:\n    ```bash\n    bash run\u002Flxmert_pretrain.bash 0,1,2,3 --multiGPU --tiny\n    ```\n\n5. Run on the whole [MS COCO](http:\u002F\u002Fcocodataset.org) and [Visual Genome](https:\u002F\u002Fvisualgenome.org\u002F) related datasets (i.e., [VQA](https:\u002F\u002Fvisualqa.org\u002F), [GQA](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fdorarad\u002Fgqa\u002Findex.html), [COCO caption](http:\u002F\u002Fcocodataset.org\u002F#captions-2015), [VG Caption](https:\u002F\u002Fvisualgenome.org\u002F), [VG QA](https:\u002F\u002Fgithub.com\u002Fyukezhu\u002Fvisual7w-toolkit)). \nHere, we take a simple single-stage pre-training strategy (20 epochs with all pre-training tasks) rather than the two-stage strategy in our paper (10 epochs without image QA and 10 epochs with image QA).\nThe pre-training finishes in **8.5 days** on **4 GPUs**.  By the way, I hope that [my experience](experience_in_pretraining.md) in this project would help anyone with limited computational resources.\n    ```bash\n    bash run\u002Flxmert_pretrain.bash 0,1,2,3 --multiGPU\n    ```\n    > Multiple GPUs: Argument `0,1,2,3` indicates taking 4 GPUs to pre-train LXMERT. If the server does not have 4 GPUs (I am sorry to hear that), please consider halving the batch-size or using the [NVIDIA\u002Fapex](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fapex) library to support half-precision computation. \n    The code uses the default data parallelism in PyTorch and thus extensible to less\u002Fmore GPUs. The python main thread would take charge of the data loading. On 4 GPUs, we do not find that the data loading becomes a bottleneck (around 5% overhead). \n    >\n    > GPU Types: We find that either Titan XP, GTX 2080, and Titan V could support this pre-training. However, GTX 1080, with its 11G memory, is a little bit small thus please change the batch_size to 224 (instead of 256).\n\n6. I have **verified these pre-training commands** with 12 epochs. The pre-trained weights from previous process could be downloaded from `https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Fgithub_pretrain\u002Flxmert\u002FEpochXX_LXRT.pth`, XX from `01` to `12`. The results are roughly the same (around 0.3% lower in downstream tasks because of fewer epochs). \n\n7. Explanation of arguments in the pre-training script `run\u002Flxmert_pretrain.bash`:\n    ```bash\n    python src\u002Fpretrain\u002Flxmert_pretrain_new.py \\\n        # The pre-training tasks\n        --taskMaskLM --taskObjPredict --taskMatched --taskQA \\  \n        \n        # Vision subtasks\n        # obj \u002F attr: detected object\u002Fattribute label prediction.\n        # feat: RoI feature regression.\n        --visualLosses obj,attr,feat \\\n        \n        # Mask rate for words and objects\n        --wordMaskRate 0.15 --objMaskRate 0.15 \\\n        \n        # Training and validation sets\n        # mscoco_nominival + mscoco_minival = mscoco_val2014\n        # visual genome - mscoco = vgnococo\n        --train mscoco_train,mscoco_nominival,vgnococo --valid mscoco_minival \\\n        \n        # Number of layers in each encoder\n        --llayers 9 --xlayers 5 --rlayers 5 \\\n        \n        # Train from scratch (Using intialized weights) instead of loading BERT weights.\n        --fromScratch \\\n    \n        # Hyper parameters\n        --batchSize 256 --optim bert --lr 1e-4 --epochs 20 \\\n        --tqdm --output $output ${@:2}\n    ```\n\n\n## Alternative Dataset and Features Download Links \nAll default download links are provided by our servers in [UNC CS department](https:\u002F\u002Fcs.unc.edu) and under \nour [NLP group website](https:\u002F\u002Fnlp.cs.unc.edu) but the network bandwidth might be limited. \nWe thus provide a few other options with Google Drive and Baidu Drive.\n\nThe files in online drives are almost structured in the same way \nas our repo but have a few differences due to specific policies.\nAfter downloading the data and features from the drives, \nplease re-organize them under `data\u002F` folder according to the following example:\n```\nREPO ROOT\n |\n |-- data                  \n |    |-- vqa\n |    |    |-- train.json\n |    |    |-- minival.json\n |    |    |-- nominival.json\n |    |    |-- test.json\n |    |\n |    |-- mscoco_imgfeat\n |    |    |-- train2014_obj36.tsv\n |    |    |-- val2014_obj36.tsv\n |    |    |-- test2015_obj36.tsv\n |    |\n |    |-- vg_gqa_imgfeat -- *.tsv\n |    |-- gqa -- *.json\n |    |-- nlvr2_imgfeat -- *.tsv\n |    |-- nlvr2 -- *.json\n |    |-- lxmert -- *.json          # Pre-training data\n | \n |-- snap\n |-- src\n```\n\nPlease also kindly contact us if anything is missing!\n\n### Google Drive\nAs an alternative way to download feature from our UNC server,\nyou could also download the feature from google drive with link [https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1Gq1uLUk6NdD0CcJOptXjxE6ssY5XAuat?usp=sharing](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1Gq1uLUk6NdD0CcJOptXjxE6ssY5XAuat?usp=sharing).\nThe structure of the folders on drive is:\n```\nGoogle Drive Root\n |-- data                  # The raw data and image features without compression\n |    |-- vqa\n |    |-- gqa\n |    |-- mscoco_imgfeat\n |    |-- ......\n |\n |-- image_feature_zips    # The image-feature zip files (Around 45% compressed)\n |    |-- mscoco_imgfeat.zip\n |    |-- nlvr2_imgfeat.zip\n |    |-- vg_gqa_imgfeat.zip\n |\n |-- snap -- pretrained -- model_LXRT.pth # The pytorch pre-trained model weights.\n```\nNote: image features in zip files (e.g., `mscoco_mgfeat.zip`) are the same to which in `data\u002F` (i.e., `data\u002Fmscoco_imgfeat`). \nIf you want to save network bandwidth, please download the feature zips and skip downloading the `*_imgfeat` folders under `data\u002F`.\n### Baidu Drive\n\nSince [Google Drive](\nhttps:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1Gq1uLUk6NdD0CcJOptXjxE6ssY5XAuat?usp=sharing\n) is not officially available across the world,\nwe also create a mirror on Baidu drive (i.e., Baidu PAN). \nThe dataset and features could be downloaded with shared link \n[https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1m0mUVsq30rO6F1slxPZNHA](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1m0mUVsq30rO6F1slxPZNHA) \nand access code `wwma`.\n```\nBaidu Drive Root\n |\n |-- vqa\n |    |-- train.json\n |    |-- minival.json\n |    |-- nominival.json\n |    |-- test.json\n |\n |-- mscoco_imgfeat\n |    |-- train2014_obj36.zip\n |    |-- val2014_obj36.zip\n |    |-- test2015_obj36.zip\n |\n |-- vg_gqa_imgfeat -- *.zip.*  # Please read README.txt under this folder\n |-- gqa -- *.json\n |-- nlvr2_imgfeat -- *.zip.*   # Please read README.txt under this folder\n |-- nlvr2 -- *.json\n |-- lxmert -- *.json\n | \n |-- pretrained -- model_LXRT.pth\n```\n\nSince Baidu Drive does not support extremely large files, \nwe `split` a few features zips into multiple small files. \nPlease follow the `README.txt` under `baidu_drive\u002Fvg_gqa_imgfeat` and \n`baidu_drive\u002Fnlvr2_imgfeat` to concatenate back to the feature zips with command `cat`.\n\n\n## Code and Project Explanation\n- All code is in folder `src`. The basics in `lxrt`.\nThe python files related to pre-training and fine-tuning are saved in `src\u002Fpretrain` and `src\u002Ftasks` respectively.\n- I kept folders containing image features (e.g., mscoco_imgfeat) separated from vision-and-language dataset (e.g., vqa, lxmert) because\nmultiple vision-and-language datasets would share common images.\n- We use the name `lxmert` for our framework and use the name `lxrt`\n(Language, Cross-Modality, and object-Relationship Transformers) to refer to our our models.\n- To be consistent with the name `lxrt` (Language, Cross-Modality, and object-Relationship Transformers), \nwe use `lxrXXX` to denote the number of layers.\nE.g., `lxr955` (used in current pre-trained model) indicates \na model with 9 Language layers, 5 cross-modality layers, and 5 object-Relationship layers. \nIf we consider a single-modality layer as a half of cross-modality layer, \nthe total number of layers is `(9 + 5) \u002F 2 + 5 = 12`, which is the same as `BERT_BASE`.\n- We share the weight between the two cross-modality attention sub-layers. Please check the [`visual_attention` variable](blob\u002Fmaster\u002Fsrc\u002Flxrt\u002Fmodeling.py#L521), which is used to compute both `lang->visn` attention and `visn->lang` attention. (I am sorry that the name `visual_attention` is misleading because I deleted the `lang_attention` there.) Sharing weights is mostly used for saving computational resources and it also (intuitively) helps forcing the features from visn\u002Flang into a joint subspace.\n- The box coordinates are not normalized from [0, 1] to [-1, 1], which looks like a typo but actually not ;). Normalizing the coordinate would not affect the output of box encoder (mathematically and almost numerically). ~~(Hint: consider the LayerNorm in positional encoding)~~\n\n\n## Faster R-CNN Feature Extraction\n\n\nWe use the Faster R-CNN feature extractor demonstrated in [\"Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering\", CVPR 2018](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.07998)\nand its released code at [Bottom-Up-Attention github repo](https:\u002F\u002Fgithub.com\u002Fpeteanderson80\u002Fbottom-up-attention).\nIt was trained on [Visual Genome](https:\u002F\u002Fvisualgenome.org\u002F) dataset and implemented based on a specific [Caffe](https:\u002F\u002Fcaffe.berkeleyvision.org\u002F) version.\n\n\nTo extract features with this Caffe Faster R-CNN, we publicly release a docker image `airsplay\u002Fbottom-up-attention` on docker hub that takes care of all the dependencies and library installation . Instructions and examples are demonstrated below. You could also follow the installation instructions in the bottom-up attention github to setup the tool: [https:\u002F\u002Fgithub.com\u002Fpeteanderson80\u002Fbottom-up-attention](https:\u002F\u002Fgithub.com\u002Fpeteanderson80\u002Fbottom-up-attention). \n\nThe BUTD feature extractor is widely used in many other projects. If you want to reproduce the results from their paper, feel free to use our docker as a tool.\n\n\n### Feature Extraction with Docker\n[Docker](https:\u002F\u002Fwww.docker.com\u002F) is a easy-to-use virtualization tool which allows you to plug and play without installing libraries.\n\nThe built docker file for bottom-up-attention is released on [docker hub](https:\u002F\u002Fhub.docker.com\u002Fr\u002Fairsplay\u002Fbottom-up-attention) and could be downloaded with command: \n```bash\nsudo docker pull airsplay\u002Fbottom-up-attention\n```\n> The `Dockerfile` could be downloaed [here](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1KJjwQtqisXvinWm8OORk-_3XYLBHYCIK\u002Fview?usp=sharing), which allows using other CUDA versions.\n\nAfter pulling the docker, you could test running the docker container with command:\n```bash\ndocker run --gpus all --rm -it airsplay\u002Fbottom-up-attention bash\n``` \n\n\nIf errors about `--gpus all` popped up, please read the next section.\n\n#### Docker GPU Access\nNote that the purpose of the argument `--gpus all` is to expose GPU devices to the docker container, and it requires Docker >= 19.03 along with `nvidia-container-toolkit`:\n1. [Docker CE 19.03](https:\u002F\u002Fdocs.docker.com\u002Finstall\u002Flinux\u002Fdocker-ce\u002Fubuntu\u002F)\n2. [nvidia-container-toolkit](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fnvidia-docker)\n\nFor running Docker with an older version, either update it to 19.03 or use the flag `--runtime=nvidia` instead of `--gpus all`.\n\n#### An Example: Feature Extraction for NLVR2 \nWe demonstrate how to extract Faster R-CNN features of NLVR2 images.\n\n1. Please first follow the instructions on the [NLVR2 official repo](https:\u002F\u002Fgithub.com\u002Flil-lab\u002Fnlvr\u002Ftree\u002Fmaster\u002Fnlvr2) to get the images.\n\n2. Download the pre-trained Faster R-CNN model. Instead of using the default pre-trained model (trained with 10 to 100 boxes), we use the ['alternative pretrained model'](https:\u002F\u002Fgithub.com\u002Fpeteanderson80\u002Fbottom-up-attention#demo) which was trained with 36 boxes. \n    ```bash\n    wget 'https:\u002F\u002Fwww.dropbox.com\u002Fs\u002F2h4hmgcvpaewizu\u002Fresnet101_faster_rcnn_final_iter_320000.caffemodel?dl=1' -O data\u002Fnlvr2_imgfeat\u002Fresnet101_faster_rcnn_final_iter_320000.caffemodel\n    ```\n\n3. Run docker container with command:\n    ```bash\n    docker run --gpus all -v \u002Fpath\u002Fto\u002Fnlvr2\u002Fimages:\u002Fworkspace\u002Fimages:ro -v \u002Fpath\u002Fto\u002Flxrt_public\u002Fdata\u002Fnlvr2_imgfeat:\u002Fworkspace\u002Ffeatures --rm -it airsplay\u002Fbottom-up-attention bash\n    ```\n    `-v` mounts the folders on host os to the docker image container.\n    > Note0: If it says something about 'privilege', add `sudo` before the command.\n    >\n    > Note1: If it says something about '--gpus all', it means that the GPU options are not correctly set. Please read [Docker GPU Access](#docker-gpu-access) for the instructions to allow GPU access.\n    >\n    > Note2: \u002Fpath\u002Fto\u002Fnlvr2\u002Fimages would contain subfolders `train`, `dev`, `test1` and `test2`.\n    >\n    > Note3: Both paths '\u002Fpath\u002Fto\u002Fnlvr2\u002Fimages\u002F' and '\u002Fpath\u002Fto\u002Flxrt_public' requires absolute paths.\n\n\n4. Extract the features **inside the docker container**. The extraction script is copied from [butd\u002Ftools\u002Fgenerate_tsv.py](https:\u002F\u002Fgithub.com\u002Fpeteanderson80\u002Fbottom-up-attention\u002Fblob\u002Fmaster\u002Ftools\u002Fgenerate_tsv.py) and modified by [Jie Lei](http:\u002F\u002Fwww.cs.unc.edu\u002F~jielei\u002F) and me.\n    ```bash\n    cd \u002Fworkspace\u002Ffeatures\n    CUDA_VISIBLE_DEVICES=0 python extract_nlvr2_image.py --split train \n    CUDA_VISIBLE_DEVICES=0 python extract_nlvr2_image.py --split valid\n    CUDA_VISIBLE_DEVICES=0 python extract_nlvr2_image.py --split test\n    ```\n\n5. It would takes around 5 to 6 hours for the training split and 1 to 2 hours for the valid and test splits. Since it is slow, I recommend to run them parallelly if there are multiple GPUs. It could be achieved by changing the `gpu_id` in `CUDA_VISIBLE_DEVICES=$gpu_id`.\n\nThe features will be saved in `train.tsv`, `valid.tsv`, and `test.tsv` under the directory `data\u002Fnlvr2_imgfeat`, outside the docker container. I have verified the extracted image features are the same to the ones I provided in [NLVR2 fine-tuning](#nlvr2).\n\n#### Yet Another Example: Feature Extraction for MS COCO Images\n1. Download the MS COCO train2014, val2014, and test2015 images from [MS COCO official website](http:\u002F\u002Fcocodataset.org\u002F#download).\n\n2. Download the pre-trained Faster R-CNN model. \n    ```bash\n    mkdir -p data\u002Fmscoco_imgfeat\n    wget 'https:\u002F\u002Fwww.dropbox.com\u002Fs\u002F2h4hmgcvpaewizu\u002Fresnet101_faster_rcnn_final_iter_320000.caffemodel?dl=1' -O data\u002Fmscoco_imgfeat\u002Fresnet101_faster_rcnn_final_iter_320000.caffemodel\n    ```\n\n3. Run the docker container with the command:\n    ```bash\n    docker run --gpus all -v \u002Fpath\u002Fto\u002Fmscoco\u002Fimages:\u002Fworkspace\u002Fimages:ro -v $(pwd)\u002Fdata\u002Fmscoco_imgfeat:\u002Fworkspace\u002Ffeatures --rm -it airsplay\u002Fbottom-up-attention bash\n    ```\n    > Note: Option `-v` mounts the folders outside container to the paths inside the container.\n    > \n    > Note1: Please use the **absolute path** to the MS COCO images folder `images`. The `images` folder containing the `train2014`, `val2014`, and `test2015` sub-folders. (It's the standard way to save MS COCO images.)\n\n4. Extract the features **inside the docker container**.\n    ```bash\n    cd \u002Fworkspace\u002Ffeatures\n    CUDA_VISIBLE_DEVICES=0 python extract_coco_image.py --split train \n    CUDA_VISIBLE_DEVICES=0 python extract_coco_image.py --split valid\n    CUDA_VISIBLE_DEVICES=0 python extract_coco_image.py --split test\n    ```\n \n5. Exit from the docker container (by executing `exit` command in bash). The extracted features would be saved under folder `data\u002Fmscoco_imgfeat`. \n\n\n## Reference\nIf you find this project helps, please cite our paper :)\n\n```bibtex\n@inproceedings{tan2019lxmert,\n  title={LXMERT: Learning Cross-Modality Encoder Representations from Transformers},\n  author={Tan, Hao and Bansal, Mohit},\n  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},\n  year={2019}\n}\n```\n\n## Acknowledgement\nWe thank the funding support from ARO-YIP Award #W911NF-18-1-0336, & awards from Google, Facebook, Salesforce, and Adobe.\n\nWe thank [Peter Anderson](https:\u002F\u002Fpanderson.me\u002F) for providing the faster R-CNN code and pre-trained models under\n[Bottom-Up-Attention Github Repo](https:\u002F\u002Fgithub.com\u002Fpeteanderson80\u002Fbottom-up-attention).  We thank [Hengyuan Hu](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fhengyuan-hu-8963b313b) for his [PyTorch VQA](https:\u002F\u002Fgithub.com\u002Fhengyuan-hu\u002Fbottom-up-attention-vqa) implementation, our VQA implementation borrows its pre-processed answers.\nWe thank [hugginface](https:\u002F\u002Fgithub.com\u002Fhuggingface) for releasing the excellent PyTorch code \n[PyTorch Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpytorch-transformers).  \n\nWe thank [Drew A. Hudson](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fdrew-a-hudson\u002F) to answer all our questions about GQA specification.\nWe thank [Alane Suhr](http:\u002F\u002Falanesuhr.com\u002F) for helping test LXMERT on NLVR2 unreleased test split and provide [a detailed analysis](http:\u002F\u002Flil.nlp.cornell.edu\u002Fnlvr\u002FNLVR2BiasAnalysis.html).\n\nWe thank all the authors and annotators of vision-and-language datasets \n(i.e., \n[MS COCO](http:\u002F\u002Fcocodataset.org\u002F#home), \n[Visual Genome](https:\u002F\u002Fvisualgenome.org\u002F),\n[VQA](https:\u002F\u002Fvisualqa.org\u002F),\n[GQA](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fdorarad\u002Fgqa\u002F),\n[NLVR2](http:\u002F\u002Flil.nlp.cornell.edu\u002Fnlvr\u002F)\n), \nwhich allows us to develop a pre-trained model for vision-and-language tasks.\n\nWe thank [Jie Lei](http:\u002F\u002Fwww.cs.unc.edu\u002F~jielei\u002F) and [Licheng Yu](http:\u002F\u002Fwww.cs.unc.edu\u002F~licheng\u002F) for their helpful discussions. I also want to thank [Shaoqing Ren](https:\u002F\u002Fwww.shaoqingren.com\u002F) to teach me vision knowledge when I was in MSRA.  We also thank you to help look into our code. Please kindly contact us if you find any issue. Comments are always welcome.\n\nLXRThanks.\n","# LXMERT：从 Transformer 学习跨模态编码器表示\n\n**我们的服务器又出问题了 :(。我已经更新了链接，现在应该可以正常使用了。对于造成的不便，我们深感抱歉。如果还有其他问题，请随时告知我。谢谢！--Hao，12月03日**\n\n## 简介\n这是 EMNLP 2019 论文 [\"LXMERT: Learning Cross-Modality Encoder Representations from Transformers\"](https:\u002F\u002Farxiv.org\u002Fabs\u002F1908.07490) 的 PyTorch 代码实现。我们在 EMNLP 2019 上的演讲幻灯片可在此 [查看](http:\u002F\u002Fwww.cs.unc.edu\u002F~airsplay\u002FEMNLP_2019_LXMERT_slides.pdf)。\n\n- 若要分析预训练模型的输出（而非在下游任务上进行微调），请加载权重 `https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Fgithub_pretrain\u002Flxmert20\u002FEpoch20_LXRT.pth`，该权重是按照[预训练]部分所述的方式训练得到的。默认权重[此处](#pre-trained-models)则是使用与此代码略有不同的协议进行训练的。\n\n\n## 结果（基于此 GitHub 版本）\n\n| 划分            | [VQA](https:\u002F\u002Fvisualqa.org\u002F)     | [GQA](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fdorarad\u002Fgqa\u002F)     | [NLVR2](http:\u002F\u002Flil.nlp.cornell.edu\u002Fnlvr\u002F)  |\n|-----------       |:----:   |:---:    |:------:|\n| 局部验证 | 69.90%  | 59.80%  | 74.95% |\n| 测试-开发         | 72.42%  | 60.00%  | 74.45% (Test-P) |\n| 测试-标准    | 72.54%  | 60.33%  | 76.18% (Test-U) |\n\n表中所有结果均完全由本代码库生成。\n由于 [VQA](https:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F163\u002Foverview) 和 [GQA](https:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F225\u002Foverview) 的测试服务器仅允许有限次数的“测试-标准”提交，\n我们便利用自己在 2019 年 [VQA](https:\u002F\u002Fvisualqa.org\u002Fchallenge.html)\u002F[GQA](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fdorarad\u002Fgqa\u002Fchallenge.html) 挑战赛中剩余的提交名额来获得这些结果。\n至于 [NLVR2](http:\u002F\u002Flil.nlp.cornell.edu\u002Fnlvr\u002F)，我们仅在未公开的测试集（test-U）上进行了单次测试。\n\n我们使用此代码（结合模型集成）参加了 2019 年的 [VQA](https:\u002F\u002Fvisualqa.org\u002Froe.html) 和 [GQA](https:\u002F\u002Fdrive.google.com\u002Fopen?id=1CtFk0ldbN5w2qhwvfKrNzAFEj-I9Tjgy) 挑战赛。\n我们是这两项挑战赛中唯一一支同时进入 **前三名** 的队伍。\n\n\n## 预训练模型\n预训练模型（870 MB）可在 http:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Fmodel_LXRT.pth 获取，下载命令如下：\n```bash\nmkdir -p snap\u002Fpretrained \nwget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Fmodel_LXRT.pth -P snap\u002Fpretrained\n```\n\n\n若下载速度较慢，也可从[其他来源](#alternative-dataset-and-features-download-links)下载预训练模型。请将下载的文件放置于 `snap\u002Fpretrained\u002Fmodel_LXRT.pth`。\n\n我们还在[预训练]部分提供了用于预训练模型的数据和命令。默认设置需要 4 张 GPU 卡，整个过程大约需要一周时间。使用本代码库训练的预训练权重可从 `https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Fgithub_pretrain\u002Flxmert\u002FEpochXX_LXRT.pth` 下载，其中 `XX` 为 01 至 12。这些权重仅预训练了 12 个 epoch（而 EMNLP 论文中为 20 个 epoch），因此在各数据集上的微调结果会低约 0.3%。\n\n\n\n## 在视觉-语言任务上进行微调\n我们使用以下超参数对 LXMERT 预训练模型进行各个任务的微调：\n\n|数据集      | 批量大小   | 学习率   | 轮数  | 加载答案  |\n|---   |:---:|:---:   |:---:|:---:|\n|VQA   | 32  | 5e-5   | 4   | 是 |\n|GQA   | 32  | 1e-5   | 4   | 是 |\n|NLVR2 | 32  | 5e-5   | 4   | 否  |\n\n尽管微调过程几乎相同，只是超参数有所不同，\n我们仍针对每个数据集提供了详细说明，以确保涵盖所有细节。\n\n### 一般说明 \n该代码需要 **Python 3**，请通过以下命令安装 Python 依赖：\n```bash\npip install -r requirements.txt\n```\n\n此外，您也可以创建并激活一个 Python 3 虚拟环境：\n```bash\nvirtualenv name_of_environment -p python3\nsource name_of_environment\u002Fbin\u002Factivate\n```\n\n### VQA\n#### 微调\n1. 请确保 LXMERT 预训练模型已 [下载](#pre-trained-models) 或 [预训练](#pre-training) 完成。\n\n2. 下载重新分发的 VQA 2.0 数据集 JSON 文件。原始 VQA 2.0 数据集可从 [官方网站](https:\u002F\u002Fvisualqa.org\u002Fdownload.html) 下载。\n    ```bash\n    mkdir -p data\u002Fvqa\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fvqa\u002Ftrain.json -P data\u002Fvqa\u002F\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fvqa\u002Fnominival.json -P  data\u002Fvqa\u002F\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fvqa\u002Fminival.json -P data\u002Fvqa\u002F\n    ```\n3. 下载 MS COCO train2014（17 GB）和 val2014（8 GB）图像的 Faster R-CNN 特征（VQA 2.0 是基于 MS COCO 数据集收集的）。\n这些图像特征也可在 Google Drive 和 Baidu Drive 上获取（详情参见 [替代下载链接](#alternative-dataset-and-features-download-links)）。\n    ```bash\n    mkdir -p data\u002Fmscoco_imgfeat\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fmscoco_imgfeat\u002Ftrain2014_obj36.zip -P data\u002Fmscoco_imgfeat\n    unzip data\u002Fmscoco_imgfeat\u002Ftrain2014_obj36.zip -d data\u002Fmscoco_imgfeat && rm data\u002Fmscoco_imgfeat\u002Ftrain2014_obj36.zip\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fmscoco_imgfeat\u002Fval2014_obj36.zip -P data\u002Fmscoco_imgfeat\n    unzip data\u002Fmscoco_imgfeat\u002Fval2014_obj36.zip -d data && rm data\u002Fmscoco_imgfeat\u002Fval2014_obj36.zip\n    ```\n4. 在对整个 VQA 2.0 训练集进行微调之前，建议先在一个小型训练集（512 张图像）上验证脚本和模型。\n第一个参数 `0` 表示 GPU ID。第二个参数 `vqa_lxr955_tiny` 是本次实验的名称。\n    ```bash\n    bash run\u002Fvqa_finetune.bash 0 vqa_lxr955_tiny --tiny\n    ```\n5. 如果没有出现错误，则可以开始在完整 VQA 语料库上训练模型：\n    ```bash\n    bash run\u002Fvqa_finetune.bash 0 vqa_lxr955\n    ```\n大约需要 8 小时（每轮 2 小时 × 4 轮）才能收敛。**日志**和**模型快照**将保存在 `snap\u002Fvqa\u002Fvqa_lxr955` 文件夹下。训练后的验证结果约为 **69.7%** 至 **70.2%**。\n\n#### 本地验证\n在训练过程中会打印出验证集（我们的 minival 集）上的结果。验证结果也会保存到 `snap\u002Fvqa\u002F[experiment-name]\u002Flog.log` 文件中。如果日志文件被意外删除，也可以通过模型快照重现训练中的验证结果：\n```bash\nbash run\u002Fvqa_test.bash 0 vqa_lxr955_results --test minival --load snap\u002Fvqa\u002Fvqa_lxr955\u002FBEST\n```\n\n#### 提交至 VQA 测试服务器\n1. 下载我们重新分发的包含 VQA 2.0 测试数据的 JSON 文件。\n    ```bash\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fvqa\u002Ftest.json -P data\u002Fvqa\u002F\n    ```\n2. 下载 MS COCO test2015 分割的 Faster R-CNN 特征（16 GB）。\n    ```bash\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fmscoco_imgfeat\u002Ftest2015_obj36.zip -P data\u002Fmscoco_imgfeat\n    unzip data\u002Fmscoco_imgfeat\u002Ftest2015_obj36.zip -d data && rm data\u002Fmscoco_imgfeat\u002Ftest2015_obj36.zip\n    ```\n3. 由于 VQA 提交系统要求提交完整的测试数据，我们需要对所有测试分割（即 test dev、test standard、test challenge 和 test held-out）运行推理。运行测试推理大约需要 10~15 分钟（共 448K 个实例）。\n    ```bash\n    bash run\u002Fvqa_test.bash 0 vqa_lxr955_results --test test --load snap\u002Fvqa\u002Fvqa_lxr955\u002FBEST\n    ```\n测试结果将保存在 `snap\u002Fvqa_lxr955_results\u002Ftest_predict.json` 中。今年的 VQA 2.0 挑战赛在 [EvalAI](https:\u002F\u002Fevalai.cloudcv.org\u002F) 的 [https:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F163\u002Foverview](https:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F163\u002Foverview) 上举行。挑战赛结束后仍可继续提交。请查看 [VQA 挑战赛官网](https:\u002F\u002Fvisualqa.org\u002Fchallenge.html) 获取详细信息，并按照 [EvalAI](https:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F163\u002Foverview) 上的说明进行提交。通常情况下，注册完成后，只需上传 `test_predict.json` 文件并等待结果即可。\n\n使用完全相同的代码，test-dev 的测试准确率为 **72.42%**，test-standard 的测试准确率为 **72.54%**。基于该代码库的结果也已在 [VQA 2.0 排行榜](\nhttps:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F163\u002Fleaderboard\u002F498\n) 上以条目 `LXMERT github version` 公开展示。\n\n### GQA\n\n#### 微调\n1. 请确保 LXMERT 预训练模型已[下载](#pre-trained-models)或[预训练](#pre-training)完成。\n\n2. 下载 GQA 平衡版本数据集的重新分发 JSON 文件。  \n   原始 GQA 数据集可在其官网的“下载”部分获取：[https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fdorarad\u002Fgqa\u002Fdownload.html](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fdorarad\u002Fgqa\u002Fdownload.html)，预处理脚本位于 `data\u002Fgqa\u002Fprocess_raw_data_scripts` 目录下。\n    ```bash\n    mkdir -p data\u002Fgqa\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fgqa\u002Ftrain.json -P data\u002Fgqa\u002F\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fgqa\u002Fvalid.json -P data\u002Fgqa\u002F\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fgqa\u002Ftestdev.json -P data\u002Fgqa\u002F\n    ```\n3. 下载 Visual Genome 和 GQA 测试图像的 Faster R-CNN 特征（30 GB）。  \n   GQA 的训练和验证数据来自 Visual Genome，而其测试图像则来自 MS COCO 测试集（这一点已由 GQA 的作者之一 Drew A. Hudson 确认，LinkedIn 链接：[https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fdrew-a-hudson\u002F](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fdrew-a-hudson\u002F)）。  \n   图像特征也可通过 Google Drive 和百度网盘获取（详情见[替代下载链接](#alternative-dataset-and-features-download-links)）。\n    ```bash\n    mkdir -p data\u002Fvg_gqa_imgfeat\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fvg_gqa_imgfeat\u002Fvg_gqa_obj36.zip -P data\u002Fvg_gqa_imgfeat\n    unzip data\u002Fvg_gqa_imgfeat\u002Fvg_gqa_obj36.zip -d data && rm data\u002Fvg_gqa_imgfeat\u002Fvg_gqa_obj36.zip\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fvg_gqa_imgfeat\u002Fgqa_testdev_obj36.zip -P data\u002Fvg_gqa_imgfeat\n    unzip data\u002Fvg_gqa_imgfeat\u002Fgqa_testdev_obj36.zip -d data && rm data\u002Fvg_gqa_imgfeat\u002Fgqa_testdev_obj36.zip\n    ```\n4. 在对整个 GQA 训练+验证集进行微调之前，建议先在小规模训练集（512 张图像）上验证脚本和模型。  \n   第一个参数 `0` 表示 GPU ID，第二个参数 `gqa_lxr955_tiny` 是本次实验的名称。\n    ```bash\n    bash run\u002Fgqa_finetune.bash 0 gqa_lxr955_tiny --tiny\n    ```\n5. 如果未出现错误，则可以开始在完整的 GQA 语料库（训练集 + 验证集）上训练模型，并在 testdev 集上进行验证：\n    ```bash\n    bash run\u002Fgqa_finetune.bash 0 gqa_lxr955\n    ```\n   大约需要 16 小时（每轮 4 小时 × 4 轮）才能收敛。  \n   **日志**和**模型快照**将保存在 `snap\u002Fgqa\u002Fgqa_lxr955` 文件夹中。  \n   训练后的验证结果大约在 **59.8% 到 60.1%** 之间。\n\n#### 本地验证\n训练过程中会打印出 testdev 上的结果，并保存在 `snap\u002Fgqa\u002Fgqa_lxr955\u002Flog.log` 中。  \n也可以通过以下命令重新计算：\n```bash\nbash run\u002Fgqa_test.bash 0 gqa_lxr955_results --load snap\u002Fgqa\u002Fgqa_lxr955\u002FBEST --test testdev --batchSize 1024\n```\n\n> 注意：我们的本地 testdev 结果通常比提交的 testdev 结果低 0.1% 至 0.5%。  \n   这是因为测试服务器使用了[高级评估系统](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fdorarad\u002Fgqa\u002Fevaluate.html)，而我们本地的评估器仅计算精确匹配。  \n   如果您希望在不提交的情况下获得准确的分数，请使用[官方评估工具](https:\u002F\u002Fnlp.stanford.edu\u002Fdata\u002Fgqa\u002Feval.zip)（784 MB）。\n\n#### 提交至 GQA 测试服务器\n1. 下载我们重新分发的包含 GQA 测试数据的 JSON 文件。\n    ```bash\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fgqa\u002Fsubmit.json -P data\u002Fgqa\u002F\n    ```\n\n2. 由于 GQA 提交系统要求提交完整的测试数据，因此我们需要对所有测试分割运行推理。  \n   运行测试推理大约需要 30–60 分钟（共需处理 420 万个实例）。\n    ```bash\n    bash run\u002Fgqa_test.bash 0 gqa_lxr955_results --load snap\u002Fgqa\u002Fgqa_lxr955\u002FBEST --test submit --batchSize 1024\n    ```\n\n3. 运行测试脚本后，`snap\u002Fgqa\u002Fgqa_lxr955_results` 目录下会生成一个名为 `submit_predict.json` 的文件，其中包含所有预测结果，可直接提交。  \n   GQA 挑战赛 2019 由 [EvalAI](https:\u002F\u002Fevalai.cloudcv.org\u002F) 托管，网址为：[https:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F225\u002Foverview](https:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F225\u002Foverview)。  \n   注册账号、上传 `submit_predict.json` 文件并等待结果即可。  \n   请同时关注[GQA 官方网站](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fdorarad\u002Fgqa\u002F)，以防测试服务器发生变更。\n\n使用完全相同的代码，test-dev 的测试准确率为 **60.00%**，test-standard 的测试准确率为 **60.33%**。  \n基于该代码库的结果也已在[GQA 排行榜](https:\u002F\u002Fevalai.cloudcv.org\u002Fweb\u002Fchallenges\u002Fchallenge-page\u002F225\u002Fleaderboard)上公开显示，条目为 `LXMERT github version`。\n\n### NLVR2\n\n#### 微调\n\n1. 从官方 [GitHub 仓库](https:\u002F\u002Fgithub.com\u002Flil-lab\u002Fnlvr) 下载 NLVR2 数据。\n    ```bash\n    git submodule update --init\n    ```\n\n\n2. 将 NLVR2 数据处理为 JSON 文件。\n    ```bash\n    bash -c \"cd data\u002Fnlvr2\u002Fprocess_raw_data_scripts && python process_dataset.py\"\n    ```\n\n3. 下载 NLVR2 的训练集（21 GB）和验证集（1.6 GB）的图像特征。这些图像特征也可在 Google Drive 和百度网盘上获取（详情请参见 [替代下载链接](#alternative-dataset-and-features-download-links)）。如需访问原始图像，请按照 [NLVR2 官方 GitHub](https:\u002F\u002Fgithub.com\u002Flil-lab\u002Fnlvr\u002Ftree\u002Fmaster\u002Fnlvr2) 上的说明操作。图像可以通过提供的 URL 直接下载，也可以通过签署数据使用协议来获取。特征提取方法请参考 [Faster R-CNN 特征提取](#faster-r-cnn-feature-extraction)。\n    ```bash\n    mkdir -p data\u002Fnlvr2_imgfeat\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fnlvr2_imgfeat\u002Ftrain_obj36.zip -P data\u002Fnlvr2_imgfeat\n    unzip data\u002Fnlvr2_imgfeat\u002Ftrain_obj36.zip -d data && rm data\u002Fnlvr2_imgfeat\u002Ftrain_obj36.zip\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fnlvr2_imgfeat\u002Fvalid_obj36.zip -P data\u002Fnlvr2_imgfeat\n    unzip data\u002Fnlvr2_imgfeat\u002Fvalid_obj36.zip -d data && rm data\u002Fnlvr2_imgfeat\u002Fvalid_obj36.zip\n    ```\n\n4. 在对整个 NLVR2 训练集进行微调之前，建议先在一个小型训练集（512 张图像）上验证脚本和模型。第一个参数 `0` 是 GPU ID；第二个参数 `nlvr2_lxr955_tiny` 是本次实验的名称。如果在这个小规模数据集上的结果较低（50~55），也不必担心，因为使用完整训练数据后性能会回升。\n    ```bash\n    bash run\u002Fnlvr2_finetune.bash 0 nlvr2_lxr955_tiny --tiny\n    ```\n\n5. 如果上一步没有出现任何错误，则说明代码、数据和图像特征均已准备就绪。请使用以下命令在完整训练集上进行训练。在 NLVR2 验证集（dev）上的结果大约会在 **74.0** 到 **74.5** 之间。\n    ```bash\n    bash run\u002Fnlvr2_finetune.bash 0 nlvr2_lxr955\n    ```\n\n#### 在公开测试集上的推理\n1. 下载 NLVR2 公开测试集的图像特征（1.6 GB）。\n    ```bash\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fnlvr2_imgfeat\u002Ftest_obj36.zip -P data\u002Fnlvr2_imgfeat\n    unzip data\u002Fnlvr2_imgfeat\u002Ftest_obj36.zip -d data\u002Fnlvr2_imgfeat && rm data\u002Fnlvr2_imgfeat\u002Ftest_obj36.zip\n    ```\n\n2. 使用以下命令在公开测试集（对应于 [NLVR2 排行榜](http:\u002F\u002Flil.nlp.cornell.edu\u002Fnlvr\u002F) 中的 'test-P'）上进行测试：\n    ```bash\n    bash run\u002Fnlvr2_test.bash 0 nlvr2_lxr955_results --load snap\u002Fnlvr2\u002Fnlvr2_lxr955\u002FBEST --test test --batchSize 1024\n    ```\n\n3. 大约 5~10 分钟后，屏幕上会显示测试准确率。预测结果也会保存在 `snap\u002Fnlvr2_lxr955_reuslts` 目录下的 `test_predict.csv` 文件中，该文件与 NLVR2 [官方评估脚本](https:\u002F\u002Fgithub.com\u002Flil-lab\u002Fnlvr\u002Ftree\u002Fmaster\u002Fnlvr2\u002Feval) 兼容。除了准确率外，官方评估脚本还会计算一致性指标（'Cons'）。我们可以通过运行以下命令来验证结果：\n    ```bash\n    python data\u002Fnlvr2\u002Fnlvr\u002Fnlvr2\u002Feval\u002Fmetrics.py snap\u002Fnlvr2\u002Fnlvr2_lxr955_results\u002Ftest_predict.csv data\u002Fnlvr2\u002Fnlvr\u002Fnlvr2\u002Fdata\u002Ftest1.json\n    ```\n\n公开测试集（'test-P'）的准确率应与验证集（'dev'）大致相同，约为 74.0% 至 74.5%。\n\n\n#### 未发布的测试集\n若要在未发布的保留测试集（[排行榜](http:\u002F\u002Flil.nlp.cornell.edu\u002Fnlvr\u002F) 上的 'test-U'）上进行测试，需要提交代码。详情请查看 [NLVR2 官方 GitHub](https:\u002F\u002Fgithub.com\u002Flil-lab\u002Fnlvr\u002Ftree\u002Fmaster\u002Fnlvr2) 和 [NLVR 项目官网](http:\u002F\u002Flil.nlp.cornell.edu\u002Fnlvr\u002F)。\n\n### 常见调试选项\n由于加载特征需要几分钟时间，代码提供了一个选项，可以使用少量训练数据进行原型测试。\n```bash\n# 使用 512 张图像进行训练：\nbash run\u002Fvqa_finetune.bash 0 --tiny \n# 使用 4096 张图像进行训练：\nbash run\u002Fvqa_finetune.bash 0 --fast\n```\n\n## 预训练\n\n1. 从 MS COCO、Visual Genome、VQA 和 GQA 下载我们聚合的 LXMERT 数据集（总大小约 700MB）。联合答案标签保存在 `data\u002Flxmert\u002Fall_ans.json` 中。\n    ```bash\n    mkdir -p data\u002Flxmert\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Flxmert\u002Fmscoco_train.json -P data\u002Flxmert\u002F\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Flxmert\u002Fmscoco_nominival.json -P data\u002Flxmert\u002F\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Flxmert\u002Fvgnococo.json -P data\u002Flxmert\u002F\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Flxmert\u002Fmscoco_minival.json -P data\u002Flxmert\u002F\n    ```\n\n2. [*如果您已经运行过 [VQA 微调](#vqa)，请跳过此步骤。*] 下载 MS COCO 图像的检测特征。\n    ```bash\n    mkdir -p data\u002Fmscoco_imgfeat\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fmscoco_imgfeat\u002Ftrain2014_obj36.zip -P data\u002Fmscoco_imgfeat\n    unzip data\u002Fmscoco_imgfeat\u002Ftrain2014_obj36.zip -d data\u002Fmscoco_imgfeat && rm data\u002Fmscoco_imgfeat\u002Ftrain2014_obj36.zip\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fmscoco_imgfeat\u002Fval2014_obj36.zip -P data\u002Fmscoco_imgfeat\n    unzip data\u002Fmscoco_imgfeat\u002Fval2014_obj36.zip -d data && rm data\u002Fmscoco_imgfeat\u002Fval2014_obj36.zip\n    ```\n\n3. [*如果您已经运行过 [GQA 微调](#gqa)，请跳过此步骤。*] 下载 Visual Genome 图像的检测特征。\n    ```bash\n    mkdir -p data\u002Fvg_gqa_imgfeat\n    wget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fvg_gqa_imgfeat\u002Fvg_gqa_obj36.zip -P data\u002Fvg_gqa_imgfeat\n    unzip data\u002Fvg_gqa_imgfeat\u002Fvg_gqa_obj36.zip -d data && rm data\u002Fvg_gqa_imgfeat\u002Fvg_gqa_obj36.zip\n    ```\n\n4. 在 MS COCO + Visual Genome 数据集的一个小子集上进行测试：\n    ```bash\n    bash run\u002Flxmert_pretrain.bash 0,1,2,3 --multiGPU --tiny\n    ```\n\n5. 在整个 [MS COCO](http:\u002F\u002Fcocodataset.org) 和 [Visual Genome](https:\u002F\u002Fvisualgenome.org\u002F) 相关数据集（即 [VQA](https:\u002F\u002Fvisualqa.org\u002F)、[GQA](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fdorarad\u002Fgqa\u002Findex.html)、[COCO 字幕](http:\u002F\u002Fcocodataset.org\u002F#captions-2015)、[VG 字幕](https:\u002F\u002Fvisualgenome.org\u002F)、[VG QA](https:\u002F\u002Fgithub.com\u002Fyukezhu\u002Fvisual7w-toolkit)）上运行。这里我们采用简单的单阶段预训练策略（所有预训练任务共 20 个 epoch），而不是论文中的两阶段策略（无图像 QA 的 10 个 epoch 和有图像 QA 的 10 个 epoch）。在 **4 张 GPU** 上，预训练大约需要 **8.5 天** 完成。顺便说一句，我希望我在该项目中的 [经验分享](experience_in_pretraining.md) 能够帮助那些计算资源有限的人。\n    ```bash\n    bash run\u002Flxmert_pretrain.bash 0,1,2,3 --multiGPU\n    ```\n    > 多 GPU：参数 `0,1,2,3` 表示使用 4 张 GPU 来预训练 LXMERT。如果服务器没有 4 张 GPU（对此我深表遗憾），请考虑将 batch-size 减半，或者使用 [NVIDIA\u002Fapex](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fapex) 库来支持半精度计算。代码使用 PyTorch 默认的数据并行机制，因此可以扩展到更多或更少的 GPU。Python 主线程负责数据加载。在 4 张 GPU 上，我们并未发现数据加载成为瓶颈（约 5% 的开销）。\n    >\n    > GPU 类型：我们发现 Titan XP、GTX 2080 和 Titan V 都可以支持这项预训练。然而，GTX 1080 的显存只有 11G，略显不足，因此请将 batch_size 调整为 224（而不是 256）。\n\n6. 我已经用 12 个 epoch **验证了这些预训练命令**。之前过程中的预训练权重可以从 `https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Fgithub_pretrain\u002Flxmert\u002FEpochXX_LXRT.pth` 下载，XX 从 `01` 到 `12`。结果大致相同（由于 epoch 数较少，在下游任务中准确率会低约 0.3%）。\n\n7. 对预训练脚本 `run\u002Flxmert_pretrain.bash` 中参数的解释：\n    ```bash\n    python src\u002Fpretrain\u002Flxmert_pretrain_new.py \\\n        # 预训练任务\n        --taskMaskLM --taskObjPredict --taskMatched --taskQA \\  \n        \n        # 视觉子任务\n        # obj \u002F attr：检测到的对象\u002F属性标签预测。\n        # feat：RoI 特征回归。\n        --visualLosses obj,attr,feat \\\n        \n        # 词语和物体的掩码率\n        --wordMaskRate 0.15 --objMaskRate 0.15 \\\n        \n        # 训练集和验证集\n        # mscoco_nominival + mscoco_minival = mscoco_val2014\n        # visual genome - mscoco = vgnococo\n        --train mscoco_train,mscoco_nominival,vgnococo --valid mscoco_minival \\\n        \n        # 每个编码器的层数\n        --llayers 9 --xlayers 5 --rlayers 5 \\\n        \n        # 从头开始训练（使用初始化权重），而不是加载 BERT 权重。\n        --fromScratch \\\n    \n        # 超参数\n        --batchSize 256 --optim bert --lr 1e-4 --epochs 20 \\\n        --tqdm --output $output ${@:2}\n    ```\n\n\n## 替代数据集和特征下载链接\n所有默认的下载链接都由我们在 [UNC CS 部门](https:\u002F\u002Fcs.unc.edu) 的服务器提供，并托管在我们的 [NLP 小组网站](https:\u002F\u002Fnlp.cs.unc.edu) 上，但网络带宽可能会有限。因此，我们提供了几个通过 Google Drive 和 Baidu Drive 的替代选项。\n\n在线存储中的文件结构与我们的仓库几乎相同，但由于具体政策的不同，存在一些差异。从这些存储中下载数据和特征后，请按照以下示例将其重新组织到 `data\u002F` 文件夹下：\n```\nREPO ROOT\n |\n |-- data                  \n |    |-- vqa\n |    |    |-- train.json\n |    |    |-- minival.json\n |    |    |-- nominival.json\n |    |    |-- test.json\n |    |\n |    |-- mscoco_imgfeat\n |    |    |-- train2014_obj36.tsv\n |    |    |-- val2014_obj36.tsv\n |    |    |-- test2015_obj36.tsv\n |    |\n |    |-- vg_gqa_imgfeat -- *.tsv\n |    |-- gqa -- *.json\n |    |-- nlvr2_imgfeat -- *.tsv\n |    |-- nlvr2 -- *.json\n |    |-- lxmert -- *.json          # 预训练数据\n | \n |-- snap\n |-- src\n```\n\n如果发现有任何缺失，请随时与我们联系！\n\n### Google Drive\n作为从我们的 UNC 服务器下载特征的一种替代方式，您也可以通过以下 Google Drive 链接下载特征：[https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1Gq1uLUk6NdD0CcJOptXjxE6ssY5XAuat?usp=sharing](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1Gq1uLUk6NdD0CcJOptXjxE6ssY5XAuat?usp=sharing)。云端文件夹的结构如下：\n```\nGoogle Drive 根目录\n |-- data                  # 原始数据和图像特征，未压缩\n |    |-- vqa\n |    |-- gqa\n |    |-- mscoco_imgfeat\n |    |-- ......\n |\n |-- image_feature_zips    # 图像特征压缩包（压缩率约 45%）\n |    |-- mscoco_imgfeat.zip\n |    |-- nlvr2_imgfeat.zip\n |    |-- vg_gqa_imgfeat.zip\n |\n |-- snap -- pretrained -- model_LXRT.pth # PyTorch 预训练模型权重。\n```\n注意：压缩包中的图像特征（例如 `mscoco_mgfeat.zip`）与 `data\u002F` 中的特征（即 `data\u002Fmscoco_imgfeat`）完全相同。如果您希望节省网络带宽，可以直接下载这些压缩包，而无需再下载 `data\u002F` 下的 `*_imgfeat` 文件夹。\n\n### 百度网盘\n\n由于[Google Drive](\nhttps:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1Gq1uLUk6NdD0CcJOptXjxE6ssY5XAuat?usp=sharing\n) 并未在全球范围内正式开放，\n我们也在百度网盘（即百度盘）上创建了一个镜像。\n数据集和特征可以通过共享链接\n[https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1m0mUVsq30rO6F1slxPZNHA](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1m0mUVsq30rO6F1slxPZNHA)\n以及提取码 `wwma` 下载。\n\n```\n百度网盘根目录\n |\n |-- vqa\n |    |-- train.json\n |    |-- minival.json\n |    |-- nominival.json\n |    |-- test.json\n |\n |-- mscoco_imgfeat\n |    |-- train2014_obj36.zip\n |    |-- val2014_obj36.zip\n |    |-- test2015_obj36.zip\n |\n |-- vg_gqa_imgfeat -- *.zip.*  # 请阅读该文件夹下的 README.txt\n |-- gqa -- *.json\n |-- nlvr2_imgfeat -- *.zip.*   # 请阅读该文件夹下的 README.txt\n |-- nlvr2 -- *.json\n |-- lxmert -- *.json\n | \n |-- pretrained -- model_LXRT.pth\n```\n\n由于百度网盘不支持超大文件，\n我们将部分特征压缩包拆分成了多个小文件。\n请按照 `baidu_drive\u002Fvg_gqa_imgfeat` 和 `baidu_drive\u002Fnlvr2_imgfeat` 文件夹下的 `README.txt` 说明，\n使用 `cat` 命令将这些小文件重新拼接成完整的特征压缩包。\n\n\n## 代码与项目说明\n- 所有代码均位于 `src` 文件夹中，基础部分在 `lxrt` 中。\n与预训练和微调相关的 Python 文件分别保存在 `src\u002Fpretrain` 和 `src\u002Ftasks` 文件夹内。\n- 我们将包含图像特征的文件夹（如 `mscoco_imgfeat`）与视觉-语言数据集（如 `vqa`, `lxmert`）分开存放，\n因为多个视觉-语言数据集会共享相同的图像。\n- 我们将框架命名为 `lxmert`，而将模型称为 `lxrt`（即语言、跨模态与对象关系Transformer）。\n- 为保持与 `lxrt` 名称的一致性，\n我们用 `lxrXXX` 来表示层数。\n例如，`lxr955`（当前预训练模型中使用的配置）表示一个具有 9 层语言模块、5 层跨模态模块和 5 层对象关系模块的模型。\n若将单模态层视为跨模态层的一半，则总层数为 `(9 + 5) \u002F 2 + 5 = 12`，这与 `BERT_BASE` 的层数相同。\n- 我们在两个跨模态注意力子层之间共享权重。请查看 [`visual_attention` 变量](blob\u002Fmaster\u002Fsrc\u002Flxrt\u002Fmodeling.py#L521)，它同时用于计算 `lang->visn` 和 `visn->lang` 的注意力。（很抱歉，`visual_attention` 这个名称容易引起误解，因为我当时删除了 `lang_attention`。）共享权重主要是为了节省计算资源，同时也（直观地）有助于将视觉和语言特征映射到一个共同的子空间。\n- 框架坐标并未从 [0, 1] 归一化到 [-1, 1]，这看似是一个笔误，但实际上并非如此；对坐标进行归一化并不会影响框编码器的输出结果（无论从数学还是数值计算的角度来看都是如此）。~~（提示：可以考虑位置编码中的 LayerNorm）~~\n\n\n## Faster R-CNN 特征提取\n\n\n我们使用的是“Bottom-Up and Top-Down Attention for Image Captioning and Visual Question Answering”, CVPR 2018”论文中展示的 Faster R-CNN 特征提取方法，\n其代码已在 [Bottom-Up-Attention GitHub 仓库](https:\u002F\u002Fgithub.com\u002Fpeteanderson80\u002Fbottom-up-attention) 上公开。\n该方法基于 [Visual Genome](https:\u002F\u002Fvisualgenome.org\u002F) 数据集进行训练，并且是在特定版本的 [Caffe](https:\u002F\u002Fcaffe.berkeleyvision.org\u002F) 框架上实现的。\n\n\n为了使用这套 Caffe Faster R-CNN 提取特征，我们在 Docker Hub 上公开发布了 `airsplay\u002Fbottom-up-attention` 镜像，其中已预先配置好所有依赖项和库的安装。下方提供了详细的说明和示例。您也可以参考 Bottom-Up Attention GitHub 仓库中的安装指南来搭建工具：[https:\u002F\u002Fgithub.com\u002Fpeteanderson80\u002Fbottom-up-attention](https:\u002F\u002Fgithub.com\u002Fpeteanderson80\u002Fbottom-up-attention)。\n\nBUTD 特征提取器已被广泛应用于许多其他项目中。如果您希望复现该论文中的实验结果，欢迎直接使用我们的 Docker 镜像作为工具。\n\n### 使用 Docker 进行特征提取\n[Docker](https:\u002F\u002Fwww.docker.com\u002F) 是一个易于使用的虚拟化工具，它允许你在无需安装库的情况下即插即用。\n\n用于 Bottom-Up Attention 的预构建 Docker 镜像已在 [Docker Hub](https:\u002F\u002Fhub.docker.com\u002Fr\u002Fairsplay\u002Fbottom-up-attention) 上发布，可以通过以下命令下载：\n```bash\nsudo docker pull airsplay\u002Fbottom-up-attention\n```\n> `Dockerfile` 可以从[这里](https:\u002F\u002Fdrive.google.com\u002Ffile\u002Fd\u002F1KJjwQtqisXvinWm8OORk-_3XYLBHYCIK\u002Fview?usp=sharing)下载，这使得你可以使用其他版本的 CUDA。\n\n拉取 Docker 镜像后，你可以通过以下命令测试运行容器：\n```bash\ndocker run --gpus all --rm -it airsplay\u002Fbottom-up-attention bash\n```\n\n\n如果出现关于 `--gpus all` 的错误，请阅读下一节。\n\n#### Docker GPU 访问\n请注意，参数 `--gpus all` 的作用是将 GPU 设备暴露给 Docker 容器，它需要 Docker >= 19.03 以及 `nvidia-container-toolkit`：\n1. [Docker CE 19.03](https:\u002F\u002Fdocs.docker.com\u002Finstall\u002Flinux\u002Fdocker-ce\u002Fubuntu\u002F)\n2. [nvidia-container-toolkit](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fnvidia-docker)\n\n对于使用较旧版本 Docker 的情况，可以将其更新到 19.03，或者使用 `--runtime=nvidia` 标志代替 `--gpus all`。\n\n#### 示例：为 NLVR2 提取特征\n我们演示如何提取 NLVR2 图像的 Faster R-CNN 特征。\n\n1. 请首先按照 [NLVR2 官方仓库](https:\u002F\u002Fgithub.com\u002Flil-lab\u002Fnlvr\u002Ftree\u002Fmaster\u002Fnlvr2) 中的说明获取图像。\n\n2. 下载预训练的 Faster R-CNN 模型。我们没有使用默认的预训练模型（该模型是在 10 到 100 个框上训练的），而是使用了 ['alternative pretrained model'](https:\u002F\u002Fgithub.com\u002Fpeteanderson80\u002Fbottom-up-attention#demo)，该模型是在 36 个框上训练的。\n    ```bash\n    wget 'https:\u002F\u002Fwww.dropbox.com\u002Fs\u002F2h4hmgcvpaewizu\u002Fresnet101_faster_rcnn_final_iter_320000.caffemodel?dl=1' -O data\u002Fnlvr2_imgfeat\u002Fresnet101_faster_rcnn_final_iter_320000.caffemodel\n    ```\n\n3. 使用以下命令运行 Docker 容器：\n    ```bash\n    docker run --gpus all -v \u002Fpath\u002Fto\u002Fnlvr2\u002Fimages:\u002Fworkspace\u002Fimages:ro -v \u002Fpath\u002Fto\u002Flxrt_public\u002Fdata\u002Fnlvr2_imgfeat:\u002Fworkspace\u002Ffeatures --rm -it airsplay\u002Fbottom-up-attention bash\n    ```\n    `-v` 将主机上的文件夹挂载到 Docker 容器中。\n    > 注意0：如果提示权限相关问题，请在命令前加上 `sudo`。\n    >\n    > 注意1：如果提示 `--gpus all` 相关错误，则表示 GPU 选项未正确设置。请参阅 [Docker GPU 访问](#docker-gpu-access) 获取允许 GPU 访问的说明。\n    >\n    > 注意2：`\u002Fpath\u002Fto\u002Fnlvr2\u002Fimages` 文件夹应包含 `train`、`dev`、`test1` 和 `test2` 子文件夹。\n    >\n    > 注意3：`\u002Fpath\u002Fto\u002Fnlvr2\u002Fimages\u002F` 和 `\u002Fpath\u002Fto\u002Flxrt_public` 均需使用绝对路径。\n\n\n4. 在 **Docker 容器内**提取特征。提取脚本源自 [butd\u002Ftools\u002Fgenerate_tsv.py](https:\u002F\u002Fgithub.com\u002Fpeteanderson80\u002Fbottom-up-attention\u002Fblob\u002Fmaster\u002Ftools\u002Fgenerate_tsv.py)，并由 [Jie Lei](http:\u002F\u002Fwww.cs.unc.edu\u002F~jielei\u002F) 和我进行了修改。\n    ```bash\n    cd \u002Fworkspace\u002Ffeatures\n    CUDA_VISIBLE_DEVICES=0 python extract_nlvr2_image.py --split train \n    CUDA_VISIBLE_DEVICES=0 python extract_nlvr2_image.py --split valid\n    CUDA_VISIBLE_DEVICES=0 python extract_nlvr2_image.py --split test\n    ```\n\n5. 训练集的提取大约需要 5 到 6 小时，而验证集和测试集则分别需要 1 到 2 小时。由于速度较慢，建议如果有多个 GPU，可以并行运行这些任务。可以通过更改 `CUDA_VISIBLE_DEVICES=$gpu_id` 中的 `gpu_id` 来实现。\n\n提取的特征将保存在 Docker 容器外部的 `data\u002Fnlvr2_imgfeat` 目录下的 `train.tsv`、`valid.tsv` 和 `test.tsv` 文件中。我已经验证过，提取的图像特征与我在 [NLVR2 微调](#nlvr2) 中提供的特征一致。\n\n#### 另一个示例：为 MS COCO 图像提取特征\n1. 从 [MS COCO 官方网站](http:\u002F\u002Fcocodataset.org\u002F#download) 下载 MS COCO 的 train2014、val2014 和 test2015 图像。\n\n2. 下载预训练的 Faster R-CNN 模型。\n    ```bash\n    mkdir -p data\u002Fmscoco_imgfeat\n    wget 'https:\u002F\u002Fwww.dropbox.com\u002Fs\u002F2h4hmgcvpaewizu\u002Fresnet101_faster_rcnn_final_iter_320000.caffemodel?dl=1' -O data\u002Fmscoco_imgfeat\u002Fresnet101_faster_rcnn_final_iter_320000.caffemodel\n    ```\n\n3. 使用以下命令运行 Docker 容器：\n    ```bash\n    docker run --gpus all -v \u002Fpath\u002Fto\u002Fmscoco\u002Fimages:\u002Fworkspace\u002Fimages:ro -v $(pwd)\u002Fdata\u002Fmscoco_imgfeat:\u002Fworkspace\u002Ffeatures --rm -it airsplay\u002Fbottom-up-attention bash\n    ```\n    > 注意：`-v` 选项会将容器外的文件夹挂载到容器内的路径。\n    > \n    > 注意1：请使用 MS COCO 图像文件夹 `images` 的 **绝对路径**。该文件夹包含 `train2014`、`val2014` 和 `test2015` 子文件夹。（这是保存 MS COCO 图像的标准方式。）\n\n4. 在 **Docker 容器内**提取特征。\n    ```bash\n    cd \u002Fworkspace\u002Ffeatures\n    CUDA_VISIBLE_DEVICES=0 python extract_coco_image.py --split train \n    CUDA_VISIBLE_DEVICES=0 python extract_coco_image.py --split valid\n    CUDA_VISIBLE_DEVICES=0 python extract_coco_image.py --split test\n    ```\n \n5. 退出 Docker 容器（在 bash 中执行 `exit` 命令）。提取的特征将保存在 `data\u002Fmscoco_imgfeat` 文件夹下。\n\n\n## 参考文献\n如果你觉得这个项目对你有帮助，请引用我们的论文 :)\n\n```bibtex\n@inproceedings{tan2019lxmert,\n  title={LXMERT: Learning Cross-Modality Encoder Representations from Transformers},\n  author={Tan, Hao and Bansal, Mohit},\n  booktitle={Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing},\n  year={2019}\n}\n```\n\n## 致谢\n我们感谢ARO-YIP奖#W911NF-18-1-0336，以及来自Google、Facebook、Salesforce和Adobe的资助支持。\n\n我们感谢[Peter Anderson](https:\u002F\u002Fpanderson.me\u002F)在[Bottom-Up-Attention GitHub仓库](https:\u002F\u002Fgithub.com\u002Fpeteanderson80\u002Fbottom-up-attention)中提供的Faster R-CNN代码和预训练模型。同时，我们也感谢[Hengyuan Hu](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fhengyuan-hu-8963b313b)实现的[PyTorch VQA](https:\u002F\u002Fgithub.com\u002Fhengyuan-hu\u002Fbottom-up-attention-vqa)，我们的VQA实现借鉴了其中预处理的答案。\n\n此外，我们感谢[hugginface](https:\u002F\u002Fgithub.com\u002Fhuggingface)发布的优秀PyTorch代码库[PyTorch Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpytorch-transformers)。\n\n我们特别感谢[Drew A. Hudson](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fdrew-a-hudson\u002F)解答了我们关于GQA规范的所有问题。同时，我们也感谢[Alane Suhr](http:\u002F\u002Falanesuhr.com\u002F)协助我们在NLVR2未公开的测试集上测试LXMERT，并提供了[详细分析](http:\u002F\u002Flil.nlp.cornell.edu\u002Fnlvr\u002FNLVR2BiasAnalysis.html)。\n\n我们还要感谢所有视觉与语言数据集的作者和标注者（例如：[MS COCO](http:\u002F\u002Fcocodataset.org\u002F#home)、[Visual Genome](https:\u002F\u002Fvisualgenome.org\u002F)、[VQA](https:\u002F\u002Fvisualqa.org\u002F)、[GQA](https:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fdorarad\u002Fgqa\u002F)、[NLVR2](http:\u002F\u002Flil.nlp.cornell.edu\u002Fnlvr\u002F)），正是他们的工作使我们能够开发出用于视觉与语言任务的预训练模型。\n\n最后，我们感谢[Jie Lei](http:\u002F\u002Fwww.cs.unc.edu\u002F~jielei\u002F)和[Licheng Yu](http:\u002F\u002Fwww.cs.unc.edu\u002F~licheng\u002F)的有益讨论。我也要特别感谢[Shaoqing Ren](https:\u002F\u002Fwww.shaoqingren.com\u002F)在我于MSRA期间传授的视觉知识，并感谢他帮助审阅我们的代码。如果您发现任何问题，请随时与我们联系，我们非常欢迎您的反馈。\n\nLXR致谢。","# LXMERT 快速上手指南\n\nLXMERT 是一个用于学习跨模态编码器表示的 Transformer 模型，广泛应用于视觉问答（VQA）、GQA 和 NLVR2 等视觉 - 语言任务。本指南将帮助你快速搭建环境并运行微调示例。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**: Linux (推荐 Ubuntu)\n- **Python**: Python 3.x\n- **GPU**: 微调任务建议至少 1 张 GPU；预训练需要 4 张 GPU\n- **磁盘空间**: 约 50GB+（用于存放数据集和图像特征）\n\n### 前置依赖\n安装 Python 依赖包：\n```bash\npip install -r requirements.txt\n```\n\n建议创建虚拟环境以避免冲突：\n```bash\nvirtualenv lxmert_env -p python3\nsource lxmert_env\u002Fbin\u002Factivate\n```\n\n## 安装步骤\n\n### 1. 下载预训练模型\n下载官方提供的预训练权重（约 870 MB）：\n```bash\nmkdir -p snap\u002Fpretrained \nwget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Fmodel_LXRT.pth -P snap\u002Fpretrained\n```\n> **提示**: 如果下载速度慢，可尝试寻找国内镜像源或通过百度网盘等替代方式下载，并将文件放置于 `snap\u002Fpretrained\u002Fmodel_LXRT.pth`。\n\n### 2. 准备数据（以 VQA 任务为例）\n创建数据目录并下载处理好的 JSON 文件：\n```bash\nmkdir -p data\u002Fvqa\nwget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fvqa\u002Ftrain.json -P data\u002Fvqa\u002F\nwget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fvqa\u002Fnominival.json -P  data\u002Fvqa\u002F\nwget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fvqa\u002Fminival.json -P data\u002Fvqa\u002F\n```\n\n下载 MS COCO 图像的 Faster R-CNN 特征（训练集和验证集）：\n```bash\nmkdir -p data\u002Fmscoco_imgfeat\nwget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fmscoco_imgfeat\u002Ftrain2014_obj36.zip -P data\u002Fmscoco_imgfeat\nunzip data\u002Fmscoco_imgfeat\u002Ftrain2014_obj36.zip -d data\u002Fmscoco_imgfeat && rm data\u002Fmscoco_imgfeat\u002Ftrain2014_obj36.zip\nwget https:\u002F\u002Fnlp.cs.unc.edu\u002Fdata\u002Flxmert_data\u002Fmscoco_imgfeat\u002Fval2014_obj36.zip -P data\u002Fmscoco_imgfeat\nunzip data\u002Fmscoco_imgfeat\u002Fval2014_obj36.zip -d data && rm data\u002Fmscoco_imgfeat\u002Fval2014_obj36.zip\n```\n\n## 基本使用\n\n以下以 **VQA 2.0** 任务的微调为例，展示最简使用流程。\n\n### 1. 小规模测试（推荐）\n在正式训练前，建议使用小数据集（512 张图片）验证脚本和模型是否正常运转。\n*   `0`: GPU 编号\n*   `vqa_lxr955_tiny`: 实验名称\n\n```bash\nbash run\u002Fvqa_finetune.bash 0 vqa_lxr955_tiny --tiny\n```\n\n### 2. 正式微调\n确认无误后，使用完整数据集进行训练。预计耗时约 8 小时（4 个 epoch）。\n```bash\nbash run\u002Fvqa_finetune.bash 0 vqa_lxr955\n```\n训练完成后，日志和模型快照将保存在 `snap\u002Fvqa\u002Fvqa_lxr955` 目录下，验证集准确率通常在 **69.7% - 70.2%** 之间。\n\n### 3. 本地验证\n使用训练好的最佳模型在本地验证集上进行测试：\n```bash\nbash run\u002Fvqa_test.bash 0 vqa_lxr955_results --test minival --load snap\u002Fvqa\u002Fvqa_lxr955\u002FBEST\n```\n\n> **注意**: 若需运行 GQA 或 NLVR2 任务，请参照 README 中对应章节下载特定数据集，并将命令中的 `vqa` 替换为 `gqa` 或 `nlvr2`，同时调整相应的超参数。","某电商平台的智能客服团队正在开发一款能同时理解商品图片和用户文字提问的自动问答机器人，以处理如“这件红色连衣裙有口袋吗？”这类复杂的跨模态查询。\n\n### 没有 lxmert 时\n- **模型架构割裂**：开发人员需要分别训练独立的图像识别模型和文本理解模型，再通过繁琐的规则或简单的拼接层强行融合，导致图文语义对齐效果差。\n- **研发周期漫长**：从零开始构建跨模态注意力机制需要深厚的算法功底，团队需花费数周时间复现论文逻辑并调试底层代码，严重拖慢项目进度。\n- **冷启动性能低下**：由于缺乏大规模预训练权重支持，模型在少量标注数据上难以收敛，面对未见过的商品组合时，回答准确率远低于预期。\n- **多任务适配困难**：每新增一个视觉语言任务（如从看图说话切换到视觉推理），都需要重新设计网络结构并从头训练，维护成本极高。\n\n### 使用 lxmert 后\n- **原生跨模态融合**：lxmert 内置的 Transformer 编码器能直接在底层实现图像区域与文本单词的深度交互，精准捕捉“红色”与“连衣裙”之间的细粒度关联。\n- **开箱即用的高效部署**：团队直接加载 lxmert 提供的 870MB 预训练权重，仅需针对电商数据微调 4 个 epoch 即可上线，将研发周期从数周缩短至几天。\n- **卓越的泛化能力**：得益于在大规模数据集上的预训练，lxmert 在零样本或少样本场景下依然保持高准确率，有效解决了新商品上架时的冷启动问题。\n- **统一的多任务框架**：无论是视觉问答还是自然语言视觉推理，lxmert 只需调整少量超参数即可复用同一套代码库，极大降低了后续功能迭代的复杂度。\n\nlxmert 通过提供成熟的跨模态预训练底座，让开发者无需重复造轮子，即可快速构建出具备深度图文理解能力的生产级 AI 应用。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fairsplay_lxmert_7bbd6b7c.png","airsplay","Hao Tan","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fairsplay_ad3bed03.jpg","NLP @ UNC Chapel Hill","UNC","Chapel Hill",null,"http:\u002F\u002Fwww.cs.unc.edu\u002F~airsplay\u002F","https:\u002F\u002Fgithub.com\u002Fairsplay",[82,86],{"name":83,"color":84,"percentage":85},"Python","#3572A5",98.1,{"name":87,"color":88,"percentage":89},"Shell","#89e051",1.9,966,162,"2026-01-26T11:31:36","MIT",4,"Linux","必需 NVIDIA GPU。预训练默认设置需要 4 张 GPU；微调过程指定了 GPU ID（如 bash 脚本中的参数 0），未明确具体型号，但需支持 Faster R-CNN 特征处理及大模型训练，建议显存充足（通常此类任务需 16GB+ 以支持较大 Batch Size）。CUDA 版本未说明。","未说明（但下载图像特征文件需约 17GB-30GB 磁盘空间，处理大规模数据建议大内存）",{"notes":99,"python":100,"dependencies":101},"1. 代码库基于 PyTorch，需通过 'pip install -r requirements.txt' 安装依赖，但 README 未列出具体库版本。2. 预训练默认配置需要 4 张 GPU，耗时约一周；微调 VQA 任务约需 8 小时，GQA 任务约需 16 小时。3. 必须下载外部预处理数据：包括 Faster R-CNN 提取的图像特征（MS COCO 和 Visual Genome 等，总计约 50GB+）以及重新分发的 JSON 数据文件。4. 支持使用 virtualenv 创建 Python 3 虚拟环境。5. 提供了预训练模型（870MB）及不同 epoch 的检查点下载链接。","Python 3",[102,103],"torch (PyTorch)","requirements.txt 中列出的其他依赖",[35,15,105],"其他","2026-03-27T02:49:30.150509","2026-04-12T14:00:08.650575",[],[]]