[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-facebookresearch--atlas":3,"tool-facebookresearch--atlas":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":80,"owner_twitter":80,"owner_website":81,"owner_url":82,"languages":83,"stars":92,"forks":93,"last_commit_at":94,"license":95,"difficulty_score":96,"env_os":97,"env_gpu":98,"env_ram":99,"env_deps":100,"category_tags":109,"github_topics":80,"view_count":23,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":110,"updated_at":111,"faqs":112,"releases":142},2426,"facebookresearch\u002Fatlas","atlas","Code repository for supporting the paper \"Atlas Few-shot Learning with Retrieval Augmented Language Models\",(https\u002F\u002Farxiv.org\u002Fabs\u002F2208.03299)","Atlas 是一个用于支持“检索增强语言模型”研究的开源代码库，核心目标是实现高效的少样本学习。它通过联合预训练一个基于段落的密集检索器和一个编码器-解码器语言模型，让 AI 能够在回答问题时主动“查阅”外部知识库，从而显著提升准确性。\n\nAtlas 主要解决了传统大语言模型依赖海量参数且知识更新滞后的痛点。研究表明，即便仅使用 64 个示例进行训练，Atlas 在自然问答任务上的准确率也能超越参数量大其 50 倍的模型；而在全量数据微调后，它更是刷新了当时的行业最佳纪录。此外，Atlas 证明了文档索引内容可以轻松更新，这意味着模型无需重新训练即可获取最新知识，有效缓解了模型“幻觉”和知识过时的问题。\n\n这款工具特别适合人工智能研究人员、NLP 工程师以及对检索增强生成（RAG）技术感兴趣的开发者使用。它支持从预训练、微调到检索评估的全流程操作，能够处理包括开放域问答、多项选择、事实核查及 KILT 基准测试在内的多种任务。\n\n在技术亮点方面，Atlas 支持高达 110 亿参数的 Fusion-in-Decoder 模型训练，并具备在训练循环中进行端到端检索的能力。它内置了快速的并","Atlas 是一个用于支持“检索增强语言模型”研究的开源代码库，核心目标是实现高效的少样本学习。它通过联合预训练一个基于段落的密集检索器和一个编码器-解码器语言模型，让 AI 能够在回答问题时主动“查阅”外部知识库，从而显著提升准确性。\n\nAtlas 主要解决了传统大语言模型依赖海量参数且知识更新滞后的痛点。研究表明，即便仅使用 64 个示例进行训练，Atlas 在自然问答任务上的准确率也能超越参数量大其 50 倍的模型；而在全量数据微调后，它更是刷新了当时的行业最佳纪录。此外，Atlas 证明了文档索引内容可以轻松更新，这意味着模型无需重新训练即可获取最新知识，有效缓解了模型“幻觉”和知识过时的问题。\n\n这款工具特别适合人工智能研究人员、NLP 工程师以及对检索增强生成（RAG）技术感兴趣的开发者使用。它支持从预训练、微调到检索评估的全流程操作，能够处理包括开放域问答、多项选择、事实核查及 KILT 基准测试在内的多种任务。\n\n在技术亮点方面，Atlas 支持高达 110 亿参数的 Fusion-in-Decoder 模型训练，并具备在训练循环中进行端到端检索的能力。它内置了快速的并行分布式 GPU 检索功能，支持对多达 4 亿个段落进行精确或近似搜索，同时提供了内存优化方案和索引原地快速刷新机制，确保在处理大规模语料库时依然保持高效与灵活。需要注意的是，该仓库目前不再维护，代码按原样提供，适合用于学术研究和技术参考。","# Atlas: Few-shot Learning with Retrieval Augmented Language Models\n\nREPO NO LONGER MAINTAINED, RESEARCH CODE PROVIDED AS IT IS\n\nThis repository contains pre-trained models, corpora, indices, and code for pre-training, finetuning, retrieving and evaluating for the paper [Atlas: Few-shot Learning with Retrieval Augmented Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03299.pdf)\n\nRead our [Atlas blog post](https:\u002F\u002Fresearch.facebook.com\u002Fblog\u002F2023\u002F1\u002Fatlas-few-shot-learning-with-retrieval-augmented-language-models\u002F) for a quick overview of the project and how to run the code with torchrun (slurm free option).\n\nWe jointly pretrain a retrieval-augmented seq2seq language model, comprised of a passage-based dense retriever and a encoder-decoder language model. \nWe perform evaluations on a wide range of tasks, including MMLU, KILT and NaturalQuestions, and\nstudy the impact of the content of the document index, showing that it can easily be updated.\nNotably, Atlas reaches over 45% accuracy on Natural Questions using only 64 examples when supplied with wikipedia index from 2018,\noutperforming a 540B parameters model by 6% despite having 50x fewer parameters.\nAtlas also works very well when finetuned on larger datasets - when finetuned on the full Natural Questions data, Atlas sets a new state-of-the-art of 64%, 8 points higher than the current state of the art.\n\nThis repository supports pretraining and finetuning, for *both* large and small datasets. This repository can be supports the following features:\n* Training large fusion-in-decoder seq2seq models, tested up to 11B parameters\n* Distilling relevance signals from fusion-in-decoder models into dense retrieval models using a variety of different distillation approaches.\n* Performing end-to-end retrieval-augmented training over a user-supplied corpus of passages (tested with up to 400M passages, ~40B words) with retrieval-in-the-training-loop\n* Support for training on Masked-Language modelling, prefix-language modelling, wikipedia section generation, Open-Domain Question Answering, Multiple Choice Question Answering, Fact checking, and KILT (arbitrary seq2seq tasks can also be supported)\n* A fast, parallel distributed GPU-based exact and approximate maximum inner product search for dense vector retrieval\n* Support for fast in-place index refreshes\n* Various memory optimizations and methods for maintaining fast and accurate retrieval while training retrievers in-the-loop.\n* plus more, see the command line arguments or the readme for additional features\n\n## Table of Contents\n\n* [Installation](#installation)\n* [Getting Started and Codebase at a Glance](#getting-started-and-codebase-at-a-glance)\n* [Available Data and Models for download](#available-data-and-Models-for-download)\n  * [Corpora](#corpora)\n  * [Models](#models)\n  * [Pre-built Indices](#prebuilt-indices)\n* [Tasks](#tasks)\n  * [Basic](#base-task)\n  * [Masked Language Modelling](#mlm-task)\n  * [Wikipedia Section Generation](#section-task)\n  * [Open-Domain Question Answering (e.g. NaturalQuestions, TriviaQA, TempLama)](#qa-task)\n  * [Multiple Choice Question Answering (e.g. MMLU)](#mcqa-task)\n  * [Fact Checking](#fever-task)\n  * [KILT](#kilt-task)\n* [Retrieval and Index Details](#retrieval-and-index-details)\n  * [Flat vs Faiss](#flat-vs-faiss)\n  * [Index Saving and Loading](#index-saving-and-loading)\n  * [Strategies for dealing with stale indices](#strategies-for-dealing-with-stale-indices)\n    * [Index Refresh](#strategies-for-dealing-with-stale-indices)\n    * [Over-Retrieve with Reranking](#strategies-for-dealing-with-stale-indices)\n    * [Query-Side Finetuning](#strategies-for-dealing-with-stale-indices)\n  * [Retrieve-only mode](#retrieve-only-mode)\n  * [Using pre-retrieved or cached passages](#using-pre-retrieved-or-cached-passages)\n* [Other features](#other-features)\n  * [Closed book mode](#closed-book-mode)\n  * [Specifying formats](#specifying-formats)\n  * [Implementing your own task](#implementing-your-own-task)\n* [Full list of command line flags](#full-list-of-command-line-flags)\n* [Citing](#citing)\n* [LICENSE](#license)\n  * [Code License:](#code-license)\n  * [Data License:](#data-license)\n\n\n## Installation\n\nThe Atlas codebase uses the following dependencies:\n\n* python 3 (tested with 3.8)\n* fairscale (tested with 0.4.6)\n* transformers (tested with 4.18.0)\n* numpy (tested with 1.22.4)\n* faiss (tested with 1.7.2)\n\nWe recommend installing using conda. The following will install all dependencies:\n```\ngit clone https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fatlas.git\ncd atlas\nconda create --name atlas-env python=3.8\nconda activate atlas-env\nconda install pytorch==1.11.0 cudatoolkit=11.3 -c pytorch\nconda install -c pytorch faiss-gpu=1.7.2 cudatoolkit=11.3\npip install -r requirements.txt\n```\n\n## Getting Started and Codebase at a Glance\n\nThe Atlas repository provides functionality for training and evaluating retrieval-augmented generation models, comprised of an encoder-decoder language model, and dense-vector retriever.\n\u003C!-- We current  functionality for *jointly* training an encoder-decoder language model and retrieval-augmented with a dense retriever. -->\nWe currently support T5 architectures for the encoder-decoder language model and Contriever architectures for the retriever (Support for other architectures is not currently planned, but PRs are welcome).\nAtlas models are comprised of a Contriever retriever and fusion-in-decoder (FID) architecture (which uses T5). You can learn more about the Contriever and FID [here](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FFiD) and [here](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcontriever) respectively if desired, but all required functionality has been reimplemented in this codebase.\n\nThe biggest difference to most standard NLP training codebases is that Atlas performs retrieval on-the-fly, and can refresh its retrieval embeddings index in-place.\nThis is achieved using a custom-designed distributed GPU index, which automatically handles fast and scale-able retrieval.\n \n**A note on how retrieval is accomplished:** \nWhen launching a training or evaluation run, the codebase will first load pretrained models, then each GPU worker will load a shard of the supplied passages to retrieve from -- if there are N GPUs, each will load a shard of 1\u002FN passages. \nEach worker will then embed its shard of the passages using the retriever embedder, and keep the passage embedding shard in GPU memory (and optionally build a FAISS index).\nAt this point, the passage and embedding shards (referred to as \"the index\") can be optionally saved to disk to avoid the need to recompute indices for every run. \nRetrieval is performed in parallel, with each GPU worker performing an exact maximum inner product search for all the queries for its shard. \nMore details on retrieval are given in the [Retrieval and Index Details](#retrieval-and-index-details) section.\n*Note that all of the above is all handled automatically by the codebase*, so users should not need to know or worry too much about how embedding, index refresh or retrieval is accomplished, other than \n1) noting that they can easily retrieve from any set of passages that they like by just passing in paths to suitably-formatted passages on disk (or any saved index) \n2) noting that embedding, index refresh retrieving will get faster with more GPU workers.\n3) Depending on how many GPUs and CPU memory is available, Atlas can support training models with 11B+ parameters and indices of 400M+ vectors, or ~40 billion words (assuming ~100 words a passage)\n\nTraining and Evaluation uses a data-parallel model: for N GPU workers, each processes 1\u002FN of the total mini-batch of data. To save memory at training time, optimizer state and gradients can be sharded using fairscale's ShardedDataParallel. \n\nAll data files (retriever passages and train\u002Fdev\u002Ftest data) should be supplied in the form of [jsonlines](https:\u002F\u002Fjsonlines.org\u002F) (\"jsonl\") files.\nPassages to retrieve from should consist of json-serialized objects with `text` and `title` text fields, one passage per line.\nExample passage files are available for wikipedia (see [corpora](#corpora)).\nTrain\u002Fdev\u002Ftest data files should be json-serialized objects, one instance per line. The name of the fields is task dependent (covered in detail in [Tasks](#tasks)), but e.g. for NaturalQuestions, the required fields are `question` (a question string) and `answers` (a list of reference answer strings)\n\nThe codebase has two entrypoint scripts: `train.py` for training, and `evaluate.py` for test-time evaluation (and [stand-alone retrieval](#retriever-only-mode), if you want).\nYou can list the full Atlas functionality by printing the command-line flags using `python train.py -h` (full output [here](#full-list-of-command-line-flags))\n\n*The easiest way to illustrate the codebase is with an example:*\n\nThe following example shows an example use case: few-shot finetuning and evaluating on NaturalQuestions with Atlas-large (which are also available as a runnable sbatch scripts in `example_scripts\u002Fnq\u002F`), retrieving from a wikipedia dump from 2018 (of about 30M passages)\n\n```bash\n# assumes 4 nodes, each with 8 GPUs\nDATA_DIR=.\u002Fatlas_data\nSIZE=large # lets use large, (slower than base, but still quite fast and accessible, but less accurate than xl or xxl)\n\n# download the NQ data\npython preprocessing\u002Fprepare_qa.py --output_directory ${DATA_DIR}\u002Fdata\u002F\n# download the Wikipedia 2018 corpus\npython preprocessing\u002Fdownload_corpus.py --corpus corpora\u002Fwiki\u002Fenwiki-dec2018 --output_directory ${DATA_DIR} \n# downloads pretrained Atlas-large\npython preprocessing\u002Fdownload_model.py --model models\u002Fatlas\u002F${SIZE} --output_directory ${DATA_DIR}  \n\nport=$(shuf -i 15000-16000 -n 1)\nTRAIN_FILE=\"${DATA_DIR}\u002Fdata\u002Fnq_data\u002Ftrain.64-shot.jsonl\"\nEVAL_FILES=\"${DATA_DIR}\u002Fdata\u002Fnq_data\u002Fdev.jsonl\"\nSAVE_DIR=${DATA_DIR}\u002Fexperiments\u002F\nEXPERIMENT_NAME=my-nq-64-shot-example\nTRAIN_STEPS=30\n\nsrun python train.py \\\n    --shuffle \\\n    --train_retriever \\\n    --gold_score_mode pdist \\ # loss function for retriever (see paper)\n    --use_gradient_checkpoint_reader --use_gradient_checkpoint_retriever\\ # save GPU memory with gradient checkpointing at expense of speed\n    --precision fp32 \\ # use \"bf16\" if supported by your GPUs, fp16 is usually unstable\n    --shard_optim --shard_grads \\ # Save GPU memory using these optimizations\n    --temperature_gold 0.01 --temperature_score 0.01 \\ \n    --refresh_index -1 \\ # for fewshot finetune, refreshing the index (i.e. recomputing the embeddings) is expensive and not really worth it\n    --query_side_retriever_training\\ # instead, for fewshot runs, finetuning only the query-encoder of Contriever works well. Remove this flag to finetune whole retriever\n    --target_maxlength 16 \\ # max length of generation\n    --reader_model_type google\u002Ft5-${SIZE}-lm-adapt \\ # architecture of Atlas\n    --dropout 0.1 --weight_decay 0.01 --lr 4e-5 --lr_retriever 4e-5 --scheduler linear \\ # optimization flags\n    --text_maxlength 512 \\ # max length of question + passage when concatenated\n    --model_path \"${DATA_DIR}\u002Fmodels\u002Fatlas\u002F${SIZE}\" \\ # path to the pretrained Atlas model we just downloaded (pass 'none' to init from plain t5 and Contriever)\n    --train_data \"${DATA_DIR}\u002Fdata\u002Fnq_data\u002Ftrain.64-shot.jsonl\" \\ # path the 64-shot train dataset we just downloaded \n    --eval_data \"${DATA_DIR}\u002Fdata\u002Fnq_data\u002Fdev.jsonl\" \\ # path the NQ dev dataset we just downloaded, to evaluate on when training is done\n    --per_gpu_batch_size 1 \\\n    --n_context 40 \\ # pass the top 40 passages from the retriever to the language model\n    --retriever_n_context 40 \\ # finetune the retriever with the top 40 passages\n    --name ${EXPERIMENT_NAME} \\ # name of experiment (also the name of the directory the logs and models will be saved to) \n    --checkpoint_dir ${SAVE_DIR} \\ # logs and model checkpoints will be saved to ${SAVE_DIR}\u002F${EXPERIMENT_NAME}\n    --eval_freq ${TRAIN_STEPS} \\ # eval after we finish training\n    --log_freq 4 \\ # log stats every 4 training steps. Logs will write to ${SAVE_DIR}\u002F${EXPERIMENT_NAME}\u002Frun.log but will also write tensorboard logs if installed\n    --total_steps ${TRAIN_STEPS} \\ # train for this many steps\n    --warmup_steps 5 \\\n    --save_freq ${TRAIN_STEPS} \\ # for this example, we'll save one checkpoint, after the training is complete\n    --main_port $port \\ # for distributed training\n    --write_results \\ # write predictions - they will get saved in the checkpoint folder, ${SAVE_DIR}\u002F${EXPERIMENT_NAME}\n    --task qa \\ # we're doing the QA task\n    --index_mode flat \\ # don't use faiss, keep index flat (recommended unless using very large indices or very constrained on GPU memory)\n    --passages \"${DATA_DIR}\u002Fcorpora\u002Fwiki\u002Fenwiki-dec2018\u002Ftext-list-100-sec.jsonl\" \"${DATA_DIR}\u002Fcorpora\u002Fwiki\u002Fenwiki-dec2018\u002Finfobox.jsonl\"\\ # pass in the wikipedia passages to index and retrieve from (we use both the text and infoboxes)\n    --save_index_path ${SAVE_DIR}\u002F${EXPERIMENT_NAME}\u002Fsaved_index # save the index we built to this path\n```\n\nThe training script will first embed an index for wikipedia 2018, and then save it under the checkpoint folder (`${SAVE_DIR}\u002F${EXPERIMENT_NAME}`). \nThe training script will then fewshot-finetune an Atlas-large NQ model for 30 steps, retrieving from all of wikipedia 2018. \nThis particular script finetunes the query encoder of the retriever and the FID, whilst keeping the passage encoder frozen (see the paper, or [below](#strategies-for-dealing-with-stale-indices) for further details).\nTh script will then evaluate on the dev set and save the checkpoint. \nYou can inspect the experiment logs at `${SAVE_DIR}\u002F${EXPERIMENT_NAME}\u002Frun.log` and observe a NQ-dev Exact match score of ~38 has been logged (our run was 38.4), and written predictions which can be inspected.\n\nTo evaluate the model, (e.g. on heldout test data) we can use the `evaluate.py` entrypoint script:\n\n```bash\nsrun python evaluate.py \\\n    --name 'my-nq-64-shot-example-evaluation' \\\n    --generation_max_length 16 \\\n    --gold_score_mode \"pdist\" \\\n    --precision fp32 \\\n    --reader_model_type google\u002Ft5-${size}-lm-adapt \\\n    --text_maxlength 512 \\\n    --model_path ${SAVE_DIR}\u002F${EXPERIMENT_NAME}\u002Fcheckpoint\u002Fstep-30 \\ #now, we point this to the model we just trained\n    --eval_data \"${DATA_DIR}\u002Fdata\u002Fnq_data\u002Fdev.jsonl ${DATA_DIR}\u002Fdata\u002Fnq_data\u002Ftest.jsonl\" \\ # lets evaluate on the dev data and the test data this time\n    --per_gpu_batch_size 1 \\\n    --n_context 40 --retriever_n_context 40 \\\n    --checkpoint_dir ${SAVE_DIR} \\\n    --main_port $port \\\n    --index_mode \"flat\"  \\\n    --task \"qa\" \\\n    --load_index_path ${SAVE_DIR}\u002F${EXPERIMENT_NAME}\u002Fsaved_index\\ # rather than re-embed all the wikipedia passages again, lets load them from the index we just saved above\n    --write_results # write the inference results\n```\nThis script will load the model, and since we specified to load a saved index via `--load_index_path`, it will load an index rather than embed from passages as before. \nIt will then evaluate the development and test sets.\nInspecting the saved logs at `${SAVE_DIR}\u002Fmy-nq-64-shot-example-evaluation\u002Frun.log`, we will see the same exact match score for the dev set that we got before, and a test score of ~38 (in our case 38.8 EM).\n\nThe rest of this readme describes data, code and functionality in detail.\n\n## Available Data and Models for download\n\nAtlas's wikipedia corpora, the pretrained models and pre-built wikipedia indices are available for download at this time. \n\nClick to expand:\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"corpora\">Corpora\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\nThe preprocessed wikipedia dumps we use for retrieving and pretraining Atlas can be downloaded as follows:\n\n```bash\npython preprocessing\u002Fdownload_corpus.py --corpus {corpus download key} --output_directory ${DATA_DIR} \n```\nThe above string will download a corpus and unzip it to `${DATA_DIR}\u002F{corpus download key}` \n\nThe available corpora are given below:\n\n| Corpus Name      | Corpus Download Key | Description | Size |\n| ----------- | ----------- | --------|  ---- |\n| enwiki-dec2017      | `corpora\u002Fwiki\u002Fenwiki-dec2017` | Wikipedia dump from Dec 2017, preprocessed into passages       |  30.4M (26.9M text, 2.7M  infobox)| \n| enwiki-dec2018      | `corpora\u002Fwiki\u002Fenwiki-dec2018` | Wikipedia dump from Dec 2018, preprocessed into passages (recommended for NQ, TriviaQA) | 32.1M (28.4M text, 3.7M infobox) |\n| enwiki-aug2019      | `corpora\u002Fwiki\u002Fenwiki-aug2019` |  Wikipedia dump from August 2019, preprocessed into passages       | 33.1M (29.4M text, 3.8M infobox)  |\n| enwiki-dec2020      | `corpora\u002Fwiki\u002Fenwiki-dec2020` |  Wikipedia dump from Dec 2020, preprocessed into passages       | 35.6M (31.5M text, 4.1M infobox) |\n| enwiki-dec2021      | `corpora\u002Fwiki\u002Fenwiki-dec2021` | Wikipedia dump from Dec 2021, preprocessed into passages       | 37.5M (33.1M text, 4.3M infobox) |\n\nPassage files are jsonl formatted, with one passage serialized as a json object per line. By default, each passage should be formatted as follows:\n\n```python\n{\n    \"id\": \"0\", # passages should have a unique id\n    \"title\": \"Orchid\", # should specify the title of the page passage comes from (can be empty string if there's no good title)\n    \"text\": \"Orchids are easily distinguished from other plants, as they share some very evident derived characteristics or synapomorphies. Among these are: bilateral symmetry of the flower (zygomorphism), many resupinate flowers, a nearly always highly modified petal (labellum), fused stamens and carpels, and extremely small seeds.\", # main text of passage\n    \"section\": \"Description\" # Optional, section title, if non empty this field is appended to the title as {title}: {section} by default\n    ... # you can have other fields you want to keep around for ease of analysis, but they wont actually be used\n}\n```\n\nCreating your own passage files to use with Atlas should be straightforward if you follow the above formatting.\n\nWe cannot open-source the common-crawl indices used in the paper at this time.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"models\">Models\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\nWe are open-sourcing pretrained Atlas models at base, large, xl and xxl sizes. These include both the pretrained retriever and reader weights.\nIn addition, we're open-sourcing our strongest-performing fully-finetuned NaturalQuestions Atlas models, for users who want to perform state-of-the-art QA inference (or finetune them on other QA tasks).\nModels can be downloaded as follows:\n\n```bash\npython preprocessing\u002Fdownload_model.py --model {model download key} --output_directory ${DATA_DIR} \n```\n\nThis will download the requested model to `${DATA_DIR}\u002F{model download key}`, and it can then be used in scripts by passing `${DATA_DIR}\u002F{model download key}` to `--model_path`.\nThe following table details the available models:\n\n| Model | Model Download Key | Description | Parameters (reader \u002F retriever) |\n| ----------- | ----------- | --------| ----|\n| Atlas-xxl | `models\u002Fatlas\u002Fxxl` | Pretrained Atlas XXL model | 11B \u002F 110M |\n| Atlas-xl | `models\u002Fatlas\u002Fxl` | Pretrained Atlas XL model | 3B \u002F 110M |\n| Atlas-large | `models\u002Fatlas\u002Flarge` | Pretrained Atlas large model | 770M \u002F 110M |\n| Atlas-base | `models\u002Fatlas\u002Fbase` | Pretrained Atlas base model | 220M \u002F 110M |\n| NQ-finetuned Atlas-xxl | `models\u002Fatlas_nq\u002Fxxl` |Atlas XXL model, finetuned on Natural Question | 11B \u002F 110M |\n| NQ-finetuned Atlas-xl | `models\u002Fatlas_nq\u002Fxl` | Atlas XL model, finetuned on Natural Question | 3B \u002F 110M |\n| NQ-finetuned Atlas-large | `models\u002Fatlas_nq\u002Flarge` | Atlas large model, finetuned on Natural Question | 770M \u002F 110M |\n| NQ-finetuned Atlas-base | `models\u002Fatlas_nq\u002Fbase` |Atlas base model, finetuned on Natural Question| 220M \u002F 110M |\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"prebuilt-indices\">Pre-built Indices\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\nAtlas will automatically build an index if none is provided. This is convenient, but can take a long time, especially with fewer GPU workers, or if the index is very large.\n\nWe have therefore made precomputed indices available for download for the wiki-dec2018 corpus for the pretrained Atlas checkpoints, and for nq-finetuned Atlas checkpoints\n\nThese can be downloaded as follows :\n```bash\npython preprocessing\u002Fdownload_index.py --index {index download key} --output_directory ${DATA_DIR} \n```\n\nThe above script will download the requested pretrained index and save them to `${DATA_DIR}\u002F{index download key}`. \nThey can then be used in training or evaluation by passing them to `--load_index_path`. \nMore details on index saving and loading are given in [Retrieval and Index Details](#retrieval-and-index-details). \nThe following indices are available for download:\n\n| Index  | Index Download Key | Corresponding Model |  Description |\n| --------| ------| --------| ------|\n| Atlas XXL wiki-dec2018 index | `indices\u002Fatlas\u002Fwiki\u002Fxxl` | `models\u002Fatlas\u002Fxxl` | Precomputed index for the wiki-dec2018 corpus for the pretrained Atlas-xxl model |\n| Atlas XL wiki-dec2018 index | `indices\u002Fatlas\u002Fwiki\u002Fxl` | `models\u002Fatlas\u002Fxl` | Precomputed index for the wiki-dec2018 corpus for the pretrained Atlas-xl model |\n| Atlas large wiki-dec2018 index | `indices\u002Fatlas\u002Fwiki\u002Flarge` | `models\u002Fatlas\u002Flarge` | Precomputed index for the wiki-dec2018 corpus for the pretrained Atlas-large model |\n| Atlas base wiki-dec2018 index | `indices\u002Fatlas\u002Fwiki\u002Fbase` | `models\u002Fatlas\u002Fbase` | Precomputed index for the wiki-dec2018 corpus for the pretrained Atlas-base model |\n| Atlas-nq XXL wiki-dec2018 index | `indices\u002Fatlas_nq\u002Fwiki\u002Fxxl` | `models\u002Fatlas_nq\u002Fxxl` | Precomputed index for the wiki-dec2018 corpus for the natural-questions-finetuned Atlas xxl model |\n| Atlas-nq XL wiki-dec2018 index | `indices\u002Fatlas_nq\u002Fwiki\u002Fxl` | `models\u002Fatlas\u002Fxl` | Precomputed index for the wiki-dec2018 corpus for the natural-questions-finetuned Atlas xl model |\n| Atlas-nq large wiki-dec2018 index | `indices\u002Fatlas_nq\u002Fwiki\u002Flarge` | `models\u002Fatlas\u002Flarge` | Precomputed index for the wiki-dec2018 corpus for the natural-questions-finetuned Atlas large model |\n| Atlas-nq base wiki-dec2018 index | `indices\u002Fatlas_nq\u002Fwiki\u002Fbase` | `models\u002Fatlas\u002Fbase` | Precomputed index for the wiki-dec2018 corpus for the natural-questions-finetuned Atlas base model |\n\u003C\u002Fdetails>\n\n## Tasks\n\nAtlas can train (or evaluate) on any supervised learning task which can be formulated in a \"seq2seq\" format, where there is a sequence of 1 or more tokens comprising an input *query* and a sequence of 1 or more tokens comprising an output *target*.\nFor example, a query might be a question, `Where is the Bermuda Triangle?`, and a target might be the answer to that question, `Western part of the North Atlantic Ocean`.\nThis way of modelling will be familiar to users of models like T5 or BART. Anywhere these models could be used, Atlas can be used too, using the exact same data: Atlas will learn to retrieve passages from its retrieval index by itself - annotations for associating passages to (`query`, `target`) pairs are not used.\n\nThe Atlas codebase configures what task it is doing, and what evaluation metrics to call using the `--task` command line argument. \nWe have implemented a `base` task, with only the most basic support for seq2seq training, but provide more fully-featured functionality for Masked Language Modelling (`mlm`), Language Modelling (`lm`), Wikipedia section generation (`section`), Open-domain QA (`QA`), Multiple choice QA (`multiple_choice`), fact checking (`fever`), and the KILT suite (`kilt`), \nAll tasks expect input data formatted as jsonl format, but the specific field names are task specific. Some tasks have additional command line args, and specialized evaluation.\nAdding new tasks is straightforward, and described [here](#defining-your-own-task).\n\nThe tasks are described in more detail below, and most have example commands in `examples\u002F{task}\u002F` (click to expand).\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"base-task\">Base Task\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\n\nThis is the most basic task available, and is probably not the best option for you, especially if your task closely resembles one the other implemented tasks.\n\nSpecify this task by passing `--task base` to either `train.py` or `evaluate.py` \n\nTrain\u002Fvalidation\u002Ftest data for this task should consist of jsonl files, which should be passed to `train.py` or `evaluate.py` space-separated lists to `--train_data train_file_1.jsonl train_file_2.jsonl`, and `--eval_data eval_file_1.jsonl eval_file_2.jsonl` etc.\nThis task expects input files to have a `query` field with the input query string and a `target` field with the output query string, e.g.:\n\n```json\n{\"query\": \"input to Atlas\", \"target\": \"desired generation from Atlas\"}\n```\n\nThe evaluation loop will calculate evaluation loss and the fraction of eval data examples where Atlas generates an output that exactly matches the target.\nIf you pass `--write_results` to the script, Atlas predictions on the eval data will be written to the save checkpoint directory with the following format:\n\n```json\n{\"query\": \"input to Atlas\", \"answers\": [\"desired generation from Atlas\"], \"generation\": \"Atlas's prediction for the query\", \"passages\": [\"list of retrieved passages\"]}\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"mlm-task\">Masked Language Modelling\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\nThe Masked Language modelling task implements the Masked Language Modelling pretraining task as introduced by [T5](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683).\nThis is the task we use to pretrain the main Atlas in the paper.\n\nSpecify this task by passing `--task mlm` to `train.py`.\n\nTrain\u002Fvalidation\u002Ftest data for this task should consist of jsonl files, which should be passed to `train.py`  as `--train_data train_file_1.jsonl train_file_2.jsonl`, and `--eval_data eval_file_1.jsonl eval_file_2.jsonl` etc.\nThese files should be comprised of JSON objects with the following format:\n```python\n{\n  \"text\": \"text passage to apply noise to and train to de-noise\",\n  \"id\": \"unique id of text passage\"\n  ... # you can have other fields you want to keep around for ease of analysis, but they wont actually be used\n}\n```\nThe intention is that the same files that you use for the retrieval corpus, (passed to `--passages`) can be used as training data.\nThe task will apply the T5 noise function to `text` field, to automatically create inputs and target generations.\n\nThe MLM task will prevent Atlas from retrieving the passage that it is trying to de-noise. It does this by filtering out any passage from retrieved results which have same `id` field as the instance Atlas is de-noising. \nThis functionality is important if the de-noising training data and the passages Atlas is retrieving from are the same corpus.\n\nThis task has the following task specific args:\n```\n  --mlm_noise_density MLM_NOISE_DENSITY\n      how much of an input text should be masked by masking spans (default: 0.15)\n  --mlm_mean_noise_span_length MLM_MEAN_NOISE_SPAN_LENGTH\n      average length of an MLM masking span (default: 3)\n  --min_words_per_lm_instance MIN_WORDS_PER_LM_INSTANCE\n      Instances with fewer than min_words_per_lm_instance instances will be skipped for MLM\u002FLM\u002FSection generation (default: None)\n```\n\nIf you pass `--write_results`, Atlas will write its mask-filling predictions to file.\n\nAtlas will log the following evaluation metrics for MLM during its evaluation loop: \n* `eval_loss`: evaluation reader loss of generated mlm mask-fill spans\n* `accuracy`: fraction of perfectly de-noised mask-fill spans\n* `f1`: token f1 fraction of correct de-noised mask-fill spans\n* `rouge_1`: rouge 1 score of generated mask-fill spans relative to the gold reference masked spans\n* `rouge_2`: rouge 2 score of generated mask-fill spans relative to the gold reference masked spans\n* `rouge_L`: rouge L score of generated mask-fill spans relative to the gold reference masked spans\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"lm-task\">Language Modelling\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\nAtlas can be trained to do Left-to-Right Language Modeling by passing `--task lm` to `train.py`.\n\nTrain\u002Fvalidation\u002Ftest data for this task should consist of jsonl files, which should be passed to `train.py`  as `--train_data train_file_1.jsonl train_file_2.jsonl`, and `--eval_data eval_file_1.jsonl eval_file_2.jsonl` etc.\nThese files should be comprised of JSON objects with the following format:\n```python\n{\n  \"text\": \"text passage to train Atlas to generate\",\n  \"id\": \"unique id of text passage\"\n  ... # you can have other fields you want to keep around for ease of analysis, but they wont actually be used\n}\n```\nThe intention is that the same files that you use for the retrieval corpus, (passed to `--passages`) can be used as training data.\nThe task will preprocess the `text` field automatically, dividing it into two random segments - the left part serves as conditioning context, and the right part is the text the Atlas model will be trained to generate as a continuation.\n\nThe LM task will prevent Atlas from retrieving the same passage that it is trying to generate. It does this by filtering out any passage from retrieved results which have same `id` field as the instance Atlas is generating. \nThis functionality is important if the de-noising training data and the passages Atlas is retrieving from are the same corpus.\n\nThis task has the following task specific args:\n```\n  --min_words_per_lm_instance MIN_WORDS_PER_LM_INSTANCE\n      Instances with fewer than min_words_per_lm_instance instances will be skipped for  MLM\u002FLM\u002FSection generation (default: None)\n  --min_lm_context_ratio MIN_LM_CONTEXT_RATIO\n      Splits text into two segments for language modelling.' 'Left segment is conditioning context, right segment is for generating.' 'The left segment must be more than min_lm_context_ratio of\n      the right segment (default: 0.5)\n  --max_lm_context_ratio MAX_LM_CONTEXT_RATIO\n      Splits text into two segments for language modelling.' 'Left segment is conditioning context, right segment is for generating.' 'The left segment must be less than max_lm_context_ratio\n      of the right segment (default: 0.5)\n```\n\nIf you pass `--write_results`, Atlas will write its lm predictions to file.\n\nAtlas will log the following evaluation metrics for LM during its evaluation loop: \n* `eval_loss`: evaluation reader loss of continuations for the reference data\n* `accuracy`: fraction of perfectly predicted continuations\n* `f1`: token f1 fraction of correct generated continuations\n* `rouge_1`: rouge 1 score of generated continuations relative to the gold reference continuations\n* `rouge_2`: rouge 2 score of generated continuations relative to the gold reference continuations\n* `rouge_L`: rouge L score of generated continuations relative to the gold reference continuations\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"section-task\">Wikipedia Section Generation\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\nAtlas can be trained to generate the text of a wikipedia passage given its title and section title, by passing  `--task section` to `train.py`.\n\nTrain\u002Fvalidation\u002Ftest data for this task should consist of jsonl files, which should have the form of the `text-list-100-sec.jsonl` files in the wikipedia dumps.\nThese can be obtained by following the instructions in [Available Data and Models for download](#available-data-and-Models-for-download), for example the training file: `enwiki-dec2018\u002Ftext-list-100-sec.jsonl`.\nThese files should be comprised of JSON objects, one per line, with the following format:\n```json\n{\n  \"id\": \"3793043\", \n  \"title\": \"Bermuda Triangle\",\n  \"section\": \"Compass variations\",\n  \"text\": \" Compass problems are one of the cited phrases in many Triangle incidents. While some have theorized that unusual local magnetic anomalies may exist in the area, such anomalies have not been found. Compasses have natural magnetic variations in relation to the magnetic poles, a fact which navigators have known for centuries.\"\n}\n```\nThe task will automatically format the input query to the model as \"{Title}, {Section}\" - e.g. in this example, the input to Atlas will be constructed as `Bermuda Triangle, Compass Variations`. The output will be the `text` field of the example.\nThe `section` task will prevent Atlas from retrieving the same passage that it is trying to generate. It does this by filtering out any passage from retrieved results which have same `id` field as the instance Atlas is generating. \n\nThis task has the following task specific args:\n```\n  --min_words_per_lm_instance MIN_WORDS_PER_LM_INSTANCE\n      Instances with fewer than min_words_per_lm_instance instances will be skipped for MLM\u002FLM\u002FSection generation (default: None)\n```\nIf you pass `--write_results`, Atlas will write its generated predictions for the text for Wikipedia sections to file.\n\nAtlas will log the following evaluation metrics for `section` during its evaluation loop: \n* `eval_loss`: evaluation reader loss of continuations for the reference data\n* `accuracy`: fraction of perfectly predicted continuations\n* `f1`: token f1 fraction of correct generated continuations\n* `rouge_1`: rouge 1 score of generated continuations relative to the gold reference continuations\n* `rouge_2`: rouge 2 score of generated continuations relative to the gold reference continuations\n* `rouge_L`: rouge L score of generated continuations relative to the gold reference continuations\n\n\u003C\u002Fdetails>\n\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"qa-task\">Open-Domain Question Answering (e.g. NaturalQuestions, TriviaQA, TempLama)\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\nAtlas can be trained to answer open-domain QA questions by passing `--task qa` to `train.py` or `evaluate.py`.\nThere is a worked example of QA in the [Getting Started and Codebase at a Glance](#getting-started-and-codebase-at-a-glance) section.\nWe use this task for the NaturalQuestions, TriviaQA and TempLama datasets in the paper.\n\nTrain\u002Fvalidation\u002Ftest data for this task should consist of jsonl files, which should be passed to `train.py`  as `--train_data train_file_1.jsonl train_file_2.jsonl`, and `--eval_data eval_file_1.jsonl eval_file_2.jsonl` etc.\nFiles should have one JSON instance per line with the following format:\n```python\n{\n  \"question\": \"where is the bermuda triangle\",\n  \"answers\": [\"Western part of the North Atlantic Ocean\"],\n   ... # you can have other fields you want to keep around for ease of analysis, but they wont actually be used\n}\n```\nThe question will be formatted according to the task specific argument `--qa_prompt_format`, which defaults to `question: {question} answer: \u003Cextra_id_0>`.\nFor example above, the question would be automatically formatted as input queries to Atlas as `question: where is the bermuda triangle answer: \u003Cextra_id_0>`.\nThe supervision target is obtained from the `target` field. If this field does not exist, the supervision target will get selected at random from the available answers in the `answers` field, and formatted as `\u003Cextra_id_0> {answer}`.\n\nIf you pass `--write_results`, Atlas will write its predicted answers to file.\n\nAtlas will log the following evaluation metrics for open domain QA during its evaluation loop: \n* `eval_loss`: evaluation reader loss of evaluation answers.\n* `exact_match`: Open-domain QA exact match score of generated answers\n* `f1`: Open-domain QA F1 score of generated answers\n\n#### Natural Questions & TriviaQA\n\nYou can download the NaturalQuestions and TriviaQA data by calling:\n\n```bash\npython preprocessing\u002Fprepare_qa.py --output_directory ${DATA_DIR} \n```\n\nwhich will download `train.jsonl`, `train.64-shot.jsonl` (the fewshot training dataset we use), `dev.jsonl` and `test.jsonl` to `${DATA_DIR}\u002Fdata\u002Fnq_data` and `${DATA_DIR}\u002Fdata\u002Ftriviaqa_data`.\n\nExample scripts for running fewshot and standard finetuning and evaluation with a wikipedia index for NQ can be found in `examples\u002Fnq`. This script can be used for TriviaQA by swapping the train\u002Fdev\u002Ftest files.\n\n#### TempLama\n\nWe defined a cloze-question answering task for assessing index faithfulness and temporal transfer, derived from the TempLAMA dataset.\n\nYou can download the TempLAMA data and create and format our derived dataset by calling the following script:\n\n```bash\npython preprocessing\u002Fprepare_templama.py --output_directory ${DATA_DIR} \n```\n\nwhich will create the files  `temp_lama.train.2017.jsonl`, `temp_lama.valid.2017.jsonl`, `temp_lama.test.2017.jsonl`, `temp_lama.train.2020.jsonl`, `temp_lama.valid.2020.jsonl`, `temp_lama.test.2020.jsonl` under `${DATA_DIR}\u002Fdata\u002Ftemplama_data\u002F`.\nThese files will contain cloze questions, with answers specific to that year. \n\nExample scripts for running training and evaluation for TempLama can be found at `examples\u002Ftemplama`. (note the use of `qa_prompt_format {question}`, which switches off the automatic QA prompt formatting used for TriviaQA and NQ)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"mcqa-task\">Multiple Choice Question Answering (e.g. MMLU)\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\nAtlas can be trained to answer multiple choice questions by passing `--task multiple_choice` to `train.py` or `evaluate.py`.\nWe use this task for our experiments with MMLU.\n\nTrain\u002Fvalidation\u002Ftest data for this task should consist of jsonl files, which should be passed to `train.py`  as `--train_data train_file_1.jsonl train_file_2.jsonl`, and `--eval_data eval_file_1.jsonl eval_file_2.jsonl` etc.\nFiles should have one JSON instance per line with the following format:\n```python\n{\n  \"question\": \"Which of the following is the body cavity that contains the pituitary gland?\", \n  \"options\": {\n    \"A\": \"Abdominal\",\n    \"B\": \"Cranial\",\n    \"C\": \"Pleural\", \n    \"D\": \"Spinal\"\n    ... # you can have more (or fewer) answer options as long as they have alphabetically consecutive upper case letter keys, starting at A\n  }, \n  \"answer\": \"B\",\n  ... # you can have other fields you want to keep around for ease of analysis, but they wont actually be used\n}\n```\nThese will get automatically formatted into input queries for Atlas of the form `question: {question} answers: (A) {options['A']} (B) {options['B']} (C) {options['C']} (D) {options['D']} Answer: \u003Cextra_id_0>`, with target generations of the format `\u003Cextra_id_0> {answer letter}`.\nThe example above would get formatted to: `question: {Which of the following is the body cavity that contains the pituitary gland? answers: (A) Abdominal (B) Cranial (C) Pleural (D) Spinal Answer: \u003Cextra_id_0>`, with the target generation `{extra_id_0} B`.\n\n\nMultiple-Choice QA has the following task specific args:\n```\n  --multiple_choice_num_options\n      How many choice options for multiple choice QA (MMLU is 4) (default: 4)\n  --multiple_choice_train_permutations {single,cyclic,all}\n      Whether to train with answer order permutations When training on multiple choice (e.g. MMLU). Can improve results by de-biasing models's preferences for arbitrary answer orderings. Recommend\n      training with 'all'. single: no permutations. cyclic: cyclic permutations. all: all possible answer order permutations' (default: single)\n  --multiple_choice_eval_permutations {single,cyclic,all}\n      Whether to evaluate with answer order permutations for multiple choice (e.g. MMLU). Can improve results by de-biasing models's preferences for arbitrary answer orderings. Best results with\n      'all' but very slow. 'cyclic' is a good compromise. single: no permutations. cyclic: cyclic permutations. all: all possible answer order permutations' (default: single)\n```\n\nThe permutation options will automatically duplicate the inputs, but with the answer orders permuted (e.g. With \"A\" now being \"cranial\", \"B\" being \"pleural\" etc.)\nThis improves results for when we have very small amounts of supervised data (or zeroshot). \nThe code will automatically marginalize across results for evaluation permutations for you, in the case you use --multiple_choice_eval_permutations option `cyclic` or `all`.\nMore details on the permutation de-biasing can be found in the appendix of [Atlas: Few-shot Learning with Retrieval Augmented Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03299.pdf).\n\nIf you pass `--write_results`, Atlas will write its predicted answers to file, with the following format:\n\n```json\n{\n  \"question\": \"the prompt-template applied input\",\n  \"generation\": \"answer letter choice with highest probability after marginalizing across permutations\",\n  \"choice_probs\": \"the probability of each answer choice (normalized over total answer options)\",\n  \"all_probs\": \"the un-marginalized answer probabilities from all the answer order permutations\",\n  \"permutations\": [\"the list of prediction objects for each permutation of the answer ordering\"]\n}\n```\n\n#### MMLU\n\nA dedicated ReadMe is available for running MMLU experiments [here](.\u002Fexample_scripts\u002Fmmlu\u002FREADME_MMLU.md). \nThere is a tool to download and preprocess the MMLU data, and example scripts for running each of the experimental settings that we explore with MMLU are available `examples\u002Fmmlu`.\nThese are documented in detail in the MMLU Dedicated Readme.\n\n\u003C\u002Fdetails>\n\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"fever-task\">FEVER Fact Verification\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\nAtlas can be trained to classify textual claims as \"SUPPORTED\", \"REFUTED\" or \"NOT_ENOUGH_INFO\" by a corpus, such as for the FEVER task  by using `--task fever` to `train.py` or `evaluate.py`.\n\t\nYou can download the FEVER data by calling the following script:\n\n```bash\npython preprocessing\u002Fprepare_fever.py --output_directory ${DATA_DIR} \n```\n\nTrain\u002Fvalidation\u002Ftest data for this task should consist of jsonl files, which should be passed to `train.py`  as `--train_data train_file_1.jsonl train_file_2.jsonl`, and `--eval_data eval_file_1.jsonl eval_file_2.jsonl` etc.\nFiles should have one JSON instance per line with the following format:\n\n```python\n{\n  \"claim\": \"the claim to assess\", \n  \"label\": \"either 'SUPPORTS', 'REFUTES' or 'NOT ENOUGH INFO'\",\n   ... # you can have other fields you want to keep around for ease of analysis, but they wont actually be used\n}\n```\nAtlas will automatically process these instances, and format them for input as `question: {claim} answer: \u003Cextra_id_0>` and the output as `\u003Cextra_id_0> {true, false or maybe}`.\nIf you pass `--write_results`, Atlas will write its predicted labels to file.\nAtlas will log the following evaluation metrics for open domain QA during its evaluation loop: \n\n* `accuracy`:  how many claims were correctly classified by the model.\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"kilt-task\">KILT\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\nAtlas can be trained to perform KILT tasks by using `--task kilt` to `train.py` or `evaluate.py`.\n\nKILT data can be obtained from [here](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FKILT)\n\nTrain\u002Fvalidation\u002Ftest data for this task should consist of jsonl files, which should be passed to `train.py`  as `--train_data train_file_1.jsonl train_file_2.jsonl`, and `--eval_data eval_file_1.jsonl eval_file_2.jsonl` etc.\nFiles should have one JSON instance per line with the following format (i.e. the codebase will accept the KILT format directly):\n```python\n{'id': # original data point id if available otherwise unique id\n 'input': # question \u002F claim \u002F sentence \u002F etc\n 'output': [ # each element might contain an answer, a provenance or both\n    {\n    'answer': # answer in textual form\n    'provenance': [\n        # evidence set for the answer from the KILT ks\n        {\n            'wikipedia_id':  # *mandatory* \n            'title': \n            'section': \n            'start_paragraph_id': \n            'start_character': \n            'end_paragraph_id':\n            'end_character': \n            'bleu_score': # wrt original evidence\n            'meta': # dataset\u002Ftask specific\n        }\n        ] \n      }\n    ]\n 'meta': # dataset\u002Ftask specific\n }\n```\nAtlas will automatically process these instances appropriately, into Atlas] query inputs based on the `input` field and target generations based on the `answer` fields\n\nIf you pass `--write_results`, Atlas will write its predicted labels to file.\n\nAtlas will log the following evaluation metrics for open domain QA during its evaluation loop: \n* `accuracy`:  how often generations exactly match the reference\n* `exact_match`:  how often generations exactly match the reference, with open-domain QA normalization applied\n* `f1`:  the token level f1 score overlap between the generation and reference\n\n\u003C\u002Fdetails>\n\n\n## Retrieval and Index Details\n\nThe following section gives more details on retrieval and indices.\n\nAs briefly mentioned in the introduction, retrieval is handled in the Atlas code by taking advantage of the parallel nature of training modern large neural networks.\nSpecifically, for all modern training on GPUs, training (and inference) requires several GPU workers to be available for parallel computation.\n\nAtlas makes use of this already-existing distributed setup.\nIt shards its retrieval index (passages + embedded passages) into N equally sized shards, with one shard per GPU worker. \nBy default, retrieval is performed entirely using exact search on GPU using pytorch, (no approximate search using FAISS), which is still fast because the search is parallelized across all the GPU workers (assuming there are enough GPUs).\n\nKnowing the mechanics of how retrieval is accomplished should not be needed to run the codebase, but might be useful to adapt the codebase to specific ends.\nWe thus include a brief description below. For simplicity, we'll assume we're performing open-domain question answering, and using a per_gpu_batch_size of 1, and assume we have W GPU workers, and N total passages in the retrieval index.\n\nThere are two high level functions that the retriever needs to do:\n\u003Cdetails>\n\u003Csummary>\n1. Build\u002FRefresh Embeddings\n\u003C\u002Fsummary>\n\nBuilding or refreshing the index involves calculating embeddings for every passage in the retrieval index, which are then saved in memory, which allows for fast computation of maximum inner products\u002Fnearest neighbors when doing retrieval later. \n(Assume we have W GPU workers, and N total passages in the retrieval index)\n\nRecall that each GPU worker has a shard of N\u002FW passages.\nPassage embedding calculations is quite simple, and proceeds as follows:\n* We halt any model training going on\n* Each worker calculates embeddings for its shard of passages in parallel, iterating over them in batches, saving them in large torch tensor (faiss support is also available, but discussed later). \n* When all the workers have finished embedding their shard, we can continue with model training.\n\nN may be very large (10-100M), so embedding all the passages is quite slow. \nHowever, because we can parallelize it across our workers, it can be relatively fast if we have enough GPUS. None-the-less, for large indices, this can still be quite slow.\nWe have functionality to [save indices](#index-saving-and-loading) to disk to avoid excessive index building.\nMoreover, as the retriever gets trained, the cached embeddings become out-of-date, or \"stale\", and need to be recalculated, which incurs even more cost.  See [Strategies for dealing with stale indices](#strategies-for-dealing-with-stale-indices) for ways to reduce\u002Favoid the need for frequent index refreshes.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\n2. Perform Distributed Retrieval\n\u003C\u002Fsummary>\n\nAtlas performs retrieval-in-the-loop: i.e. a retrieval call is used as part of the forward pass.\nHere, we'll briefly the describe the steps in a forward pass of Atlas, including how retrieval is accomplished:\n(Assume we have W GPU workers, and N total passages in the retrieval index, and for simplicity, assume a training per gpu batch size of 1, and a question-answering task).\n\n* Each worker has a question, which it embeds into a query vector.\n* An all-gather is performed, which results in each worker having a copy of all the W query vectors (i.e. a query vector for all the questions in the total minibatch)\n* Each worker then performs a maximum inner product search on GPU for its shard for all the query vectors in the batch.\n* The top K results from each workers shard for each query are then sent back to the GPU that embedded that query, via a gather.\n* This results in each worker with the top W * K results for its query, from which the true top K results are selected.\n* The retrieval is now complete, and a standard distributed-data-parallel forward pass can continue (i.e. run the model forward pass, calculate gradients, aggregate gradients across workers, and update the parameters.)\n\n\u003C\u002Fdetails>\n\n### Flat vs Faiss\n\nThere are two index modes implemented for Atlas. \nBy default, we perform retrieval using an exact search ('Flat') index, where retrieval is performed on GPU using pure pytorch.\nWe also support a [FAISS](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffaiss) mode, which is useful for saving GPU memory for extremely large indices, or where GPU memory is very restricted.\nFAISS is a library for fast approximate nearest neighbor search. Our retrieval is on GPU, so we do not usually require further search acceleration, but faiss can be used for compressing the size of an index in memory, which may be of use for very large indices.\n\nThe mode to use is specified by `--index_mode {\"flat\"|\"faiss\"}`. \nFor most use cases, the `flat` index will be sufficient and likely preferable. \n\nIf using the faiss index, users should specify what kind of faiss index to use, using the following options:\n\n```\n  --faiss_index_type {ivfflat,flat,ivfsq,ivfpq,pq}\n      IVFFlat, IndexFlatIP, IVFScalarQuantizer, IndexPQ or IndexIVFPQ with faiss-gpu (default: flat)\n  --faiss_code_size FAISS_CODE_SIZE\n      Parameter for PQ\u002FSQ quantization (default: None)\n```\n\nA good default if using a faiss index is to use `--faiss_index_type ivfpq --faiss_code_size 16`. This will use an IVF-PQ index with the number of IVF clusters set to the square root of the number of embeddings per shard, and PQ code size of 16. More details on this index structure can be found in the faiss documentation [FAISS](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffaiss).\n\n### Index Saving and Loading\n\nIndices (passage and embeddings shards) can be saved to disk and loaded in, to avoid recomputing them.\nSee [above](#prebuilt-indices) for some downloadable indices.\n\nIndex saving can be switched on using `--save_index_path {path\u002Fto\u002Fdirectory\u002Fsave\u002Findex\u002Fin}`, which will create a directory,  \nand save each worker's embedding shard to index (as a pytorch tensor on disk) and passages shard (as a pickle file).\n\nTo load an index, pass `--load_index_path {path}`, which will load the index at the specified path.\n\nSaving and loading works with both `flat` and `faiss` modes.\n\nIn order to easily load an index when using a different number of workers from the index that created it, we can configure `--save_index_n_shards N`, which will save the index into N shards (for example if we have 32 workers, we can pass `--save_index_n_shards 128` to save the index as 128 shards to disk). \nWhen we try to load the index again, for example with 64 workers, the code will figure out it should load 2 saved files per worker. (Note: this functionality only works with `flat` indices - for faiss indices, you can only load indices where the number of workers is the same as when it was saved to disk).\n\n### Strategies for dealing with stale indices\n\nAs the retriever is trained, the passage embeddings stored in memory become stale. \nThis affects the accuracy of retrieval, and, over long periods of time, may lead to suboptimal training or instability.\nAtlas has three methods that can combat this\n\n1. \u003Cb name=\"#index-refresh\">Index Refresh\u003C\u002Fb>: The simplest and most expensive option is to recompute the embeddings using the up-to-date retriever embedder. The index refresh rate schedule is controlled by the `--refresh_index` argument. format: `startstep-endstep:refreshrate,` e.g. `--refresh_index 0-1000:500,1000-10000:1000` will refresh the index every 500 steps for the first 1000 steps, and then every 1000 steps from step 1000 to 10000. You can also just pass in a single number e.g. `--refresh_index 100` will refresh the index every 100 steps. Pass `--refresh_index -1` to never refresh. We use this setting for large datasets and pretraining. \n2. \u003Cb name=\"#overretrieve-with-reranking\">Over-Retrieve with Reranking\u003C\u002Fb>: Here, instead of refreshing the index, we can retrieve the top L passages (where L > K), and then, rerank these L passages using the up-to-date embedder on-the-fly, and pass the top K of these. This works well if the true top K are indeed contained in the stale top L. To use this pass `--retrieve_with_rerank` and specify `--n_to_rerank_with_retrieve_with_rerank L`. This method can be used in conjunction with index refreshing, to reduce staleness between refreshes.\n3.  \u003Cb name=\"#query-Side-finetuning\">Query-Side Finetuning\u003C\u002Fb>: To avoid stale-ness, we can keep the passage embedder of the retriever fixed, and only train the query embedder. This method will sacrifice retriever performance if there is lots of training data, but works well in few-shot settings. To enable this mode, pass `--query_side_retriever_training`. Note: usually we use parameter sharing for the passage and query encoder of the retriever - this mode is the exception, where we break the parameter tying to keep the passage encoder fixed.\n\n### Retrieve-only mode\n\nAtlas can be used purely in a retrieval mode at evaluation time. \nThis can be useful for users who want a fast, scalable, easy to launch GPU-enabled dense retriever.\n\nIn this mode, (which only works with `evaluate.py`) no reader language model gets loaded, and the script will perform retrieval, and then write retrieval results to file if the `--write_results` flag has been passed.\n\nTo use this mode, pass `--retrieve_only` to `evaluate.py`.\nThere is an example of NaturalQuestions retrieval using this mode in `examples\u002Fnq\u002Fretrieve_only.sh`.\n\n### Using pre-retrieved or cached passages\n\nIn some cases, users may have already performed retrieval and want to cache the retrieved results for their dataset, or know a priori the most relevant passages, and thus do not need to perform retrieval.\n\nIn these cases, Atlas can be forced to use user-specified passages per input instance, rather than retrieve, by 1) passing the `--use_file_passages` flag and 2) including a json field `passages` in the train\u002Feval files they pass in, with the following format (e.g for the `qa` task)\n\n\u003Cdetails>\n\u003Csummary>\n(click to expand to see example)\n\u003C\u002Fsummary>\n\n```python\n{\n  \"question\": \"where is the bermuda triangle\",\n  \"answers\": [\"Western part of the North Atlantic Ocean\"],\n  \"passages\": [\n    {\n      \"text\": \"text of first passage\",\n      \"title\": \"title of  first passage\",\n      \"id\": \"id of first passage\"\n      ... # other fields can be here but wont be used\n    },\n    {\n      \"text\": \"text of second passage\",\n      \"title\": \"title of  second passage\",\n      \"id\": \"id of second passage\"\n    },\n    ... # more passages if you like\n  ]\n}\n```\n\n\u003C\u002Fdetails>\n\n## Other features\n\nThe following are other features that Atlas provides for advanced users:\n\n### Closed book mode\n\nAtlas can be run as a standard non-retrieval-augmented T5 model, often referred to as \"closed-book\" in the literature. This is useful for running baseline experiments, and checking that your model does indeed benefit from retrieval-augmentation for your task. Pass the `--closed_book` argument to do closed-book training and ignore the retrieved passages.\n\n### Specifying formats\n\nFormat strings can be injected for greater formatting control of how the inputs get presented to the Atlas model:\n\n```\n  --encoder_format ENCODER_FORMAT\n    format string for reader's encoder preprocessing (default: \"{query} title: {title} context: {text}\")\n  --retriever_format RETRIEVER_FORMAT\n    format string for retriever's encoder preprocessing (default: \"{title} {text}\")\n```\n\nFor example, passing `--encoder_format \"{query} text: {text}\"` wouldn't pass the retrieved passages' titles to the reader model.\n\n\n### Implementing your own task\n\nTo implement a new task for Atlas, there are two options: the easiest is to preprocess or format your task to be compatible using one of the already implemented tasks (the `base` task should support almost all potential use cases).\n\nThe other is to implement your own task under `src\u002Ftasks\u002Fyour_task_name.py` and import it under `src\u002Ftasks\u002F__init__.py`.\n\nSee the `src\u002Ftasks\u002Fqa.py` for an example. \n\nThe `process` function takes the raw parsed, jsonl-objects passed to --train_data or --eval_data, and should return a dict with `{query: \"query to pass to Atlas\", \"target\": \"target string\", \"passages\": [list of gold retrieved passages, can be empty]}`\n\nThe `evaluate` function takes a predicted generation and references for a task, and return a dict of task-specific evaluation scores, which the codebase will average across evaluation instances.\n\n## Full list of command line flags:\n\n\u003Cdetails>\n\u003Csummary>\nClick to Expand\n\u003C\u002Fsummary>\n\n```\nusage: train.py\u002Fevaluate.py [-h] [--name NAME] [--checkpoint_dir CHECKPOINT_DIR] [--model_path MODEL_PATH] [--per_gpu_batch_size PER_GPU_BATCH_SIZE] [--per_gpu_embedder_batch_size PER_GPU_EMBEDDER_BATCH_SIZE] [--local_rank LOCAL_RANK]\n                [--main_port MAIN_PORT] [--seed SEED] [--log_freq LOG_FREQ] [--eval_freq EVAL_FREQ] [--save_freq SAVE_FREQ] [--train_data TRAIN_DATA [TRAIN_DATA ...]] [--eval_data EVAL_DATA [EVAL_DATA ...]] [--write_results]\n                [--dont_write_passages] [--load_index_path LOAD_INDEX_PATH] [--save_index_path SAVE_INDEX_PATH] [--save_index_n_shards SAVE_INDEX_N_SHARDS] [--index_mode {flat,faiss}] [--faiss_index_type {ivfflat,flat,ivfsq,sq,pq}]\n                [--faiss_code_size FAISS_CODE_SIZE] --reader_model_type\n                {t5-small,t5-base,t5-large,t5-3b,t5-11b,google\u002Ft5-v1_1-base,google\u002Ft5-v1_1-large,google\u002Ft5-v1_1-xl,google\u002Ft5-v1_1-xxl,google\u002Ft5-base-lm-adapt,google\u002Ft5-large-lm-adapt,google\u002Ft5-xl-lm-adapt,google\u002Ft5-xxl-lm-adapt}\n                [--text_maxlength TEXT_MAXLENGTH] [--target_maxlength TARGET_MAXLENGTH] [--n_context N_CONTEXT] [--passages PASSAGES [PASSAGES ...]] [--max_passages MAX_PASSAGES] [--retriever_model_path RETRIEVER_MODEL_PATH]\n                [--retrieve_only] [--train_retriever] [--use_file_passages] [--retriever_n_context RETRIEVER_N_CONTEXT] [--gold_score_mode {evalnormsum,loop,ppmean,emdr,pdist,adist}] [--closed_book]\n                [--temperature_score TEMPERATURE_SCORE] [--temperature_gold TEMPERATURE_GOLD] [--compute_crossattention_stats] [--filtering_overretrieve_ratio FILTERING_OVERRETRIEVE_RATIO]\n                [--freeze_retriever_steps FREEZE_RETRIEVER_STEPS] [--query_side_retriever_training] [--retrieve_with_rerank] [--n_to_rerank_with_retrieve_with_rerank N_TO_RERANK_WITH_RETRIEVE_WITH_RERANK]\n                [--decoder_format DECODER_FORMAT] [--decoder_prompt_format DECODER_PROMPT_FORMAT] [--encoder_format ENCODER_FORMAT] [--retriever_format RETRIEVER_FORMAT] [--generation_max_length GENERATION_MAX_LENGTH]\n                [--generation_min_length GENERATION_MIN_LENGTH] [--generation_length_penalty GENERATION_LENGTH_PENALTY] [--generation_num_beams GENERATION_NUM_BEAMS] [--task {base,mlm,lm,multiple_choice,kilt,section,fever,qa}]\n                [--mlm_noise_density MLM_NOISE_DENSITY] [--mlm_mean_noise_span_length MLM_MEAN_NOISE_SPAN_LENGTH] [--min_words_per_lm_instance MIN_WORDS_PER_LM_INSTANCE] [--min_lm_context_ratio MIN_LM_CONTEXT_RATIO]\n                [--max_lm_context_ratio MAX_LM_CONTEXT_RATIO] [--qa_prompt_format QA_PROMPT_FORMAT] [--multiple_choice_num_options MULTIPLE_CHOICE_NUM_OPTIONS] [--multiple_choice_train_permutations {single,cyclic,all}]\n                [--multiple_choice_eval_permutations {single,cyclic,all}] [--warmup_steps WARMUP_STEPS] [--total_steps TOTAL_STEPS] [--scheduler_steps SCHEDULER_STEPS] [--accumulation_steps ACCUMULATION_STEPS] [--dropout DROPOUT]\n                [--lr LR] [--lr_retriever LR_RETRIEVER] [--clip CLIP] [--scheduler {linear,cosine,fixed}] [--weight_decay WEIGHT_DECAY] [--save_optimizer] [--epsilon EPSILON] [--alpha ALPHA] [--beta2 BETA2]\n                [--refresh_index REFRESH_INDEX] [--shuffle] [--precision {fp16,fp32,bf16}] [--shard_optim] [--shard_grads] [--use_gradient_checkpoint_reader] [--use_gradient_checkpoint_retriever]\n\noptional arguments:\n  -h, --help            show this help message and exit\n  --name NAME           name of the experiment - also used as directory name (default: experiment_name)\n  --checkpoint_dir CHECKPOINT_DIR\n                        models are saved here (default: .\u002Fcheckpoint\u002F)\n  --model_path MODEL_PATH\n                        Path to a pretrained model to initialize from (pass 'none' to init from t5 and contriever) (default: none)\n  --per_gpu_batch_size PER_GPU_BATCH_SIZE\n                        Batch size per GPU\u002FCPU for training. (default: 1)\n  --per_gpu_embedder_batch_size PER_GPU_EMBEDDER_BATCH_SIZE\n                        Embedder's batch size per GPU. (default: 512)\n  --local_rank LOCAL_RANK\n                        For distributed training: local_rank (default: -1)\n  --main_port MAIN_PORT\n                        Main port (for multi-node jobs) (default: -1)\n  --seed SEED           random seed for initialization (default: 0)\n  --log_freq LOG_FREQ   log train stats \u003Clog_freq> steps during training (default: 100)\n  --eval_freq EVAL_FREQ\n                        evaluate model every \u003Ceval_freq> steps during training (default: 500)\n  --save_freq SAVE_FREQ\n                        save model every \u003Csave_freq> steps during training (default: 5000)\n  --train_data TRAIN_DATA [TRAIN_DATA ...]\n                        list of space-separated paths to jsonl-formatted train sets (default: [])\n  --eval_data EVAL_DATA [EVAL_DATA ...]\n                        list of space-separated paths to jsonl-formatted evaluation sets (default: [])\n  --write_results       save evaluation results to file (default: False)\n  --dont_write_passages\n                        if writing results, passages can take up a lot of space, pass this flag not to write passages as part of dumped results (default: False)\n  --load_index_path LOAD_INDEX_PATH\n                        path for loading the index, passage embeddings and passages (default: None)\n  --save_index_path SAVE_INDEX_PATH\n                        path for saving the index and\u002For embeddings (default: None)\n  --save_index_n_shards SAVE_INDEX_N_SHARDS\n                        how many shards to save an index to file with. Must be an integer multiple of the number of workers. (default: 128)\n  --index_mode {flat,faiss}\n                        Use flat torch index or a faiss index for retrieving the k nearest neighbors (default: flat)\n  --faiss_index_type {ivfflat,flat,ivfsq,sq,pq}\n                        IVFFlat, IndexFlatIP, IVFScalarQuantizer, ScalarQuantizer or IndexPQ with faiss-gpu (default: flat)\n  --faiss_code_size FAISS_CODE_SIZE\n                        Parameter for PQ quantization (default: None)\n  --reader_model_type {t5-small,t5-base,t5-large,t5-3b,t5-11b,google\u002Ft5-v1_1-base,google\u002Ft5-v1_1-large,google\u002Ft5-v1_1-xl,google\u002Ft5-v1_1-xxl,google\u002Ft5-base-lm-adapt,google\u002Ft5-large-lm-adapt,google\u002Ft5-xl-lm-adapt,google\u002Ft5-xxl-lm-adapt}\n                        t5 Architecture for reader FID model, e.g. google\u002Ft5-xl-lm-adapt (default: None)\n  --text_maxlength TEXT_MAXLENGTH\n                        maximum number of tokens in input text segments (concatenated question+passage). Inputs longer than this will be truncated. (default: 200)\n  --target_maxlength TARGET_MAXLENGTH\n                        Maximum length of target outputs in tokens when training the model. Targets longer than this will be truncated. No truncation if -1 (default: None)\n  --n_context N_CONTEXT\n                        number of top k passages to pass to reader (default: 1)\n  --passages PASSAGES [PASSAGES ...]\n                        list of paths to jsonl files containing passages to index and retrieve from. Unused if loading a saved index using --load_index_path (default: None)\n  --max_passages MAX_PASSAGES\n                        maximum number of passages to index. -1 to read all passages in passage files (default: -1)\n  --retriever_model_path RETRIEVER_MODEL_PATH\n                        path to contriever model to init from (overridden if passing a value to --model_path (default: facebook\u002Fcontriever)\n  --retrieve_only       Pass this to prevent loading a reader, and only run retrieval evaluation (default: False)\n  --train_retriever     Pass to train retriever as well as reader (default: False)\n  --use_file_passages   uses passages in \"passages\" field in train or eval jsonl files rather than retrieving passages (default: False)\n  --retriever_n_context RETRIEVER_N_CONTEXT\n                        number of top k passages to use to train the retriever with (default: 5)\n  --gold_score_mode {evalnormsum,loop,ppmean,emdr,pdist,adist}\n                        retriever training method. `pdist` is the name used in the paper for `ppmean`. `adist` is the name used in the paper for `evalnormsum` (default: ppmean)\n  --closed_book         Don't use retrieval - reduces to T5. Overrides n_context, n_context_retriever and encoder_format if they are set (default: False)\n  --temperature_score TEMPERATURE_SCORE\n                        softmax temperature for retriever (default: 0.01)\n  --temperature_gold TEMPERATURE_GOLD\n                        softmax temperature for target distribution for retriever distillation (default: 0.01)\n  --compute_crossattention_stats\n  --filtering_overretrieve_ratio FILTERING_OVERRETRIEVE_RATIO\n                        if filtering, over-retrieve the topK by this factor, and then filter out undesirable results. Useful, Set to 1 only if using a task that doesn't filter retrieved results (default: 2)\n  --freeze_retriever_steps FREEZE_RETRIEVER_STEPS\n                        freezes retriever for n steps (default: -1)\n  --query_side_retriever_training\n                        pass to enable query-side finetuning of retriever (unties the parameters of the contriever encoder's passage and query encoders, and freezes the passage encoder. Useful to avoid index refreshes. (default: False)\n  --retrieve_with_rerank\n                        pass this to enable reranking with fresh passage encoder for retriever (default: False)\n  --n_to_rerank_with_retrieve_with_rerank N_TO_RERANK_WITH_RETRIEVE_WITH_RERANK\n                        n passages to rerank when passing --retrieve_with_rerank. Higher is slower but more accurate. Recommend 64-128 (default: 128)\n  --decoder_format DECODER_FORMAT\n                        format for decoder, model will be train on the format and evaluation will be performed with the format contrary to the decoder_prompt_format option (default: None)\n  --decoder_prompt_format DECODER_PROMPT_FORMAT\n                        format for decoder prompting, for instance \"what is the answer to {query}:\" (default: None)\n  --encoder_format ENCODER_FORMAT\n                        format string for reader's encoder preprocessing (default: {query} title: {title} context: {text})\n  --retriever_format RETRIEVER_FORMAT\n                        format string for retriever's encoder preprocessing (default: {title} {text})\n  --generation_max_length GENERATION_MAX_LENGTH\n  --generation_min_length GENERATION_MIN_LENGTH\n  --generation_length_penalty GENERATION_LENGTH_PENALTY\n  --generation_num_beams GENERATION_NUM_BEAMS\n  --task {base,mlm,lm,multiple_choice,kilt,section,fever,qa}\n                        Task performed by the model. Used to setup preprocessing, retrieval filtering, evaluations, etc. (default: None)\n  --mlm_noise_density MLM_NOISE_DENSITY\n                        how much of an input text should be masked by masking spans (default: 0.15)\n  --mlm_mean_noise_span_length MLM_MEAN_NOISE_SPAN_LENGTH\n                        average length of an MLM masking span (default: 3)\n  --min_words_per_lm_instance MIN_WORDS_PER_LM_INSTANCE\n                        Instances with fewer than min_words_per_lm_instance instances will be skipped for MLM\u002FLM\u002FSection Generation (default: None)\n  --min_lm_context_ratio MIN_LM_CONTEXT_RATIO\n                        Splits text into two segments for language modelling.' 'Left segment is conditioning context, right segment is for generating.' 'The left segment must be more than min_lm_context_ratio of the right segment\n                        (default: 0.5)\n  --max_lm_context_ratio MAX_LM_CONTEXT_RATIO\n                        Splits text into two segments for language modelling.' 'Left segment is conditioning context, right segment is for generating.' 'The left segment must be less than max_lm_context_ratio of the right\n                        segment (default: 0.5)\n  --qa_prompt_format QA_PROMPT_FORMAT\n                        How to format question as input prompts when using --task qa (default: question: {question} answer: \u003Cextra_id_0>)\n  --multiple_choice_num_options MULTIPLE_CHOICE_NUM_OPTIONS\n                        How many choice options for multiple choice QA (MMLU is 4) (default: 4)\n  --multiple_choice_train_permutations {single,cyclic,all}\n                        Whether to train with answer order permutations When training on multiple choice (e.g. MMLU). Can improve results by de-biasing models's preferences for arbitrary answer orderings. Recommend training with 'all'.\n                        single: no permutations. cyclic: cyclic permutations. all: all possible answer order permutations' (default: single)\n  --multiple_choice_eval_permutations {single,cyclic,all}\n                        Whether to evaluate with answer order permutations for multiple choice (e.g. MMLU). Can improve results by de-biasing models's preferences for arbitrary answer orderings. Best results with 'all' but very slow.\n                        'cyclic' is a good compromise. single: no permutations. cyclic: cyclic permutations. all: all possible answer order permutations' (default: single)\n  --warmup_steps WARMUP_STEPS\n                        number of learning rate warmup steps (default: 1000)\n  --total_steps TOTAL_STEPS\n                        total number of training steps (default: 1000)\n  --scheduler_steps SCHEDULER_STEPS\n                        total number of step for the scheduler, if None then scheduler_total_step = total_step (default: None)\n  --accumulation_steps ACCUMULATION_STEPS\n                        gradient accumulation (default: 1)\n  --dropout DROPOUT     dropout rate (default: 0.1)\n  --lr LR               learning rate (default: 0.0001)\n  --lr_retriever LR_RETRIEVER\n                        learning rate for retriever (default: 1e-05)\n  --clip CLIP           gradient clipping (default: 1.0)\n  --scheduler {linear,cosine,fixed}\n                        learning rate schedule to use (default: cosine)\n  --weight_decay WEIGHT_DECAY\n                        amount of weight decay to apply in training (default: 0.1)\n  --save_optimizer      Pass flag to save optimizer state in saved checkpoints (default: False)\n  --epsilon EPSILON     adamw epsilon value (default: 1e-06)\n  --alpha ALPHA         adamw alpha value (default: 1.0)\n  --beta2 BETA2         adamw beta2 value (default: 0.999)\n  --refresh_index REFRESH_INDEX\n                        index refresh schedule. format: startstep-endstep:refreshrate,startstep-endstep:refreshrate e.g. --refresh_index 0-100:10,100-1000000:500 will refresh the index every 10 steps for the first 100 steps, and then\n                        every 500 steps from step 100 to 1M.Syntactic Sugar for a fixed schedule: can just pass in a single number e.g. --refresh_index 100 will refresh the index every 100 steps. -1 to never refresh. (default:\n                        0-1000000:1000000)\n  --shuffle             shuffle data for training (default: False)\n  --precision {fp16,fp32,bf16}\n                        numerical precision - recommend bf16 if available, fp16 likely to be unstable for training (default: fp32)\n  --shard_optim         train-time memory optimization: shards optimizer state over available GPUs using sharded data parallel, recommended for larger models (default: False)\n  --shard_grads         train-time memory optimization: shards gradients over available GPUs using sharded data parallel, recommended for larger models (default: False)\n  --use_gradient_checkpoint_reader\n                        use gradient checkpointing in the reader (default: False)\n  --use_gradient_checkpoint_retriever\n                        use gradient checkpointing for retriever (default: False)\n```\n\n\u003C\u002Fdetails>\n\n## Citing\n\nTo cite this work, please use the following bibtex:\n```\n@article{izacard_few-shot_2022,\n\ttitle = {Few-shot {Learning} with {Retrieval} {Augmented} {Language} {Models}},\n\turl = {http:\u002F\u002Farxiv.org\u002Fabs\u002F2208.03299},\n\tpublisher = {arXiv},\n\tauthor = {Izacard, Gautier and Lewis, Patrick and Lomeli, Maria and Hosseini, Lucas and Petroni, Fabio and Schick, Timo and Dwivedi-Yu, Jane and Joulin, Armand and Riedel, Sebastian and Grave, Edouard},\n\tyear = {2022},\n}\n```\n\n## License\n\n### Code License:\n\nThe majority of the Atlas code is licensed under [CC-BY-NC](.\u002FLICENSE), however portions of the project are available under separate license terms: huggingface transformers is licensed under the [Apache 2.0 license](https:\u002F\u002Fraw.githubusercontent.com\u002Fhuggingface\u002Ftransformers\u002Fmain\u002FLICENSE), which covers `src\u002Fmodeling_bert.py` and `src\u002Fmodeling_t5.py`.\n\n### Data License:\n\nThe wikipedia-derived data used in the repository, such as the corpora and indices available from `download_corpus.py` and `download_index.py` are licensed according to [CC-BY-SA](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-sa\u002F3.0\u002F). \n","# Atlas: 基于检索增强语言模型的少样本学习\n\n该仓库已不再维护，研究代码按原样提供。\n\n本仓库包含用于论文《Atlas: 基于检索增强语言模型的少样本学习》（https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03299.pdf）的预训练模型、语料库、索引以及预训练、微调、检索和评估的相关代码。\n\n请阅读我们的 [Atlas 博客文章](https:\u002F\u002Fresearch.facebook.com\u002Fblog\u002F2023\u002F1\u002Fatlas-few-shot-learning-with-retrieval-augmented-language-models\u002F)，以快速了解该项目，并学习如何使用 torchrun 运行代码（无需 Slurm）。\n\n我们联合预训练了一个基于检索增强的序列到序列语言模型，该模型由一个基于段落的密集检索器和一个编码器-解码器语言模型组成。我们在 MMLU、KILT 和 NaturalQuestions 等广泛的任务上进行了评估，并研究了文档索引内容的影响，结果表明索引可以轻松更新。值得注意的是，当使用 2018 年的维基百科索引时，Atlas 在仅使用 64 个示例的情况下，在 Natural Questions 数据集上达到了超过 45% 的准确率，尽管参数量仅为 540B 参数模型的 1\u002F50，仍比后者高出 6 个百分点。此外，Atlas 在更大的数据集上进行微调后表现也非常出色——在完整的 Natural Questions 数据上微调后，Atlas 创下了 64% 的新 SOTA 记录，比当前最佳水平高出 8 个百分点。\n\n本仓库支持大规模和小规模数据集的预训练与微调。它具备以下功能：\n* 训练大型融合解码器序列到序列模型，最高已测试至 11B 参数；\n* 使用多种蒸馏方法将融合解码器模型中的相关性信号提炼到密集检索模型中；\n* 对用户提供的段落语料库（已测试最大达 4 亿段落，约 400 亿词）进行端到端的检索增强训练，并在训练过程中集成检索机制；\n* 支持掩码语言建模、前缀语言建模、维基百科章节生成、开放域问答、多选题问答、事实核查以及 KILT 等任务的训练（也可支持任意序列到序列任务）；\n* 提供基于 GPU 的快速并行分布式精确和近似最大内积搜索，用于密集向量检索；\n* 支持快速就地索引刷新；\n* 多种内存优化技术及在循环中训练检索器时保持快速准确检索的方法；\n* 更多功能，请参阅命令行参数或自述文件。\n\n## 目录\n\n* [安装](#installation)\n* [快速入门与代码库概览](#getting-started-and-codebase-at-a-glance)\n* [可下载的数据与模型](#available-data-and-Models-for-download)\n  * [语料库](#corpora)\n  * [模型](#models)\n  * [预构建索引](#prebuilt-indices)\n* [任务](#tasks)\n  * [基础任务](#base-task)\n  * [掩码语言建模](#mlm-task)\n  * [维基百科章节生成](#section-task)\n  * [开放域问答（如 NaturalQuestions、TriviaQA、TempLama）](#qa-task)\n  * [多选题问答（如 MMLU）](#mcqa-task)\n  * [事实核查（如 FEVER）](#fever-task)\n  * [KILT](#kilt-task)\n* [检索与索引详情](#retrieval-and-index-details)\n  * [Flat 索引 vs Faiss 索引](#flat-vs-faiss)\n  * [索引的保存与加载](#index-saving-and-loading)\n  * [处理过时索引的策略](#strategies-for-dealing-with-stale-indices)\n    * [索引刷新](#strategies-for-dealing-with-stale-indices)\n    * [过度检索结合重排序](#strategies-for-dealing-with-stale-indices)\n    * [查询端微调](#strategies-for-dealing-with-stale-indices)\n  * [仅检索模式](#retrieve-only-mode)\n  * [使用预先检索或缓存的段落](#using-pre-retrieved-or-cached-passages)\n* [其他功能](#other-features)\n  * [闭卷模式](#closed-book-mode)\n  * [指定格式](#specifying-formats)\n  * [实现自定义任务](#implementing-your-own-task)\n* [完整命令行参数列表](#full-list-of-command-line-flags)\n* [引用](#citing)\n* [许可证](#license)\n  * [代码许可证：](#code-license)\n  * [数据许可证：](#data-license)\n\n\n## 安装\n\nAtlas 代码库依赖以下软件包：\n\n* Python 3（已测试 3.8）\n* fairscale（已测试 0.4.6）\n* transformers（已测试 4.18.0）\n* numpy（已测试 1.22.4）\n* faiss（已测试 1.7.2）\n\n我们建议使用 Conda 进行安装。以下命令将安装所有依赖项：\n```\ngit clone https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fatlas.git\ncd atlas\nconda create --name atlas-env python=3.8\nconda activate atlas-env\nconda install pytorch==1.11.0 cudatoolkit=11.3 -c pytorch\nconda install -c pytorch faiss-gpu=1.7.2 cudatoolkit=11.3\npip install -r requirements.txt\n```\n\n## 入门与代码库概览\n\nAtlas 仓库提供了训练和评估检索增强生成模型的功能，该模型由编码器-解码器语言模型和密集向量检索器组成。\n\u003C!-- 我们目前支持*联合*训练编码器-解码器语言模型和使用密集检索器的检索增强模型。 -->\n我们当前支持 T5 架构作为编码器-解码器语言模型，以及 Contriever 架构作为检索器（目前暂不计划支持其他架构，但欢迎提交 PR）。\nAtlas 模型由 Contriever 检索器和融合解码器（FID）架构（使用 T5）组成。如果您有兴趣，可以分别在 [这里](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FFiD) 和 [这里](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcontriever) 了解更多关于 Contriever 和 FID 的信息，不过所有必要的功能都已经在此代码库中重新实现。\n\n与大多数标准 NLP 训练代码库相比，最大的不同在于 Atlas 会实时进行检索，并且可以在原地刷新其检索嵌入索引。\n这是通过一个自定义设计的分布式 GPU 索引实现的，它可以自动处理快速且可扩展的检索。\n\n**关于检索实现方式的说明：**\n在启动训练或评估运行时，代码库首先会加载预训练模型，然后每个 GPU 工作节点会加载一部分要检索的文档——如果有 N 个 GPU，则每个节点会加载 1\u002FN 的文档分片。\n随后，每个工作节点会使用检索器的嵌入模块为其分片中的文档生成嵌入，并将这些文档嵌入保留在 GPU 内存中（也可以选择构建 FAISS 索引）。\n此时，文档和嵌入分片（称为“索引”）可以选择保存到磁盘，以避免每次运行时都重新计算索引。\n检索是并行进行的，每个 GPU 工作节点会对其分片中的所有查询执行精确的最大内积搜索。\n更多关于检索的详细信息请参见 [检索与索引详情](#retrieval-and-index-details) 部分。\n*请注意，以上所有步骤均由代码库自动完成*，因此用户无需过多了解或担心嵌入、索引刷新或检索的具体实现方式，只需注意以下几点：\n1) 用户可以通过传递磁盘上格式正确的文档路径（或已保存的索引），轻松从任何他们喜欢的文档集中进行检索；\n2) 嵌入、索引刷新和检索的速度会随着 GPU 工作节点数量的增加而加快；\n3) 根据可用的 GPU 数量和 CPU 内存大小，Atlas 可以支持训练参数量超过 110 亿的模型，以及包含 4 亿+ 向量的索引，约相当于 400 亿个词（假设每篇文档约 100 个词）。\n\n训练和评估采用数据并行模式：对于 N 个 GPU 工作节点，每个节点处理总 mini-batch 数据的 1\u002FN。为了节省训练时的内存，优化器状态和梯度可以使用 fairscale 的 ShardedDataParallel 进行分片。\n\n所有数据文件（检索器文档以及训练\u002F验证\u002F测试数据）都应以 [jsonlines](https:\u002F\u002Fjsonlines.org\u002F)（“jsonl”）格式提供。\n用于检索的文档应为 JSON 序列化的对象，每行包含 `text` 和 `title` 文本字段，每行对应一篇文档。\n示例文档文件可用于维基百科（详见 [语料库](#corpora)）。\n训练\u002F验证\u002F测试数据文件也应为 JSON 序列化的对象，每行代表一个实例。字段名称取决于具体任务（详见 [任务](#tasks)），例如，在 NaturalQuestions 数据集中，所需的字段是 `question`（问题字符串）和 `answers`（参考答案字符串列表）。\n\n代码库有两个入口脚本：`train.py` 用于训练，`evaluate.py` 用于测试时的评估（以及 [独立检索模式](#retriever-only-mode)，如果需要）。您可以通过运行 `python train.py -h` 打印命令行参数来查看 Atlas 的完整功能（完整输出 [此处](#full-list-of-command-line-flags)）。\n\n*展示代码库最简单的方式就是通过一个示例：*\n\n以下示例展示了如何使用 Atlas-large 在 NaturalQuestions 数据集上进行少样本微调和评估（这些操作也可通过 `example_scripts\u002Fnq\u002F` 中的可运行 sbatch 脚本完成），并从 2018 年的维基百科转储（约 3000 万篇文档）中进行检索。\n\n```bash\n# 假设使用 4 个节点，每个节点配备 8 个 GPU\nDATA_DIR=.\u002Fatlas_data\nSIZE=large # 使用 large 版本（比 base 版本慢，但仍然相当快速且易于使用，只是准确率不如 xl 或 xxl）\n \n# 下载 NaturalQuestions 数据\npython preprocessing\u002Fprepare_qa.py --output_directory ${DATA_DIR}\u002Fdata\u002F\n# 下载 2018 年的维基百科语料库\npython preprocessing\u002Fdownload_corpus.py --corpus corpora\u002Fwiki\u002Fenwiki-dec2018 --output_directory ${DATA_DIR}\n\n# 下载预训练的 Atlas-large 模型\npython preprocessing\u002Fdownload_model.py --model models\u002Fatlas\u002F${SIZE} --output_directory ${DATA_DIR}  \n\nport=$(shuf -i 15000-16000 -n 1)\nTRAIN_FILE=\"${DATA_DIR}\u002Fdata\u002Fnq_data\u002Ftrain.64-shot.jsonl\"\nEVAL_FILES=\"${DATA_DIR}\u002Fdata\u002Fnq_data\u002Fdev.jsonl\"\nSAVE_DIR=${DATA_DIR}\u002Fexperiments\u002F\nEXPERIMENT_NAME=my-nq-64-shot-example\nTRAIN_STEPS=30\n\nsrun python train.py \\\n    --shuffle \\\n    --train_retriever \\\n    --gold_score_mode pdist \\ # 检索器的损失函数（参见论文）\n    --use_gradient_checkpoint_reader --use_gradient_checkpoint_retriever\\ # 使用梯度检查点节省显存，但会降低速度\n    --precision fp32 \\ # 如果你的 GPU 支持，可以使用 \"bf16\"；fp16 通常不稳定\n    --shard_optim --shard_grads \\ # 通过这些优化节省显存\n    --temperature_gold 0.01 --temperature_score 0.01 \\ \n    --refresh_index -1 \\ # 对于少样本微调，刷新索引（即重新计算嵌入）成本很高，且收益不大\n    --query_side_retriever_training\\ # 相反，在少样本场景下，仅微调 Contriever 的查询编码器效果较好。去掉该标志则会微调整个检索器\n    --target_maxlength 16 \\ # 生成的最大长度\n    --reader_model_type google\u002Ft5-${SIZE}-lm-adapt \\ # Atlas 的架构\n    --dropout 0.1 --weight_decay 0.01 --lr 4e-5 --lr_retriever 4e-5 --scheduler linear \\ # 优化参数\n    --text_maxlength 512 \\ # 问题与段落拼接后的最大长度\n    --model_path \"${DATA_DIR}\u002Fmodels\u002Fatlas\u002F${SIZE}\" \\ # 刚才下载的预训练 Atlas 模型路径（传入 'none' 可从纯 T5 和 Contriever 开始训练）\n    --train_data \"${DATA_DIR}\u002Fdata\u002Fnq_data\u002Ftrain.64-shot.jsonl\" \\ # 刚才下载的 64 抽样训练数据集路径\n    --eval_data \"${DATA_DIR}\u002Fdata\u002Fnq_data\u002Fdev.jsonl\" \\ # 刚才下载的 NQ 验证集路径，用于在训练完成后评估\n    --per_gpu_batch_size 1 \\\n    --n_context 40 \\ # 将检索器返回的前 40 篇文档传递给语言模型\n    --retriever_n_context 40 \\ # 使用前 40 篇文档对检索器进行微调\n    --name ${EXPERIMENT_NAME} \\ # 实验名称（日志和模型也将保存到该目录）\n    --checkpoint_dir ${SAVE_DIR} \\ # 日志和模型检查点将保存到 ${SAVE_DIR}\u002F${EXPERIMENT_NAME}\n    --eval_freq ${TRAIN_STEPS} \\ # 训练结束后进行评估\n    --log_freq 4 \\ # 每 4 个训练步骤记录一次统计信息。日志会写入 ${SAVE_DIR}\u002F${EXPERIMENT_NAME}\u002Frun.log，同时如果已安装 TensorBoard，也会生成 TensorBoard 日志\n    --total_steps ${TRAIN_STEPS} \\ # 训练指定步数\n    --warmup_steps 5 \\\n    --save_freq ${TRAIN_STEPS} \\ # 在本示例中，训练完成后只保存一个检查点\n    --main_port $port \\ # 用于分布式训练\n    --write_results \\ # 写出预测结果——它们将保存在检查点文件夹中，${SAVE_DIR}\u002F${EXPERIMENT_NAME}\n    --task qa \\ # 我们执行的是问答任务\n    --index_mode flat \\ # 不使用 Faiss，保持索引为扁平结构（建议如此操作，除非使用超大索引或显存非常有限）\n    --passages \"${DATA_DIR}\u002Fcorpora\u002Fwiki\u002Fenwiki-dec2018\u002Ftext-list-100-sec.jsonl\" \"${DATA_DIR}\u002Fcorpora\u002Fwiki\u002Fenwiki-dec2018\u002Finfobox.jsonl\"\\ # 输入维基百科段落以构建索引并进行检索（我们同时使用文本和信息框）\n    --save_index_path ${SAVE_DIR}\u002F${EXPERIMENT_NAME}\u002Fsaved_index # 将构建的索引保存到此路径\n```\n\n训练脚本首先会对 2018 年版维基百科构建索引并将其保存到检查点文件夹内（`${SAVE_DIR}\u002F${EXPERIMENT_NAME}`）。随后，脚本将以 64 抽样的方式对 Atlas-large NQ 模型进行少样本微调，共 30 步，并从整个 2018 年版维基百科中检索内容。该脚本仅微调检索器的查询编码器及 FID 参数，而保持段落编码器冻结不动（详情请参阅论文或[下方](#strategies-for-dealing-with-stale-indices)的相关说明）。最后，脚本将对验证集进行评估并保存检查点。你可以在 `${SAVE_DIR}\u002F${EXPERIMENT_NAME}\u002Frun.log` 中查看实验日志，其中记录了约 38% 的 NQ 验证集精确匹配分数（我们的运行结果为 38.4%），以及可进一步检查的预测结果。\n\n要评估模型性能（例如在保留的测试集上），我们可以使用 `evaluate.py` 入口脚本：\n\n```bash\nsrun python evaluate.py \\\n    --name 'my-nq-64-shot-example-evaluation' \\\n    --generation_max_length 16 \\\n    --gold_score_mode \"pdist\" \\\n    --precision fp32 \\\n    --reader_model_type google\u002Ft5-${size}-lm-adapt \\\n    --text_maxlength 512 \\\n    --model_path ${SAVE_DIR}\u002F${EXPERIMENT_NAME}\u002Fcheckpoint\u002Fstep-30 \\ # 现在指向我们刚刚训练好的模型\n    --eval_data \"${DATA_DIR}\u002Fdata\u002Fnq_data\u002Fdev.jsonl ${DATA_DIR}\u002Fdata\u002Fnq_data\u002Ftest.jsonl\" \\ # 这次我们将同时评估验证集和测试集\n    --per_gpu_batch_size 1 \\\n    --n_context 40 --retriever_n_context 40 \\\n    --checkpoint_dir ${SAVE_DIR} \\\n    --main_port $port \\\n    --index_mode \"flat\"  \\\n    --task \"qa\" \\\n    --load_index_path ${SAVE_DIR}\u002F${EXPERIMENT_NAME}\u002Fsaved_index\\ # 不再重新嵌入所有维基百科段落，而是直接加载我们之前保存的索引\n    --write_results # 写出推理结果\n```\n\n该脚本将加载模型，并由于指定了通过 `--load_index_path` 加载已保存的索引，因此不会像之前那样从段落中重新嵌入，而是直接使用索引进行检索。随后，它将对开发集和测试集进行评估。检查 `${SAVE_DIR}\u002Fmy-nq-64-shot-example-evaluation\u002Frun.log` 中保存的日志，你会看到与先前相同的验证集精确匹配分数，以及约 38% 的测试集分数（我们的情况是 38.8% EM）。\n\n本自述文件的其余部分将详细描述数据、代码和功能。\n\n## 可下载的数据和模型\n\n目前，Atlas 的维基百科语料库、预训练模型以及预构建的维基百科索引均可下载。\n\n点击展开：\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"corpora\">语料库\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\n我们用于 Atlas 检索和预训练的预处理维基百科转储文件可按如下方式下载：\n\n```bash\npython preprocessing\u002Fdownload_corpus.py --corpus {语料下载键} --output_directory ${DATA_DIR} \n```\n上述命令将下载一个语料并将其解压到 `${DATA_DIR}\u002F{语料下载键}`。\n\n可用的语料如下表所示：\n\n| 语料名称      | 语料下载键 | 描述 | 大小 |\n| ----------- | ----------- | --------|  ---- |\n| enwiki-dec2017      | `corpora\u002Fwiki\u002Fenwiki-dec2017` | 2017年12月的维基百科转储，已预处理为段落       |  30.4M (26.9M 文本, 2.7M 信息框)| \n| enwiki-dec2018      | `corpora\u002Fwiki\u002Fenwiki-dec2018` | 2018年12月的维基百科转储，已预处理为段落（推荐用于 NQ、TriviaQA） | 32.1M (28.4M 文本, 3.7M 信息框) |\n| enwiki-aug2019      | `corpora\u002Fwiki\u002Fenwiki-aug2019` | 2019年8月的维基百科转储，已预处理为段落       | 33.1M (29.4M 文本, 3.8M 信息框)  |\n| enwiki-dec2020      | `corpora\u002Fwiki\u002Fenwiki-dec2020` | 2020年12月的维基百科转储，已预处理为段落       | 35.6M (31.5M 文本, 4.1M 信息框) |\n| enwiki-dec2021      | `corpora\u002Fwiki\u002Fenwiki-dec2021` | 2021年12月的维基百科转储，已预处理为段落       | 37.5M (33.1M 文本, 4.3M 信息框) |\n\n段落文件采用 jsonl 格式，每行序列化为一个 JSON 对象。默认情况下，每个段落应按以下格式组织：\n\n```python\n{\n    \"id\": \"0\", # 段落应具有唯一 ID\n    \"title\": \"兰花\", # 应指定该段落来自的页面标题（如果没有合适的标题，可以为空字符串）\n    \"text\": \"兰花与其他植物很容易区分，因为它们具有一些非常明显的衍生特征或共源性状。其中包括：花的两侧对称性（合轴对称）、许多倒置的花朵、几乎总是高度特化的唇瓣、雄蕊和雌蕊融合在一起，以及极其微小的种子。\", # 段落的主要文本\n    \"section\": \"描述\" # 可选字段，表示段落所属的小节标题；如果非空，则默认会将此字段附加到标题后，形成 {title}: {section}\n    ... # 您还可以添加其他字段以方便分析，但这些字段实际上不会被使用\n}\n```\n\n如果您按照上述格式创建自己的段落文件，那么与 Atlas 配合使用应该会非常简单。\n\n目前，我们无法开源论文中使用的 Common Crawl 索引。\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"models\">模型\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\n我们正在开源基础、大、XL 和 XXL 尺寸的预训练 Atlas 模型。这些模型同时包含预训练的检索器和阅读器权重。\n\n此外，我们还开源了性能最强的全微调 Natural Questions Atlas 模型，供希望进行最先进问答推理（或在其他问答任务上进行微调）的用户使用。\n\n模型可按如下方式下载：\n\n```bash\npython preprocessing\u002Fdownload_model.py --model {模型下载键} --output_directory ${DATA_DIR} \n```\n\n这将会把请求的模型下载到 `${DATA_DIR}\u002F{模型下载键}`，随后您可以通过将 `${DATA_DIR}\u002F{模型下载键}` 传递给 `--model_path` 参数，在脚本中使用这些模型。\n\n下表详细列出了可用的模型：\n\n| 模型 | 模型下载键 | 描述 | 参数量（阅读器 \u002F 检索器） |\n| ----------- | ----------- | --------| ----|\n| Atlas-xxl | `models\u002Fatlas\u002Fxxl` | 预训练的 Atlas XXL 模型 | 11B \u002F 110M |\n| Atlas-xl | `models\u002Fatlas\u002Fxl` | 预训练的 Atlas XL 模型 | 3B \u002F 110M |\n| Atlas-large | `models\u002Fatlas\u002Flarge` | 预训练的 Atlas 大模型 | 770M \u002F 110M |\n| Atlas-base | `models\u002Fatlas\u002Fbase` | 预训练的 Atlas 基础模型 | 220M \u002F 110M |\n| NQ 微调后的 Atlas-xxl | `models\u002Fatlas_nq\u002Fxxl` | 经 Natural Questions 数据微调的 Atlas XXL 模型 | 11B \u002F 110M |\n| NQ 微调后的 Atlas-xl | `models\u002Fatlas_nq\u002Fxl` | 经 Natural Questions 数据微调的 Atlas XL 模型 | 3B \u002F 110M |\n| NQ 微调后的 Atlas-large | `models\u002Fatlas_nq\u002Flarge` | 经 Natural Questions 数据微调的 Atlas 大模型 | 770M \u002F 110M |\n| NQ 微调后的 Atlas-base | `models\u002Fatlas_nq\u002Fbase` | 经 Natural Questions 数据微调的 Atlas 基础模型 | 220M \u002F 110M |\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"prebuilt-indices\">预构建索引\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\n如果未提供索引，Atlas 会自动构建一个索引。这种方式虽然方便，但耗时较长，尤其是在 GPU 工作节点较少或索引规模较大的情况下。\n\n因此，我们提供了针对预训练 Atlas 检查点以及经 Natural Questions 数据微调的 Atlas 检查点的 wiki-dec2018 语料的预计算索引供下载。\n\n这些索引可按如下方式下载：\n```bash\npython preprocessing\u002Fdownload_index.py --index {索引下载键} --output_directory ${DATA_DIR} \n```\n\n上述脚本将下载请求的预训练索引，并将其保存到 `${DATA_DIR}\u002F{索引下载键}`。随后，您可以通过将这些索引传递给 `--load_index_path` 参数，在训练或评估中使用它们。关于索引的保存和加载的更多细节，请参阅 [检索与索引详情](#retrieval-and-index-details)。\n\n可供下载的索引如下表所示：\n\n| 索引  | 索引下载键 | 对应模型 |  描述 |\n| --------| ------| --------| ------|\n| Atlas XXL wiki-dec2018 索引 | `indices\u002Fatlas\u002Fwiki\u002Fxxl` | `models\u002Fatlas\u002Fxxl` | 针对预训练 Atlas-xxl 模型的 wiki-dec2018 语料预计算索引 |\n| Atlas XL wiki-dec2018 索引 | `indices\u002Fatlas\u002Fwiki\u002Fxl` | `models\u002Fatlas\u002Fxl` | 针对预训练 Atlas-xl 模型的 wiki-dec2018 语料预计算索引 |\n| Atlas large wiki-dec2018 索引 | `indices\u002Fatlas\u002Fwiki\u002Flarge` | `models\u002Fatlas\u002Flarge` | 针对预训练 Atlas-large 模型的 wiki-dec2018 语料预计算索引 |\n| Atlas base wiki-dec2018 索引 | `indices\u002Fatlas\u002Fwiki\u002Fbase` | `models\u002Fatlas\u002Fbase` | 针对预训练 Atlas-base 模型的 wiki-dec2018 语料预计算索引 |\n| Atlas-nq XXL wiki-dec2018 索引 | `indices\u002Fatlas_nq\u002Fwiki\u002Fxxl` | `models\u002Fatlas_nq\u002Fxxl` | 针对经 Natural Questions 数据微调的 Atlas xxl 模型的 wiki-dec2018 语料预计算索引 |\n| Atlas-nq XL wiki-dec2018 索引 | `indices\u002Fatlas_nq\u002Fwiki\u002Fxl` | `models\u002Fatlas\u002Fxl` | 针对经 Natural Questions 数据微调的 Atlas xl 模型的 wiki-dec2018 语料预计算索引 |\n| Atlas-nq large wiki-dec2018 索引 | `indices\u002Fatlas_nq\u002Fwiki\u002Flarge` | `models\u002Fatlas\u002Flarge` | 针对经 Natural Questions 数据微调的 Atlas 大模型的 wiki-dec2018 语料预计算索引 |\n| Atlas-nq base wiki-dec2018 索引 | `indices\u002Fatlas_nq\u002Fwiki\u002Fbase` | `models\u002Fatlas\u002Fbase` | 针对经 Natural Questions 数据微调的 Atlas 基础模型的 wiki-dec2018 语料预计算索引 |\n\u003C\u002Fdetails>\n\n## 任务\n\nAtlas 可以在任何可以表示为“seq2seq”格式的监督学习任务上进行训练（或评估），其中输入是一个由一个或多个标记组成的序列，称为*query*，而输出则是由一个或多个标记组成的序列，称为*target*。\n例如，一个 query 可能是一个问题，如“百慕大三角在哪里？”；而对应的 target 则是该问题的答案，“北大西洋西部海域”。\n这种建模方式对于使用 T5 或 BART 等模型的用户来说会很熟悉。凡是可以使用这些模型的地方，Atlas 也同样适用，并且可以使用完全相同的数据：Atlas 将自行学会从其检索索引中检索段落——无需用于将段落与 (`query`, `target`) 对关联的标注。\n\nAtlas 的代码库通过命令行参数 `--task` 来配置当前执行的任务以及要调用的评估指标。\n我们实现了一个 `base` 任务，仅提供最基本的 seq2seq 训练支持，但同时也为掩码语言建模 (`mlm`)、语言建模 (`lm`)、维基百科章节生成 (`section`)、开放域问答 (`QA`)、选择题问答 (`multiple_choice`)、事实核查 (`fever`) 以及 KILT 套件 (`kilt`) 提供了更全面的功能。\n所有任务都期望输入数据采用 jsonl 格式，但具体的字段名称因任务而异。部分任务还具有额外的命令行参数和专门的评估方法。\n添加新任务非常简单，具体说明请参见 [这里](#defining-your-own-task)。\n\n以下将更详细地介绍各个任务，大多数任务在 `examples\u002F{task}\u002F` 目录下都有示例命令（点击展开）。\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"base-task\">基础任务\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\n\n这是最基础的任务，可能并不是您的最佳选择，尤其是当您的任务与已实现的其他任务非常相似时。\n通过向 `train.py` 或 `evaluate.py` 传递 `--task base` 参数即可指定此任务。\n该任务的训练\u002F验证\u002F测试数据应由 jsonl 文件组成，需以空格分隔的形式传入 `train.py` 或 `evaluate.py`，例如 `--train_data train_file_1.jsonl train_file_2.jsonl` 和 `--eval_data eval_file_1.jsonl eval_file_2.jsonl` 等。\n输入文件应包含 `query` 字段，用于存储输入查询字符串，以及 `target` 字段，用于存储输出目标字符串，例如：\n\n```json\n{\"query\": \"输入到 Atlas 的内容\", \"target\": \"希望 Atlas 生成的内容\"}\n```\n\n评估循环会计算评估损失，以及 Atlas 生成的输出与目标完全匹配的验证数据样本所占的比例。\n如果您向脚本传递 `--write_results` 参数，Atlas 在验证数据上的预测结果将以如下格式写入保存检查点的目录：\n\n```json\n{\"query\": \"输入到 Atlas 的内容\", \"answers\": [\"希望 Atlas 生成的内容\"], \"generation\": \"Atlas 对该查询的预测结果\", \"passages\": [\"检索到的段落列表\"]}\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"mlm-task\">掩码语言建模\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\n掩码语言建模任务实现了由 [T5](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683) 提出的掩码语言建模预训练任务。这也是我们在论文中用来预训练主模型 Atlas 的任务。\n通过向 `train.py` 传递 `--task mlm` 参数即可指定此任务。\n该任务的训练\u002F验证\u002F测试数据应由 jsonl 文件组成，需以 `--train_data train_file_1.jsonl train_file_2.jsonl` 和 `--eval_data eval_file_1.jsonl eval_file_2.jsonl` 等形式传入 `train.py`。\n这些文件应由具有以下格式的 JSON 对象组成：\n```python\n{\n  \"text\": \"需要施加噪声并训练去噪的文本片段\",\n  \"id\": \"该文本片段的唯一标识符\"\n  ... # 您还可以保留其他字段以便于分析，但这些字段实际上并不会被使用\n}\n```\n其设计意图是，您可以将用于检索语料库的文件（通过 `--passages` 传递）直接用作训练数据。\n该任务会应用 T5 的噪声函数处理 `text` 字段，从而自动创建输入和目标生成内容。\nMLM 任务还会阻止 Atlas 检索正在尝试去噪的那篇文档。它通过过滤掉检索结果中与正在去噪的实例具有相同 `id` 字段的文档来实现这一点。如果去噪训练数据与 Atlas 正在检索的文档来自同一语料库，这一功能就显得尤为重要。\n该任务具有以下特定于任务的参数：\n```\n  --mlm_noise_density MLM_NOISE_DENSITY\n      输入文本中应被掩码跨度覆盖的比例（默认：0.15）\n  --mlm_mean_noise_span_length MLM_MEAN_NOISE_SPAN_LENGTH\n      MLM 掩码跨度的平均长度（默认：3）\n  --min_words_per_lm_instance MIN_WORDS_PER_LM_INSTANCE\n      如果实例中的词数少于此值，则会跳过该实例，不参与 MLM\u002FLM\u002F章节生成任务（默认：无）\n```\n\n如果您传递 `--write_results` 参数，Atlas 会将其填空预测结果写入文件。\n在评估过程中，Atlas 会记录以下 MLM 任务的评估指标：\n* `eval_loss`：生成的 MLM 填空跨度的评估损失\n* `accuracy`：完美去噪的填空跨度所占比例\n* `f1`：正确去噪的填空跨度的 token F1 分数\n* `rouge_1`：生成的填空跨度相对于黄金参考掩码跨度的 ROUGE-1 分数\n* `rouge_2`：生成的填空跨度相对于黄金参考掩码跨度的 ROUGE-2 分数\n* `rouge_L`：生成的填空跨度相对于黄金参考掩码跨度的 ROUGE-L 分数\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"lm-task\">语言建模\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\n通过向 `train.py` 传递 `--task lm` 参数，Atlas 可以被训练为执行从左到右的语言建模任务。\n该任务的训练\u002F验证\u002F测试数据应由 jsonl 文件组成，需以 `--train_data train_file_1.jsonl train_file_2.jsonl` 和 `--eval_data eval_file_1.jsonl eval_file_2.jsonl` 等形式传入 `train.py`。\n这些文件应由具有以下格式的 JSON 对象组成：\n```python\n{\n  \"text\": \"需要训练 Atlas 生成的文本片段\",\n  \"id\": \"该文本片段的唯一标识符\"\n  ... # 您还可以保留其他字段以便于分析，但这些字段实际上并不会被使用\n}\n```\n其设计意图是，您可以将用于检索语料库的文件（通过 `--passages` 传递）直接用作训练数据。\n该任务会自动对 `text` 字段进行预处理，将其随机分为两部分：左侧作为条件上下文，右侧则作为 Atlas 模型将被训练生成的后续内容。\nLM 任务还会阻止 Atlas 检索正在尝试生成的同一文档。它通过过滤掉检索结果中与正在生成的实例具有相同 `id` 字段的文档来实现这一点。如果去噪训练数据与 Atlas 正在检索的文档来自同一语料库，这一功能就显得尤为重要。\n\n该任务具有以下特定于任务的参数：\n```\n  --min_words_per_lm_instance MIN_WORDS_PER_LM_INSTANCE\n      具有少于 min_words_per_lm_instance 个词的实例将被跳过，不参与 MLM\u002FLM\u002FSection 生成（默认：无）\n  --min_lm_context_ratio MIN_LM_CONTEXT_RATIO\n      将文本分割为两个部分用于语言建模。左侧部分作为条件上下文，右侧部分用于生成。左侧部分必须大于右侧部分的 min_lm_context_ratio（默认：0.5）\n  --max_lm_context_ratio MAX_LM_CONTEXT_RATIO\n      将文本分割为两个部分用于语言建模。左侧部分作为条件上下文，右侧部分用于生成。左侧部分必须小于右侧部分的 max_lm_context_ratio（默认：0.5）\n```\n\n如果您传递 `--write_results`，Atlas 会将其语言模型预测结果写入文件。\n\n在评估过程中，Atlas 会记录以下语言模型评估指标：\n* `eval_loss`：参考数据中续写的评估读取损失\n* `accuracy`：完全正确预测的续写比例\n* `f1`：生成续写的 token F1 分数，表示正确生成的比例\n* `rouge_1`：生成续写相对于黄金参考续写的 ROUGE-1 分数\n* `rouge_2`：生成续写相对于黄金参考续写的 ROUGE-2 分数\n* `rouge_L`：生成续写相对于黄金参考续写的 ROUGE-L 分数\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"section-task\">维基百科章节生成\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\n通过向 `train.py` 传递 `--task section`，可以训练 Atlas 根据维基百科条目的标题和章节标题生成相应段落的文本。\n\n此任务的训练\u002F验证\u002F测试数据应由 JSONL 文件组成，其格式应与维基百科转储中的 `text-list-100-sec.jsonl` 文件一致。这些文件可以通过遵循[可下载的数据和模型](#available-data-and-Models-for-download)中的说明获取，例如训练文件：`enwiki-dec2018\u002Ftext-list-100-sec.jsonl`。这些文件应由每行一个 JSON 对象组成，格式如下：\n```json\n{\n  \"id\": \"3793043\", \n  \"title\": \"百慕大三角\",\n  \"section\": \"指南针偏差\",\n  \"text\": \" 指南针问题是许多百慕大三角事件中经常提到的现象之一。尽管有人推测该地区可能存在异常的局部磁异常，但至今尚未发现此类异常。实际上，指南针会因与地磁极的关系而产生自然的偏差，这是航海家们几个世纪以来就已知的事实。\"\n}\n```\n该任务会自动将输入查询格式化为“{标题}, {章节}”——例如，在此示例中，输入到 Atlas 的内容将是 `百慕大三角, 指南针偏差`。输出将是示例中的 `text` 字段。\n`section` 任务会防止 Atlas 生成与其检索到的同一段落相同的内容。它通过过滤掉检索结果中与正在生成的实例具有相同 `id` 字段的段落来实现这一点。\n\n该任务具有以下特定于任务的参数：\n```\n  --min_words_per_lm_instance MIN_WORDS_PER_LM_INSTANCE\n      具有少于 min_words_per_lm_instance 个词的实例将被跳过，不参与 MLM\u002FLM\u002FSection 生成（默认：无）\n```\n\n如果您传递 `--write_results`，Atlas 会将其生成的维基百科章节文本预测结果写入文件。\n\n在评估过程中，Atlas 会记录以下 `section` 任务的评估指标：\n* `eval_loss`：参考数据中续写的评估读取损失\n* `accuracy`：完全正确预测的续写比例\n* `f1`：生成续写的 token F1 分数，表示正确生成的比例\n* `rouge_1`：生成续写相对于黄金参考续写的 ROUGE-1 分数\n* `rouge_2`：生成续写相对于黄金参考续写的 ROUGE-2 分数\n* `rouge_L`：生成续写相对于黄金参考续写的 ROUGE-L 分数\n\n\u003C\u002Fdetails>\n\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"qa-task\">开放域问答（如 NaturalQuestions、TriviaQA、TempLama）\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\n通过向 `train.py` 或 `evaluate.py` 传递 `--task qa`，可以训练 Atlas 回答开放域问答问题。在[快速入门与代码库概览](#getting-started-and-codebase-at-a-glance)部分有一个 QA 的示例。\n\n我们在论文中使用此任务处理 NaturalQuestions、TriviaQA 和 TempLama 数据集。\n\n此任务的训练\u002F验证\u002F测试数据应由 JSONL 文件组成，这些文件应以 `--train_data train_file_1.jsonl train_file_2.jsonl` 和 `--eval_data eval_file_1.jsonl eval_file_2.jsonl` 等形式传递给 `train.py`。\n每个文件应包含一行 JSON 实例，格式如下：\n```python\n{\n  \"question\": \"百慕大三角在哪里\",\n  \"answers\": [\"北大西洋西部\"],\n   ... # 您可以保留其他字段以便于分析，但这些字段不会实际用于训练\n}\n```\n问题将根据任务特定参数 `--qa_prompt_format` 进行格式化，默认值为 `question: {question} answer: \u003Cextra_id_0>`。\n例如，上述问题将自动格式化为输入到 Atlas 的查询：`question: 百慕大三角在哪里 answer: \u003Cextra_id_0>`。\n监督目标来自 `target` 字段。如果该字段不存在，则监督目标将从 `answers` 字段中的可用答案中随机选择，并格式化为 `\u003Cextra_id_0> {answer}`。\n\n如果您传递 `--write_results`，Atlas 会将其预测的答案写入文件。\n\n在评估过程中，Atlas 会记录以下开放域问答的评估指标：\n* `eval_loss`：评估答案的评估读取损失\n* `exact_match`：生成答案的开放域问答精确匹配分数\n* `f1`：生成答案的开放域问答 F1 分数\n\n#### Natural Questions 和 TriviaQA\n\n您可以通过运行以下命令下载 NaturalQuestions 和 TriviaQA 数据：\n\n```bash\npython preprocessing\u002Fprepare_qa.py --output_directory ${DATA_DIR} \n```\n\n这将下载 `train.jsonl`、`train.64-shot.jsonl`（我们使用的少量样本训练数据集）、`dev.jsonl` 和 `test.jsonl`，并将其保存到 `${DATA_DIR}\u002Fdata\u002Fnq_data` 和 `${DATA_DIR}\u002Fdata\u002Ftriviaqa_data` 中。\n\n有关使用 NQ 维基百科索引进行少量样本和标准微调及评估的示例脚本可在 `examples\u002Fnq` 中找到。只需替换训练\u002F验证\u002F测试文件，即可将该脚本用于 TriviaQA。\n\n#### TempLama\n\n我们基于 TempLAMA 数据集定义了一个完形填空式问答任务，用于评估索引的忠实性和时间迁移能力。\n\n您可以通过运行以下脚本下载 TempLAMA 数据并创建和格式化我们衍生的数据集：\n\n```bash\npython preprocessing\u002Fprepare_templama.py --output_directory ${DATA_DIR} \n```\n\n这将创建文件 `temp_lama.train.2017.jsonl`、`temp_lama.valid.2017.jsonl`、`temp_lama.test.2017.jsonl`、`temp_lama.train.2020.jsonl`、`temp_lama.valid.2020.jsonl`、`temp_lama.test.2020.jsonl`，位于 `${DATA_DIR}\u002Fdata\u002Ftemplama_data\u002F` 目录下。\n这些文件将包含完形填空题，答案会根据年份有所不同。\n\n运行 TempLama 训练和评估的示例脚本可以在 `examples\u002Ftemplama` 中找到。（注意使用了 `qa_prompt_format {question}`，它会关闭 TriviaQA 和 NQ 所使用的自动 QA 提示格式化功能）\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"mcqa-task\">多项选择题回答（例如 MMLU）\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\nAtlas 可以通过在 `train.py` 或 `evaluate.py` 中添加 `--task multiple_choice` 参数来训练回答多项选择题。我们在 MMLU 的实验中使用了这一任务。\n该任务的训练\u002F验证\u002F测试数据应由 jsonl 文件组成，这些文件应作为 `--train_data train_file_1.jsonl train_file_2.jsonl` 和 `--eval_data eval_file_1.jsonl eval_file_2.jsonl` 等参数传递给 `train.py`。\n每个文件每行应包含一个 JSON 实例，格式如下：\n```python\n{\n  \"question\": \"以下哪个是包含垂体的体腔？\", \n  \"options\": {\n    \"A\": \"腹腔\",\n    \"B\": \"颅腔\",\n    \"C\": \"胸膜腔\", \n    \"D\": \"脊髓腔\"\n    ... # 你可以有更多的（或更少的）选项，只要它们的键是按字母顺序连续的大写字母，从 A 开始即可\n  }, \n  \"answer\": \"B\",\n  ... # 你还可以保留其他字段以便于分析，但这些字段并不会被实际使用\n}\n```\n这些数据会被自动格式化为 Atlas 的输入查询，形式为 `question: {question} answers: (A) {options['A']} (B) {options['B']} (C) {options['C']} (D) {options['D']} Answer: \u003Cextra_id_0>`，目标生成格式为 `\u003Cextra_id_0> {answer letter}`。\n上述示例会被格式化为：`question: {以下哪个是包含垂体的体腔？ answers: (A) 腹腔 (B) 颅腔 (C) 胸膜腔 (D) 脊髓腔 Answer: \u003Cextra_id_0>`，目标生成为 `{extra_id_0} B`。\n\n\n多项选择问答任务有以下特定参数：\n```\n  --multiple_choice_num_options\n      多项选择问答中选项的数量（MMLU 是 4 个）（默认值：4）\n  --multiple_choice_train_permutations {single,cyclic,all}\n      在进行多项选择训练时（例如 MMLU），是否启用答案顺序的排列组合。这有助于消除模型对任意答案顺序的偏好，从而提升效果。建议使用 'all' 模式。single：不进行排列；cyclic：循环排列；all：所有可能的答案排列组合。（默认值：single）\n  --multiple_choice_eval_permutations {single,cyclic,all}\n      在进行多项选择评估时（例如 MMLU），是否启用答案顺序的排列组合。这同样可以减少模型对答案顺序的偏好，从而提升效果。使用 'all' 模式效果最佳，但速度较慢。'cyclic' 是一个不错的折中方案。single：不进行排列；cyclic：循环排列；all：所有可能的答案排列组合。（默认值：single）\n```\n\n排列选项会自动复制输入数据，并对答案顺序进行排列（例如，“A”变为“颅腔”，“B”变为“胸膜腔”等）。\n当监督数据量非常少（或零样本情况下），这种方法可以显著提升效果。\n如果你使用了 `--multiple_choice_eval_permutations` 参数中的 `cyclic` 或 `all` 选项，代码会自动对不同排列组合的结果进行汇总，为你提供最终的评估结果。\n关于排列去偏化的更多细节，请参阅论文 [Atlas: Few-shot Learning with Retrieval Augmented Language Models](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2208.03299.pdf) 的附录。\n\n如果你指定了 `--write_results`，Atlas 会将其预测的答案写入文件，格式如下：\n\n```json\n{\n  \"question\": \"应用提示模板后的输入内容\",\n  \"generation\": \"在对排列组合结果进行汇总后，概率最高的答案字母\",\n  \"choice_probs\": \"每个答案选项的概率（归一化到总选项数）\",\n  \"all_probs\": \"未汇总前的所有答案排列组合的概率\",\n  \"permutations\": [\"针对每种答案排序组合的预测对象列表\"]\n}\n```\n\n#### MMLU\n\n专门用于运行 MMLU 实验的 ReadMe 文档可在 [这里](.\u002Fexample_scripts\u002Fmmlu\u002FREADME_MMLU.md) 查看。\n我们提供了一个用于下载和预处理 MMLU 数据的工具，并且在 `examples\u002Fmmlu` 中提供了针对我们所探索的各种实验设置的示例脚本。\n这些内容在 MMLU 的专用 ReadMe 文档中都有详细说明。\n\n\u003C\u002Fdetails>\n\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"fever-task\">FEVER 事实核查\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\nAtlas 可以通过使用 `--task fever` 参数在 `train.py` 或 `evaluate.py` 中训练，使其能够根据语料库将文本陈述分类为“支持”、“反驳”或“信息不足”，例如用于 FEVER 任务。\n你可以通过运行以下脚本来下载 FEVER 数据：\n\n```bash\npython preprocessing\u002Fprepare_fever.py --output_directory ${DATA_DIR} \n```\n\n该任务的训练\u002F验证\u002F测试数据应由 jsonl 文件组成，这些文件应作为 `--train_data train_file_1.jsonl train_file_2.jsonl` 和 `--eval_data eval_file_1.jsonl eval_file_2.jsonl` 等参数传递给 `train.py`。\n每个文件每行应包含一个 JSON 实例，格式如下：\n\n```python\n{\n  \"claim\": \"需要评估的陈述\", \n  \"label\": \"要么是 'SUPPORTS'、'REFUTES'，要么是 'NOT ENOUGH INFO'\",\n   ... # 你可以保留其他字段以便于分析，但这些字段并不会被实际使用\n}\n```\nAtlas 会自动处理这些实例，并将其格式化为输入 `question: {claim} answer: \u003Cextra_id_0>` 和输出 `\u003Cextra_id_0> {true, false 或 maybe}`。\n如果你指定了 `--write_results`，Atlas 会将其预测的标签写入文件。\n在评估过程中，Atlas 还会记录以下开放域问答的评估指标：\n\n* `accuracy`: 模型正确分类的陈述数量。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\n\u003Ch4 name=\"kilt-task\">KILT\u003C\u002Fh4>\n\u003C\u002Fsummary>\n\nAtlas 可以通过在 `train.py` 或 `evaluate.py` 中使用 `--task kilt` 参数来训练执行 KILT 任务。\nKILT 数据可以从 [这里](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FKILT) 获取。\n\n该任务的训练\u002F验证\u002F测试数据应由 JSONL 文件组成，这些文件应作为 `--train_data train_file_1.jsonl train_file_2.jsonl` 和 `--eval_data eval_file_1.jsonl eval_file_2.jsonl` 等参数传递给 `train.py`。\n每个文件应每行包含一个 JSON 实例，格式如下（即代码库可以直接接受 KILT 格式）：\n```python\n{'id': # 原始数据点的 ID，如果有的话；否则为唯一 ID\n 'input': # 问题 \u002F 主张 \u002F 句子 \u002F 等等\n 'output': [ # 每个元素可能包含答案、证据来源或两者\n    {\n    'answer': # 文本形式的答案\n    'provenance': [\n        # 针对答案的 KILT 数据集中的证据集合\n        {\n            'wikipedia_id':  # *必须* \n            'title': \n            'section': \n            'start_paragraph_id': \n            'start_character': \n            'end_paragraph_id':\n            'end_character': \n            'bleu_score': # 相对于原始证据的 BLEU 分数\n            'meta': # 数据集\u002F任务特定的元数据\n        }\n        ] \n      }\n    ]\n 'meta': # 数据集\u002F任务特定的元数据\n }\n```\nAtlas 会自动根据 `input` 字段将这些实例处理为 Atlas 查询输入，并根据 `answer` 字段生成目标输出。\n\n如果您传递 `--write_results` 参数，Atlas 会将其预测标签写入文件。\n\n在评估过程中，Atlas 将记录以下开放域问答任务的评估指标：\n* `accuracy`: 生成内容与参考答案完全匹配的频率\n* `exact_match`: 在应用开放域问答标准化后，生成内容与参考答案完全匹配的频率\n* `f1`: 生成内容与参考答案之间的 token 级别 F1 分数重叠\n\n\u003C\u002Fdetails>\n\n\n\n\n## 检索与索引详情\n\n以下部分提供了关于检索和索引的更多细节。\n\n如前言中简要提及，Atlas 代码利用现代大型神经网络训练的并行特性来处理检索。具体来说，所有现代 GPU 上的训练（以及推理）都需要多个 GPU 工作节点来进行并行计算。\n\nAtlas 利用了这一已有的分布式架构。它会将检索索引（文档片段 + 嵌入后的文档片段）划分为 N 个大小相等的分片，每个 GPU 工作节点负责一个分片。默认情况下，检索完全在 GPU 上使用 PyTorch 进行精确搜索（不使用 FAISS 的近似搜索），尽管如此，由于搜索是在所有 GPU 工作节点之间并行进行的，因此速度仍然很快（假设 GPU 数量足够）。\n\n了解检索的具体实现机制并非运行该代码库的必要条件，但可能有助于针对特定需求调整代码库。因此，我们在此提供简要说明。为简单起见，我们假设正在进行开放域问答任务，每 GPU 的批处理大小为 1，共有 W 个 GPU 工作节点，检索索引中总共有 N 条文档。\n\n检索器需要完成两个高层次的功能：\n\u003Cdetails>\n\u003Csummary>\n1. 构建\u002F刷新嵌入\n\u003C\u002Fsummary>\n\n构建或刷新索引涉及为检索索引中的每条文档计算嵌入，然后将其保存在内存中，以便在后续检索时能够快速计算最大内积或最近邻。（假设我们有 W 个 GPU 工作节点，检索索引中总共有 N 条文档）\n\n请注意，每个 GPU 工作节点负责 N\u002FW 条文档的分片。文档嵌入的计算非常简单，过程如下：\n* 暂停正在进行的模型训练\n* 每个工作节点并行计算其分片中文档的嵌入，按批次迭代处理，并将结果保存在一个大型 PyTorch 张量中（FAISS 支持也可用，但稍后讨论）。\n* 当所有工作节点都完成其分片的嵌入计算后，我们可以继续进行模型训练。\n\nN 可能非常大（1000万到1亿），因此嵌入所有文档的速度会很慢。然而，由于可以在各个工作节点之间并行化，如果 GPU 数量足够，整个过程可以相对较快。尽管如此，对于大型索引，这个过程仍然可能耗时较长。我们提供了[索引保存](#index-saving-and-loading)功能，可以将索引保存到磁盘，以避免频繁重建索引。此外，随着检索器的不断训练，缓存的嵌入可能会过时或“陈旧”，需要重新计算，这会带来额外的成本。有关减少或避免频繁刷新索引的方法，请参阅[处理陈旧索引的策略](#strategies-for-dealing-with-stale-indices)。\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\n2. 执行分布式检索\n\u003C\u002Fsummary>\n\nAtlas 在训练循环中执行检索：即在前向传播过程中调用检索功能。下面我们将简要描述 Atlas 前向传播中的步骤，包括如何完成检索：\n（假设我们有 W 个 GPU 工作节点，检索索引中总共有 N 条文档，为简化起见，假设每 GPU 的训练批大小为 1，且任务为问答任务）。\n\n* 每个工作节点会收到一个问题，并将其嵌入为查询向量。\n* 然后执行全归约操作，使得每个工作节点都拥有全部 W 个查询向量的副本（即当前小批量中所有问题的查询向量）。\n* 接着，每个工作节点在其负责的文档分片上对小批量中的所有查询向量执行 GPU 上的最大内积搜索。\n* 每个工作节点会将其分片中每个查询的前 K 个结果通过归约操作发送回嵌入该查询的 GPU。\n* 最终，每个工作节点会获得其查询对应的前 W × K 个结果，从中选出真正的前 K 个结果。\n* 检索过程至此完成，随后可以继续进行标准的分布式数据并行前向传播（即运行模型前向传播、计算梯度、在各工作节点之间聚合梯度，并更新模型参数）。\n\n\u003C\u002Fdetails>\n\n### Flat 与 Faiss\n\nAtlas 实现了两种索引模式。默认情况下，我们使用精确搜索（“Flat”）索引进行检索，检索过程在 GPU 上通过纯 PyTorch 完成。我们还支持 [FAISS](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffaiss) 模式，该模式有助于为超大规模索引节省 GPU 内存，或在 GPU 内存非常有限的情况下使用。FAISS 是一个用于快速近似最近邻搜索的库。由于我们的检索是在 GPU 上进行的，通常不需要进一步的搜索加速，但 FAISS 可以用来压缩内存中索引的大小，这对于非常大的索引可能会有所帮助。\n\n要使用的模式由 `--index_mode {\"flat\"|\"faiss\"}` 指定。对于大多数用例，`flat` 索引就足够了，而且通常更为推荐。\n\n如果使用 faiss 索引，用户应指定要使用的 faiss 索引类型，可选如下：\n\n```\n  --faiss_index_type {ivfflat,flat,ivfsq,ivfpq,pq}\n      IVFFlat、IndexFlatIP、IVFScalarQuantizer、IndexPQ 或 IndexIVFPQ（需配合 faiss-gpu 使用；默认值：flat）\n  --faiss_code_size FAISS_CODE_SIZE\n      PQ\u002FSQ 量化参数（默认值：无）\n```\n\n使用 faiss 索引时的一个良好默认设置是 `--faiss_index_type ivfpq --faiss_code_size 16`。这将使用 IVF-PQ 索引，其中 IVF 聚类的数量设置为每个分片嵌入数量的平方根，PQ 代码长度为 16。有关此索引结构的更多详细信息，请参阅 faiss 文档 [FAISS](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffaiss)。\n\n### 索引的保存与加载\n\n索引（段落和嵌入分片）可以保存到磁盘并在需要时加载，以避免重新计算它们。有关一些可下载的索引，请参阅[上文](#prebuilt-indices)。\n\n可以通过 `--save_index_path {path\u002Fto\u002Fdirectory\u002Fsave\u002Findex\u002Fin}` 开启索引保存功能，该命令会创建一个目录，并将每个工作进程的嵌入分片（作为磁盘上的 PyTorch 张量）和段落分片（作为 pickle 文件）保存到该目录中。\n\n要加载索引，只需传递 `--load_index_path {path}`，即可从指定路径加载索引。\n\n保存和加载功能同时适用于 `flat` 和 `faiss` 模式。\n\n为了便于在使用的工作进程数与创建索引时不同的情况下加载索引，我们可以配置 `--save_index_n_shards N`，这会将索引保存为 N 个分片（例如，如果有 32 个工作进程，可以传递 `--save_index_n_shards 128`，将索引保存为 128 个分片）。当再次尝试加载索引时，比如使用 64 个工作进程，代码会自动判断每个工作进程应加载 2 个保存的文件。（注意：此功能仅适用于 `flat` 索引——对于 faiss 索引，只能加载与保存时工作进程数相同的索引。）\n\n### 处理过时索引的策略\n\n随着检索器的训练，存储在内存中的段落嵌入会逐渐过时。这会影响检索的准确性，并且在长时间内可能导致训练效果不佳或不稳定。Atlas 提供三种方法来应对这一问题：\n\n1. \u003Cb name=\"#index-refresh\">索引刷新\u003C\u002Fb>：最简单但也最昂贵的方法是使用最新的检索器嵌入器重新计算嵌入。索引刷新频率由 `--refresh_index` 参数控制。格式为：`startstep-endstep:refreshrate`，例如 `--refresh_index 0-1000:500,1000-10000:1000` 表示在前 1000 步中每 500 步刷新一次索引，随后从第 1000 步到第 10000 步每 1000 步刷新一次。也可以只传递一个数字，如 `--refresh_index 100` 表示每 100 步刷新一次索引。传递 `--refresh_index -1` 则表示永不刷新。我们通常在大型数据集和预训练中使用此设置。\n2. \u003Cb name=\"#overretrieve-with-reranking\">带重排序的超额检索\u003C\u002Fb>：在这种方法中，我们不刷新索引，而是检索前 L 个段落（其中 L > K），然后使用最新的嵌入器对这 L 个段落进行实时重排序，并从中选出前 K 个。如果真实的前 K 个确实包含在过时的前 L 个中，这种方法效果很好。要使用此方法，需传递 `--retrieve_with_rerank` 并指定 `--n_to_rerank_with_retrieve_with_rerank L`。此方法可以与索引刷新结合使用，以减少两次刷新之间的过时程度。\n3. \u003Cb name=\"#query-Side-finetuning\">查询端微调\u003C\u002Fb>：为了避免过时问题，我们可以固定检索器的段落嵌入器，仅训练查询嵌入器。如果训练数据量很大，这种方法会牺牲检索性能，但在少样本场景下效果较好。要启用此模式，需传递 `--query_side_retriever_training`。注意：通常检索器的段落编码器和查询编码器会共享参数——而此模式则例外，我们会解除参数绑定，以保持段落编码器不变。\n\n### 仅检索模式\n\nAtlas 在评估时可以完全以检索模式运行。这对于希望使用快速、可扩展、易于部署且支持 GPU 的密集型检索器的用户来说非常有用。\n\n在此模式下（仅适用于 `evaluate.py`），不会加载阅读语言模型，脚本将执行检索操作，并在传递了 `--write_results` 标志的情况下将检索结果写入文件。\n\n要使用此模式，需在 `evaluate.py` 中传递 `--retrieve_only`。`examples\u002Fnq\u002Fretrieve_only.sh` 中提供了一个使用此模式进行 NaturalQuestions 数据集检索的示例。\n\n### 使用预先检索或缓存的段落\n\n在某些情况下，用户可能已经完成了检索，并希望为其数据集缓存检索结果，或者事先知道最相关的段落，因此无需再进行检索。\n\n在这种情况下，可以通过以下两种方式让 Atlas 忽略检索步骤，直接使用用户指定的段落：1) 传递 `--use_file_passages` 标志；2) 在传入的训练\u002F评估文件中包含一个名为 `passages` 的 JSON 字段，其格式如下（以 `qa` 任务为例）：\n\n\u003Cdetails>\n\u003Csummary>\n（点击展开查看示例）\n\u003C\u002Fsummary>\n\n```python\n{\n  \"question\": \"百慕大三角在哪里\",\n  \"answers\": [\"北大西洋西部海域\"],\n  \"passages\": [\n    {\n      \"text\": \"第一段落的内容\",\n      \"title\": \"第一段落的标题\",\n      \"id\": \"第一段落的唯一标识符\"\n      ... # 其他字段也可存在，但不会被使用\n    },\n    {\n      \"text\": \"第二段落的内容\",\n      \"title\": \"第二段落的标题\",\n      \"id\": \"第二段落的唯一标识符\"\n    },\n    ... # 如有需要，可添加更多段落\n  ]\n}\n```\n\n\u003C\u002Fdetails>\n\n## 其他功能\n\n以下是 Atlas 为高级用户提供的其他功能：\n\n### 封闭书本模式\n\nAtlas 可以作为标准的非检索增强 T5 模型运行，在文献中常被称为“封闭书本”模式。这对于进行基线实验以及验证您的模型是否确实从针对特定任务的检索增强中受益很有帮助。传递 `--closed_book` 参数即可进行封闭书本训练，并忽略检索到的段落。\n\n### 指定格式\n\n可以通过注入格式字符串来更精细地控制输入如何呈现给 Atlas 模型：\n\n```\n  --encoder_format ENCODER_FORMAT\n    阅读器编码器预处理的格式字符串（默认: \"{query} title: {title} context: {text}\")\n  --retriever_format RETRIEVER_FORMAT\n    检索器编码器预处理的格式字符串（默认: \"{title} {text}\")\n```\n\n例如，传递 `--encoder_format \"{query} text: {text}\"` 将不会把检索到的段落标题传递给阅读器模型。\n\n\n### 实现您自己的任务\n\n要为 Atlas 实现新任务，有两种选择：最简单的方法是使用已实现的任务之一对您的任务进行预处理或格式化，使其兼容（`base` 任务应支持几乎所有潜在用例）。\n\n另一种方法是在 `src\u002Ftasks\u002Fyour_task_name.py` 中实现您自己的任务，并在 `src\u002Ftasks\u002F__init__.py` 中将其导入。\n\n请参阅 `src\u002Ftasks\u002Fqa.py` 以获取示例。\n\n`process` 函数接受传递给 `--train_data` 或 `--eval_data` 的原始解析后的 jsonl 对象，并应返回一个字典，包含 `{query: \"传递给 Atlas 的查询\", \"target\": \"目标字符串\", \"passages\": [黄金检索段落列表，可以为空]}`。\n\n`evaluate` 函数接受任务的预测生成和参考答案，并返回一个特定于任务的评估分数字典，代码库会针对所有评估实例计算这些分数的平均值。\n\n## 命令行参数完整列表：\n\n\u003Cdetails>\n\u003Csummary>\n点击展开\n\u003C\u002Fsummary>\n\n```\n用法: train.py\u002Fevaluate.py [-h] [--name NAME] [--checkpoint_dir CHECKPOINT_DIR] [--model_path MODEL_PATH] [--per_gpu_batch_size PER_GPU_BATCH_SIZE] [--per_gpu_embedder_batch_size PER_GPU_EMBEDDER_BATCH_SIZE] [--local_rank LOCAL_RANK]\n                [--main_port MAIN_PORT] [--seed SEED] [--log_freq LOG_FREQ] [--eval_freq EVAL_FREQ] [--save_freq SAVE_FREQ] [--train_data TRAIN_DATA [TRAIN_DATA ...]] [--eval_data EVAL_DATA [EVAL_DATA ...]] [--write_results]\n                [--dont_write_passages] [--load_index_path LOAD_INDEX_PATH] [--save_index_path SAVE_INDEX_PATH] [--save_index_n_shards SAVE_INDEX_N_SHARDS] [--index_mode {flat,faiss}] [--faiss_index_type {ivfflat,flat,ivfsq,sq,pq}]\n                [--faiss_code_size FAISS_CODE_SIZE] --reader_model_type\n                {t5-small,t5-base,t5-large,t5-3b,t5-11b,google\u002Ft5-v1_1-base,google\u002Ft5-v1_1-large,google\u002Ft5-v1_1-xl,google\u002Ft5-v1_1-xxl,google\u002Ft5-base-lm-adapt,google\u002Ft5-large-lm-adapt,google\u002Ft5-xl-lm-adapt,google\u002Ft5-xxl-lm-adapt}\n                [--text_maxlength TEXT_MAXLENGTH] [--target_maxlength TARGET_MAXLENGTH] [--n_context N_CONTEXT] [--passages PASSAGES [PASSAGES ...]] [--max_passages MAX_PASSAGES] [--retriever_model_path RETRIEVER_MODEL_PATH]\n                [--retrieve_only] [--train_retriever] [--use_file_passages] [--retriever_n_context RETRIEVER_N_CONTEXT] [--gold_score_mode {evalnormsum,loop,ppmean,emdr,pdist,adist}] [--closed_book]\n                [--temperature_score TEMPERATURE_SCORE] [--temperature_gold TEMPERATURE_GOLD] [--compute_crossattention_stats] [--filtering_overretrieve_ratio FILTERING_OVERRETRIEVE_RATIO]\n                [--freeze_retriever_steps FREEZE_RETRIEVER_STEPS] [--query_side_retriever_training] [--retrieve_with_rerank] [--n_to_rerank_with_retrieve_with_rerank N_TO_RERANK_WITH_RETRIEVE_WITH_RERANK]\n                [--decoder_format DECODER_FORMAT] [--decoder_prompt_format DECODER_PROMPT_FORMAT] [--encoder_format ENCODER_FORMAT] [--retriever_format RETRIEVER_FORMAT] [--generation_max_length GENERATION_MAX_LENGTH]\n                [--generation_min_length GENERATION_MIN_LENGTH] [--generation_length_penalty GENERATION_LENGTH_PENALTY] [--generation_num_beams GENERATION_NUM_BEAMS] [--task {base,mlm,lm,multiple_choice,kilt,section,fever,qa}]\n                [--mlm_noise_density MLM_NOISE_DENSITY] [--mlm_mean_noise_span_length MLM_MEAN_NOISE_SPAN_LENGTH] [--min_words_per_lm_instance MIN_WORDS_PER_LM_INSTANCE] [--min_lm_context_ratio MIN_LM_CONTEXT_RATIO]\n                [--max_lm_context_ratio MAX_LM_CONTEXT_RATIO] [--qa_prompt_format QA_PROMPT_FORMAT] [--multiple_choice_num_options MULTIPLE_CHOICE_NUM_OPTIONS] [--multiple_choice_train_permutations {single,cyclic,all}]\n                [--multiple_choice_eval_permutations {single,cyclic,all}] [--warmup_steps WARMUP_STEPS] [--total_steps TOTAL_STEPS] [--scheduler_steps SCHEDULER_STEPS] [--accumulation_steps ACCUMULATION_STEPS] [--dropout DROPOUT]\n                [--lr LR] [--lr_retriever LR_RETRIEVER] [--clip CLIP] [--scheduler {linear,cosine,fixed}] [--weight_decay WEIGHT_DECAY] [--save_optimizer] [--epsilon EPSILON] [--alpha ALPHA] [--beta2 BETA2]\n                [--refresh_index REFRESH_INDEX] [--shuffle] [--precision {fp16,fp32,bf16}] [--shard_optim] [--shard_grads] [--use_gradient_checkpoint_reader] [--use_gradient_checkpoint_retriever]\n\n可选参数：\n  -h, --help            显示此帮助信息并退出\n  --name NAME           实验名称，也用作目录名（默认值：experiment_name）\n  --checkpoint_dir CHECKPOINT_DIR\n                        模型保存在此目录下（默认值：.\u002Fcheckpoint\u002F）\n  --model_path MODEL_PATH\n                        用于初始化的预训练模型路径（传入 'none' 表示从 T5 和 Contriever 初始化）（默认值：无）\n  --per_gpu_batch_size PER_GPU_BATCH_SIZE\n                        每个 GPU\u002FCPU 的训练批次大小。（默认值：1）\n  --per_gpu_embedder_batch_size PER_GPU_EMBEDDER_BATCH_SIZE\n                        Embedder 每个 GPU 的批次大小。（默认值：512）\n  --local_rank LOCAL_RANK\n                        用于分布式训练：本地进程编号（默认值：-1）\n  --main_port MAIN_PORT\n                        主端口（用于多节点任务）（默认值：-1）\n  --seed SEED           初始化时使用的随机种子（默认值：0）\n  --log_freq LOG_FREQ   训练过程中每 \u003Clog_freq> 步记录一次训练统计信息（默认值：100）\n  --eval_freq EVAL_FREQ\n                        训练过程中每 \u003Ceval_freq> 步评估一次模型（默认值：500）\n  --save_freq SAVE_FREQ\n                        训练过程中每 \u003Csave_freq> 步保存一次模型（默认值：5000）\n  --train_data TRAIN_DATA [TRAIN_DATA ...]\n                        以空格分隔的 JSONL 格式训练数据集路径列表（默认值：空列表）\n  --eval_data EVAL_DATA [EVAL_DATA ...]\n                        以空格分隔的 JSONL 格式评估数据集路径列表（默认值：空列表）\n  --write_results       将评估结果保存到文件中（默认值：False）\n  --dont_write_passages\n                        如果要写结果，段落可能会占用大量空间，使用此标志可以不将段落写入导出的结果中（默认值：False）\n  --load_index_path LOAD_INDEX_PATH\n                        用于加载索引、段落嵌入和段落的路径（默认值：None）\n  --save_index_path SAVE_INDEX_PATH\n                        用于保存索引和\u002F或嵌入的路径（默认值：None）\n  --save_index_n_shards SAVE_INDEX_N_SHARDS\n                        将索引保存为多少个分片文件。必须是工作进程数的整数倍。（默认值：128）\n  --index_mode {flat,faiss}\n                        使用扁平的 PyTorch 索引或 Faiss 索引来检索 k 个最近邻（默认值：flat）\n  --faiss_index_type {ivfflat,flat,ivfsq,sq,pq}\n                        IVFFlat、IndexFlatIP、IVFScalarQuantizer、ScalarQuantizer 或带有 faiss-gpu 的 IndexPQ（默认值：flat）\n  --faiss_code_size FAISS_CODE_SIZE\n                        PQ 量化参数（默认值：None）\n  --reader_model_type {t5-small,t5-base,t5-large,t5-3b,t5-11b,google\u002Ft5-v1_1-base,google\u002Ft5-v1_1-large,google\u002Ft5-v1_1-xl,google\u002Ft5-v1_1-xxl,google\u002Ft5-base-lm-adapt,google\u002Ft5-large-lm-adapt,google\u002Ft5-xl-lm-adapt,google\u002Ft5-xxl-lm-adapt}\n                        阅读器 FID 模型的 T5 架构，例如 google\u002Ft5-xl-lm-adapt（默认值：None）\n  --text_maxlength TEXT_MAXLENGTH\n                        输入文本片段（问题+段落拼接后）的最大 token 数。超过此长度的输入将被截断。（默认值：200）\n  --target_maxlength TARGET_MAXLENGTH\n                        训练模型时目标输出的最大 token 长度。超过此长度的目标将被截断。如果设置为 -1，则不进行截断（默认值：None）\n  --n_context N_CONTEXT\n                        传递给阅读器的 top k 段落数量（默认值：1）\n  --passages PASSAGES [PASSAGES ...]\n                        包含要索引和检索的段落的 JSONL 文件路径列表。如果使用 --load_index_path 加载已保存的索引，则此参数无效（默认值：None）\n  --max_passages MAX_PASSAGES\n                        要索引的最大段落数量。设置为 -1 表示读取段落文件中的所有段落（默认值：-1）\n  --retriever_model_path RETRIEVER_MODEL_PATH\n                        用于初始化的 Contriever 模型路径（如果传入 --model_path 参数，则覆盖此值）（默认值：facebook\u002Fcontriever）\n  --retrieve_only       传入此参数以防止加载阅读器，仅运行检索评估（默认值：False）\n  --train_retriever     传入此参数以同时训练检索器和阅读器（默认值：False）\n  --use_file_passages   使用训练或评估 JSONL 文件中 \"passages\" 字段中的段落，而不是通过检索获取段落（默认值：False）\n  --retriever_n_context RETRIEVER_N_CONTEXT\n                        用于训练检索器的 top k 段落数量（默认值：5）\n  --gold_score_mode {evalnormsum,loop,ppmean,emdr,pdist,adist}\n                        训练检索器的方法。`pdist` 是论文中对 `ppmean` 的称呼。`adist` 是论文中对 `evalnormsum` 的称呼（默认值：ppmean）\n  --closed_book         不使用检索功能——退化为 T5 模型。如果设置了 n_context、n_context_retriever 和 encoder_format，则此选项会覆盖它们（默认值：False）\n  --temperature_score TEMPERATURE_SCORE\n                        检索器的 softmax 温度（默认值：0.01）\n  --temperature_gold TEMPERATURE_GOLD\n                        检索器蒸馏目标分布的 softmax 温度（默认值：0.01）\n  --compute_crossattention_stats\n  --filtering_overretrieve_ratio FILTERING_OVERRETRIEVE_RATIO\n                        如果进行过滤，先按此比例超额检索 topK，然后再过滤掉不需要的结果。很有用；只有在处理不需过滤检索结果的任务时才设为 1（默认值：2）\n  --freeze_retriever_steps FREEZE_RETRIEVER_STEPS\n                        冻结检索器 n 步（默认值：-1）\n  --query_side_retriever_training\n                        传入此参数以启用查询端的检索器微调（解绑 Contriever 编码器中段落和查询编码器的参数，并冻结段落编码器。有助于避免索引刷新。（默认值：False）\n  --retrieve_with_rerank\n                        传入此参数以启用使用全新段落编码器进行重新排序的检索（默认值：False）\n  --n_to_rerank_with_retrieve_with_rerank N_TO_RERANK_WITH_RETRIEVE_WITH_RERANK\n                        当传入 --retrieve_with_rerank 时，需要重新排序的段落数量。数值越高越慢但越准确。推荐 64-128（默认值：128）\n  --decoder_format DECODER_FORMAT\n                        解码器的格式。模型将按照该格式进行训练，评估也将采用与 decoder_prompt_format 选项相反的格式（默认值：无）\n  --decoder_prompt_format DECODER_PROMPT_FORMAT\n                        解码器提示的格式，例如“{query} 的答案是什么？”（默认值：无）\n  --encoder_format ENCODER_FORMAT\n                        阅读器编码器预处理的格式字符串（默认值：{query} 标题：{title} 上下文：{text}）\n  --retriever_format RETRIEVER_FORMAT\n                        检索器编码器预处理的格式字符串（默认值：{title} {text}）\n  --generation_max_length GENERATION_MAX_LENGTH\n  --generation_min_length GENERATION_MIN_LENGTH\n  --generation_length_penalty GENERATION_LENGTH_PENALTY\n  --generation_num_beams GENERATION_NUM_BEAMS\n  --task {base,mlm,lm,multiple_choice,kilt,section,fever,qa}\n                        模型执行的任务。用于设置预处理、检索过滤、评估等。（默认值：无）\n  --mlm_noise_density MLM_NOISE_DENSITY\n                        输入文本中应被掩码跨度遮盖的比例（默认值：0.15）\n  --mlm_mean_noise_span_length MLM_MEAN_NOISE_SPAN_LENGTH\n                        MLM 掩码跨度的平均长度（默认值：3）\n  --min_words_per_lm_instance MIN_WORDS_PER_LM_INSTANCE\n                        对于 MLM\u002FLM\u002FSection Generation，如果实例中的单词数少于 min_words_per_lm_instance，则会跳过该实例（默认值：无）\n  --min_lm_context_ratio MIN_LM_CONTEXT_RATIO\n                        将文本分为两个部分进行语言建模。“左半部分作为条件上下文，右半部分用于生成。”左半部分必须大于右半部分的 min_lm_context_ratio。\n                        （默认值：0.5）\n  --max_lm_context_ratio MAX_LM_CONTEXT_RATIO\n                        将文本分为两个部分进行语言建模。“左半部分作为条件上下文，右半部分用于生成。”左半部分必须小于右半部分的 max_lm_context_ratio。\n                        （默认值：0.5）\n  --qa_prompt_format QA_PROMPT_FORMAT\n                        当使用 --task qa 时，如何将问题格式化为输入提示（默认值：问题：{question} 答案：\u003Cextra_id_0>）\n  --multiple_choice_num_options MULTIPLE_CHOICE_NUM_OPTIONS\n                        多项选择问答中有多少个选项（例如 MMLU 是 4 个）（默认值：4）\n  --multiple_choice_train_permutations {single,cyclic,all}\n                        在多项选择任务（如 MMLU）训练时，是否使用答案顺序的排列组合。这可以改善结果，消除模型对任意答案顺序的偏好。建议使用 all。\n                        single：无排列；cyclic：循环排列；all：所有可能的答案排列组合。（默认值：single）\n  --multiple_choice_eval_permutations {single,cyclic,all}\n                        在多项选择任务（如 MMLU）评估时，是否使用答案顺序的排列组合。这可以改善结果，消除模型对任意答案顺序的偏好。最好使用 all，但速度很慢。\n                        cyclic 是一个不错的折中方案。single：无排列；cyclic：循环排列；all：所有可能的答案排列组合。（默认值：single）\n  --warmup_steps WARMUP_STEPS\n                        学习率预热步数（默认值：1000）\n  --total_steps TOTAL_STEPS\n                        总训练步数（默认值：1000）\n  --scheduler_steps SCHEDULER_STEPS\n                        调度器的总步数。如果未指定，则 scheduler_total_step = total_step。（默认值：无）\n  --accumulation_steps ACCUMULATION_STEPS\n                        梯度累积（默认值：1）\n  --dropout DROPOUT     掉落率（默认值：0.1）\n  --lr LR               学习率（默认值：0.0001）\n  --lr_retriever LR_RETRIEVER\n                        检索器的学习率（默认值：1e-05）\n  --clip CLIP           梯度裁剪（默认值：1.0）\n  --scheduler {linear,cosine,fixed}\n                        使用的学习率调度策略（默认值：cosine）\n  --weight_decay WEIGHT_DECAY\n                        训练中应用的权重衰减量（默认值：0.1）\n  --save_optimizer      传入此标志以在保存的检查点中保存优化器状态（默认值：False）\n  --epsilon EPSILON     AdamW 的 epsilon 值（默认值：1e-06）\n  --alpha ALPHA         AdamW 的 alpha 值（默认值：1.0）\n  --beta2 BETA2         AdamW 的 beta2 值（默认值：0.999）\n  --refresh_index REFRESH_INDEX\n                        索引刷新计划。格式：起始步-结束步：刷新频率，起始步-结束步：刷新频率。例如，--refresh_index 0-100:10,100-1000000:500 将在前 100 步中每 10 步刷新一次索引，然后从第 100 步到第 100 万步每 500 步刷新一次。对于固定计划，可以直接传入一个数字，例如 --refresh_index 100 将每 100 步刷新一次索引。传入 -1 表示永不刷新。（默认值：0-1000000:1000000）\n  --shuffle             在训练时打乱数据（默认值：False）\n  --precision {fp16,fp32,bf16}\n                        数值精度——如果可用，建议使用 bf16；fp16 在训练中可能不稳定（默认值：fp32）\n  --shard_optim         训练时的内存优化：使用分片数据并行将优化器状态分散到可用的 GPU 上，推荐用于大型模型（默认值：False）\n  --shard_grads         训练时的内存优化：使用分片数据并行将梯度分散到可用的 GPU 上，推荐用于大型模型（默认值：False）\n  --use_gradient_checkpoint_reader\n                        在阅读器中使用梯度检查点技术（默认值：False）\n  --use_gradient_checkpoint_retriever\n                        在检索器中使用梯度检查点技术（默认值：False）\n```\n\n\u003C\u002Fdetails>\n\n\n\n## 引用\n\n如需引用本工作，请使用以下 BibTeX 格式：\n```\n@article{izacard_few-shot_2022,\n\ttitle = {基于检索增强语言模型的少样本学习},\n\turl = {http:\u002F\u002Farxiv.org\u002Fabs\u002F2208.03299},\n\tpublisher = {arXiv},\n\tauthor = {Izacard, Gautier 和 Lewis, Patrick 和 Lomeli, Maria 和 Hosseini, Lucas 和 Petroni, Fabio 和 Schick, Timo 和 Dwivedi-Yu, Jane 和 Joulin, Armand 和 Riedel, Sebastian 和 Grave, Edouard},\n\tyear = {2022},\n}\n```\n\n## 许可证\n\n### 代码许可证：\n\nAtlas 项目的大部分代码采用 [CC-BY-NC](.\u002FLICENSE) 许可证，但项目中的部分组件则遵循单独的许可条款：Hugging Face Transformers 库采用 [Apache 2.0 许可证](https:\u002F\u002Fraw.githubusercontent.com\u002Fhuggingface\u002Ftransformers\u002Fmain\u002FLICENSE)，该许可证适用于 `src\u002Fmodeling_bert.py` 和 `src\u002Fmodeling_t5.py` 文件。\n\n### 数据许可证：\n\n仓库中使用的维基百科相关数据，例如通过 `download_corpus.py` 和 `download_index.py` 获取的语料库和索引，均依据 [CC-BY-SA](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-sa\u002F3.0\u002F) 许可证授权。","# Atlas 快速上手指南\n\n**注意**：本仓库已不再维护，代码仅作为研究用途提供（As-Is）。Atlas 是一个基于检索增强语言模型（RAG）的少样本学习框架，结合了稠密检索器（Contriever）和编码器-解码器语言模型（T5\u002FFiD）。\n\n## 环境准备\n\n### 系统要求\n*   **Python**: 3.8（测试版本）\n*   **GPU**: 支持 CUDA 11.3 的 NVIDIA GPU（推荐多卡环境以发挥分布式检索优势）\n\n### 前置依赖\n主要依赖库及测试版本如下：\n*   `pytorch` == 1.11.0\n*   `fairscale` == 0.4.6\n*   `transformers` == 4.18.0\n*   `numpy` == 1.22.4\n*   `faiss-gpu` == 1.7.2\n\n## 安装步骤\n\n推荐使用 `conda` 进行环境管理。请依次执行以下命令：\n\n```bash\n# 1. 克隆仓库\ngit clone https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fatlas.git\ncd atlas\n\n# 2. 创建并激活 conda 环境\nconda create --name atlas-env python=3.8\nconda activate atlas-env\n\n# 3. 安装 PyTorch (CUDA 11.3)\nconda install pytorch==1.11.0 cudatoolkit=11.3 -c pytorch\n\n# 4. 安装 FAISS GPU 版本\nconda install -c pytorch faiss-gpu=1.7.2 cudatoolkit=11.3\n\n# 5. 安装其他 Python 依赖\npip install -r requirements.txt\n```\n\n> **提示**：如果下载速度较慢，可考虑配置清华或中科大镜像源加速 `pip` 和 `conda` 安装过程。\n\n## 基本使用\n\n以下示例展示如何使用预训练的 **Atlas-large** 模型，在 **NaturalQuestions (NQ)** 数据集上进行少样本（64-shot）微调与评估。该示例假设使用 4 个节点，每个节点 8 张 GPU。\n\n### 1. 准备数据与模型\n\n首先下载必要的测试数据、维基百科语料库以及预训练模型。\n\n```bash\n# 设置数据目录和模型规模\nDATA_DIR=.\u002Fatlas_data\nSIZE=large \n\n# 下载 NQ 数据处理脚本所需的数据\npython preprocessing\u002Fprepare_qa.py --output_directory ${DATA_DIR}\u002Fdata\u002F\n\n# 下载 2018 年维基百科语料库\npython preprocessing\u002Fdownload_corpus.py --corpus corpora\u002Fwiki\u002Fenwiki-dec2018 --output_directory ${DATA_DIR} \n\n# 下载预训练的 Atlas-large 模型\npython preprocessing\u002Fdownload_model.py --model models\u002Fatlas\u002F${SIZE} --output_directory ${DATA_DIR}  \n```\n\n### 2. 启动训练与评估\n\n使用 `srun` (Slurm) 或 `torchrun` 启动分布式训练。以下命令配置了少样本微调的关键参数，包括仅微调查询编码器以节省资源、使用梯度检查点优化显存等。\n\n```bash\n# 随机选择一个端口\nport=$(shuf -i 15000-16000 -n 1)\n\n# 设置路径变量\nTRAIN_FILE=\"${DATA_DIR}\u002Fdata\u002Fnq_data\u002Ftrain.64-shot.jsonl\"\nEVAL_FILES=\"${DATA_DIR}\u002Fdata\u002Fnq_data\u002Fdev.jsonl\"\nSAVE_DIR=${DATA_DIR}\u002Fexperiments\u002F\nEXPERIMENT_NAME=my-nq-64-shot-example\nTRAIN_STEPS=30\n\n# 启动训练\nsrun python train.py \\\n    --shuffle \\\n    --train_retriever \\\n    --gold_score_mode pdist \\\n    --use_gradient_checkpoint_reader --use_gradient_checkpoint_retriever \\\n    --precision fp32 \\\n    --shard_optim --shard_grads \\\n    --temperature_gold 0.01 --temperature_score 0.01 \\\n    --refresh_index -1 \\\n    --query_side_retriever_training \\\n    --target_maxlength 16 \\\n    --reader_model_type google\u002Ft5-${SIZE}-lm-adapt \\\n    --dropout 0.1 --weight_decay 0.01 --lr 4e-5 --lr_retriever 4e-5 --scheduler linear \\\n    --text_maxlength 512 \\\n    --model_path \"${DATA_DIR}\u002Fmodels\u002Fatlas\u002F${SIZE}\" \\\n    --train_data \"${TRAIN_FILE}\" \\\n    --eval_data \"${EVAL_FILES}\" \\\n    --per_gpu_batch_size 1 \\\n    --n_context 40 \\\n    --retriever_n_context 40 \\\n    --name ${EXPERIMENT_NAME} \\\n    --checkpoint_dir ${SAVE_DIR} \\\n    --eval_freq ${TRAIN_STEPS} \\\n    --log_freq 4\n```\n\n### 关键参数说明\n*   `--model_path`: 预训练模型路径。若设为 `none`，将从原始的 T5 和 Contriever 初始化。\n*   `--train_data` \u002F `--eval_data`: 训练和评估数据的 JSONL 文件路径。\n*   `--n_context`: 传递给语言模型的顶部检索段落数量。\n*   `--query_side_retriever_training`: 在少样本场景下，仅微调 Contriever 的查询编码器通常效果更好且更高效。\n*   `--refresh_index -1`: 在少样本微调中，重新计算索引嵌入开销大且收益低，故禁用索引刷新。\n\n训练完成后，日志和模型检查点将保存在 `${SAVE_DIR}\u002F${EXPERIMENT_NAME}` 目录下。如需仅进行评估，可使用 `evaluate.py` 脚本并指向保存的检查点。","某金融科技公司的算法团队正在构建一个智能投研助手，需要让模型基于最新的上市公司财报和实时新闻，准确回答分析师提出的复杂事实性问题。\n\n### 没有 atlas 时\n- **知识滞后严重**：传统大模型依赖训练时的静态权重，无法获取最新发布的财报数据，导致回答过时甚至错误，必须频繁且昂贵地重新全量微调模型。\n- **小样本适应差**：面对特定垂直领域的专业问答，缺乏有效的少样本学习能力，需要收集成千上万条标注数据才能达到可用精度，冷启动成本极高。\n- **推理幻觉频发**：模型在缺乏确切依据时倾向于“编造”财务数据，且由于是黑盒生成，难以追溯信息来源，无法满足金融场景对可解释性和准确性的严苛要求。\n- **资源消耗巨大**：为了提升精度盲目扩大模型参数量（如使用千亿级参数模型），导致推理延迟高、算力成本高昂，难以在生产环境大规模部署。\n\n### 使用 atlas 后\n- **知识实时更新**：利用 Atlas 的检索增强机制，直接外挂最新的文档索引（如维基百科或内部财报库），无需重新训练即可让模型掌握最新信息，索引支持快速原地刷新。\n- **高效少样本学习**：凭借强大的检索增强能力，Atlas 仅用 64 个示例即可在自然问答任务上达到超过 45% 的准确率，大幅降低了对大规模标注数据的依赖。\n- **答案有据可依**：模型先检索相关段落再生成答案，显著减少幻觉。每个回答都能关联到具体的参考文档片段，便于人工核查与审计，提升了可信度。\n- **小模型高性能**：Atlas 以远少于超大模型的参数量（少 50 倍），在多项基准测试中超越了 540B 参数的模型，实现了更高的性价比和更快的推理速度。\n\n核心价值在于 Atlas 通过检索增强与少样本学习的结合，以极低的算力和数据成本，实现了具备实时知识更新能力且高可信度的专业问答系统。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_atlas_2a08f53c.png","facebookresearch","Meta Research","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Ffacebookresearch_449342bd.png","",null,"https:\u002F\u002Fopensource.fb.com","https:\u002F\u002Fgithub.com\u002Ffacebookresearch",[84,88],{"name":85,"color":86,"percentage":87},"Python","#3572A5",93.6,{"name":89,"color":90,"percentage":91},"Shell","#89e051",6.4,554,72,"2026-03-25T04:12:37","NOASSERTION",5,"Linux","必需 NVIDIA GPU。官方测试环境使用 CUDA 11.3。支持多卡分布式训练（示例为4节点每节点8卡）。显存需求取决于模型大小和批次大小，支持高达 110 亿参数模型的训练，建议使用梯度检查点和分片优化以节省显存。","未说明（但需足够加载大规模语料索引，支持高达 4 亿条向量索引，建议大内存）",{"notes":101,"python":102,"dependencies":103},"1. 代码库已不再维护，按原样提供研究代码。2. 强烈建议使用 conda 进行环境管理和依赖安装。3. 主要支持 T5 架构的编码器-解码器语言模型和 Contriever 架构的检索器。4. 数据文件需为 jsonlines 格式。5. 支持在训练循环中进行端到端的检索增强训练，并支持索引的就地刷新。","3.8",[104,105,106,107,108],"pytorch==1.11.0","fairscale==0.4.6","transformers==4.18.0","numpy==1.22.4","faiss-gpu==1.7.2",[26,13],"2026-03-27T02:49:30.150509","2026-04-06T07:12:56.757591",[113,118,123,128,133,138],{"id":114,"question_zh":115,"answer_zh":116,"source_url":117},11157,"ATLAS 模型对 GPU 显存有什么要求？如何在资源有限的情况下运行？","不同模型尺寸有不同的显存需求。参考配置如下：\n1. `base` 模型：可在配备 8 张 40GB 显存的 V100 GPU 上运行。\n2. `xxl` 模型：需使用配备 8 张 80GB 显存的 A100 GPU，并配合 `flat` 索引。\n\n为了降低单卡显存压力，建议使用多节点（例如每个节点 4 或 8 张 GPU），这样每个进程加载的嵌入向量更少。此外，微调过程需要加载完整的嵌入向量，这是显存消耗的主要原因。可以参考 Atlas 博客文章中关于不同 PQ 压缩尺寸的显存需求表，或使用 FAISS PQ 技术来减少内存需求。","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fatlas\u002Fissues\u002F7",{"id":119,"question_zh":120,"answer_zh":121,"source_url":122},11158,"运行 MLM 预训练脚本时出现 \"RuntimeError: einsum(): operands do not broadcast...\" 错误怎么办？","这是一个已知的代码问题，维护者已经修复并合并到了 master 分支。请拉取最新的代码更新即可解决该问题。","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fatlas\u002Fissues\u002F9",{"id":124,"question_zh":125,"answer_zh":126,"source_url":127},11159,"运行 NQ-64-shot 微调实验时报错 \"AttributeError: module 'torch.optim.adamw' has no attribute 'F'\" 如何解决？","这是由于 PyTorch v1.11 和 v1.12 之间函数式方法（functional methods）的细微变化导致的。解决方案是将 PyTorch 版本降级至 v1.11。","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fatlas\u002Fissues\u002F5",{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},11160,"如何在没有 Slurm 集群管理的普通个人机器上运行 ATLAS 代码？","可以使用 `torchrun` 在单机上运行。具体方法可以参考 Atlas 的博客文章。需要注意的是，官方仅在配备 8 张 80GB 显存 A100 GPU 的机器上测试过此配置。","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fatlas\u002Fissues\u002F21",{"id":134,"question_zh":135,"answer_zh":136,"source_url":137},11161,"如何获取 KILT 任务的数据集？","你可以从 KILT 官方仓库下载 KILT 数据：https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FKILT 。Atlas 会自动处理这些实例，将其转换为基于 input 字段的查询输入和基于 answer 字段的目标生成内容。","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fatlas\u002Fissues\u002F19",{"id":139,"question_zh":140,"answer_zh":141,"source_url":117},11162,"是否有计划发布更小版本的 ATLAS 模型以适配资源有限的用户？","虽然 11B 参数量的模型相比论文中的其他大语言模型较小，但对于资源有限的机器学习从业者来说仍然较大。目前主要通过使用较小的模型变体（如 base 模型）以及 FAISS PQ 压缩技术来降低内存需求。微调过程中由于需要加载完整索引，显存占用较高，建议通过多 GPU 分布式加载来分摊内存压力。",[]]