[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-google-research--tuning_playbook":3,"tool-google-research--tuning_playbook":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
# Deep Learning Tuning Playbook

*This is not an officially supported Google product.*

**Varun Godbole<sup>&dagger;</sup>, George E. Dahl<sup>&dagger;</sup>, Justin Gilmer<sup>&dagger;</sup>, Christopher J. Shallue<sup>&Dagger;</sup>, Zachary Nado<sup>&dagger;</sup>**
&dagger; Google Research, Brain Team

&Dagger; Harvard University

## Table of Contents

-   [Who is this document for?](#who-is-this-document-for)
-   [Why a tuning playbook?](#why-a-tuning-playbook)
-   [Guide for starting a new project](#guide-for-starting-a-new-project)
    -   [Choosing the model architecture](#choosing-the-model-architecture)
    -   [Choosing the optimizer](#choosing-the-optimizer)
    -   [Choosing the batch size](#choosing-the-batch-size)
    -   [Choosing the initial configuration](#choosing-the-initial-configuration)
-   [A scientific approach to improving model performance](#a-scientific-approach-to-improving-model-performance)
    -   [The incremental tuning strategy](#the-incremental-tuning-strategy)
    -   [Exploration vs exploitation](#exploration-vs-exploitation)
    -   [Choosing the goal for the next round of experiments](#choosing-the-goal-for-the-next-round-of-experiments)
    -   [Designing the next round of experiments](#designing-the-next-round-of-experiments)
    -   [Determining whether to adopt a training pipeline change or hyperparameter configuration](#determining-whether-to-adopt-a-training-pipeline-change-or-hyperparameter-configuration)
    -   [After exploration concludes](#after-exploration-concludes)
-   [Determining the number of steps for each training run](#determining-the-number-of-steps-for-each-training-run)
    -   [Deciding how long to train when training is not compute-bound](#deciding-how-long-to-train-when-training-is-not-compute-bound)
    -   [Deciding how long to train when training is compute-bound](#deciding-how-long-to-train-when-training-is-compute-bound)
-   [Additional guidance for the training pipeline](#additional-guidance-for-the-training-pipeline)
    -   [Optimizing the input pipeline](#optimizing-the-input-pipeline)
    -   [Evaluating model performance](#evaluating-model-performance)
    -   [Saving checkpoints and retrospectively selecting the best checkpoint](#saving-checkpoints-and-retrospectively-selecting-the-best-checkpoint)
    -   [Setting up experiment tracking](#setting-up-experiment-tracking)
    -   [Batch normalization implementation details](#batch-normalization-implementation-details)
    -   [Considerations for multi-host pipelines](#considerations-for-multi-host-pipelines)
-   [FAQs](#faqs)
-   [Acknowledgments](#acknowledgments)
-   [Citing](#citing)
-   [Contributing](#contributing)

## Who is this document for?

This document is for engineers and researchers (both individuals and teams) interested in **maximizing the performance of deep learning models**. We assume basic knowledge of machine learning and deep learning concepts.

Our emphasis is on the **process of hyperparameter tuning**. We touch on other aspects of deep learning training, such as pipeline implementation and optimization, but our treatment of those aspects is not intended to be complete.

We assume the machine learning problem is a supervised learning problem or something that looks a lot like one (e.g. self-supervised). That said, some of the prescriptions in this document may also apply to other types of problems.

## Why a tuning playbook?

Currently, there is an astonishing amount of toil and guesswork involved in actually getting deep neural networks to work well in practice.
Even worse, the actual recipes people use to get good results with deep learning are rarely documented. Papers gloss over the process that led to their final results in order to present a cleaner story, and machine learning engineers working on commercial problems rarely have time to take a step back and generalize their process. Textbooks tend to eschew practical guidance and prioritize fundamental principles, even if their authors have the necessary experience in applied work to provide useful advice. When preparing to create this document, we couldn't find any comprehensive attempt to actually explain *how to get good results with deep learning*. Instead, we found snippets of advice in blog posts and on social media, tricks peeking out of the appendix of research papers, occasional case studies about one particular project or pipeline, and a lot of confusion. There is a vast gulf between the results achieved by deep learning experts and less skilled practitioners using superficially similar methods. At the same time, these very experts readily admit some of what they do might not be well-justified. As deep learning matures and has a larger impact on the world, the community needs more resources covering useful recipes, including all the practical details that can be so critical for obtaining good results.

We are a team of five researchers and engineers who have worked in deep learning for many years, some of us since as early as 2006. We have applied deep learning to problems in everything from speech recognition to astronomy, and learned a lot along the way. This document grew out of our own experience training neural networks, teaching new machine learning engineers, and advising our colleagues on the practice of deep learning. Although it has been gratifying to see deep learning go from a machine learning approach practiced by a handful of academic labs to a technology powering products used by billions of people, deep learning is still in its infancy as an engineering discipline and we hope this document encourages others to help systematize the field's experimental protocols.

This document came about as we tried to crystallize our own approach to deep learning and thus it represents the opinions of the authors at the time of writing, not any sort of objective truth. Our own struggles with hyperparameter tuning made it a particular focus of our guidance, but we also cover other important issues we have encountered in our work (or seen go wrong). Our intention is for this work to be a living document that grows and evolves as our beliefs change. For example, the material on debugging and mitigating training failures would not have been possible for us to write two years ago since it is based on recent results and ongoing investigations. Inevitably, some of our advice will need to be updated to account for new results and improved workflows. We do not know the *optimal* deep learning recipe, but until the community starts writing down and debating different procedures, we cannot hope to find it. To that end, we would encourage readers who find issues with our advice to produce alternative recommendations, along with convincing evidence, so we can update the playbook. We would also love to see alternative guides and playbooks that might have different recommendations so we can work towards best practices as a community. Finally, any sections marked with a 🤖 emoji are places we would like to do more research.
Only after trying to write this playbook did it become completely clear how many interesting and neglected research questions can be found in the deep learning practitioner's workflow.

## Guide for starting a new project

Many of the decisions we make over the course of tuning can be made once at the beginning of a project and only occasionally revisited when circumstances change.

Our guidance below makes the following assumptions:

-   Enough of the essential work of problem formulation, data cleaning, etc. has already been done that spending time on the model architecture and training configuration makes sense.
-   There is already a pipeline set up that does training and evaluation, and it is easy to execute training and prediction jobs for various models of interest.
-   The appropriate metrics have been selected and implemented. These should be as representative as possible of what would be measured in the deployed environment.

### Choosing the model architecture

***Summary:*** *When starting a new project, try to reuse a model that already works.*

-   Choose a well established, commonly used model architecture to get working first. It is always possible to build a custom model later.
-   Model architectures typically have various hyperparameters that determine the model's size and other details (e.g. number of layers, layer width, type of activation function).
    -   Thus, choosing the architecture really means choosing a family of different models (one for each setting of the model hyperparameters).
    -   We will consider the problem of choosing the model hyperparameters in [Choosing the initial configuration](#choosing-the-initial-configuration) and [A scientific approach to improving model performance](#a-scientific-approach-to-improving-model-performance).
-   When possible, try to find a paper that tackles something as close as possible to the problem at hand and reproduce that model as a starting point.

### Choosing the optimizer

***Summary:*** *Start with the most popular optimizer for the type of problem at hand.*

-   No optimizer is the "best" across all types of machine learning problems and model architectures. Even just [comparing the performance of optimizers is a difficult task](https://arxiv.org/abs/1910.05446). 🤖
-   We recommend sticking with well-established, popular optimizers, especially when starting a new project.
    -   Ideally, choose the most popular optimizer used for the same type of problem.
-   Be prepared to give attention to ***all*** hyperparameters of the chosen optimizer.
    -   Optimizers with more hyperparameters may require more tuning effort to find the best configuration.
    -   This is particularly relevant in the beginning stages of a project when we are trying to find the best values of various other hyperparameters (e.g. architecture hyperparameters) while treating optimizer hyperparameters as [nuisance parameters](#identifying-scientific-nuisance-and-fixed-hyperparameters).
    -   It may be preferable to start with a simpler optimizer (e.g. SGD with fixed momentum or Adam with fixed $\epsilon$, $\beta_{1}$, and $\beta_{2}$) in the initial stages of the project and switch to a more general optimizer later.
-   Well-established optimizers that we like include (but are not limited to):
    -   [SGD with momentum](#what-are-the-update-rules-for-all-the-popular-optimization-algorithms) (we like the Nesterov variant)
    -   [Adam and NAdam](#what-are-the-update-rules-for-all-the-popular-optimization-algorithms), which are more general than SGD with momentum. Note that Adam has 4 tunable hyperparameters [and they can all matter](https://arxiv.org/abs/1910.05446)!
        -   See [How should Adam's hyperparameters be tuned?](#how-should-adams-hyperparameters-be-tuned)
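To make those knobs concrete, here is a minimal sketch in PyTorch (an assumption of this example, not something the playbook prescribes) that spells out every tunable hyperparameter of the two optimizer families recommended above; the numeric values are illustrative defaults, not recommendations:

```python
import torch

model = torch.nn.Linear(64, 10)  # stand-in for a real model

# SGD with (Nesterov) momentum: two optimizer hyperparameters.
sgd = torch.optim.SGD(
    model.parameters(),
    lr=0.1,          # learning rate
    momentum=0.9,    # momentum coefficient
    nesterov=True,   # the variant we like
)

# Adam: four tunable hyperparameters, all of which can matter.
adam = torch.optim.Adam(
    model.parameters(),
    lr=1e-3,             # learning rate
    betas=(0.9, 0.999),  # beta1, beta2
    eps=1e-8,            # epsilon
)
```

On the JAX side, `optax.adam(learning_rate, b1=..., b2=..., eps=...)` exposes the same four knobs.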
### Choosing the batch size

***Summary:*** *The batch size governs the training speed and shouldn't be used to directly tune the validation set performance. Often, the ideal batch size will be the largest batch size supported by the available hardware.*

-   The batch size is a key factor in determining the *training time* and *computing resource consumption*.
-   Increasing the batch size will often reduce the training time. This can be highly beneficial because it, e.g.:
    -   Allows hyperparameters to be tuned more thoroughly within a fixed time interval, potentially resulting in a better final model.
    -   Reduces the latency of the development cycle, allowing new ideas to be tested more frequently.
-   Increasing the batch size may either decrease, increase, or not change the resource consumption.
-   The batch size should *not* be treated as a tunable hyperparameter for validation set performance.
    -   As long as all hyperparameters are well-tuned (especially the learning rate and regularization hyperparameters) and the number of training steps is sufficient, the same final performance should be attainable using any batch size (see [Shallue et al. 2018](https://arxiv.org/abs/1811.03600)).
    -   Please see [Why shouldn't the batch size be tuned to directly improve validation set performance?](#why-shouldnt-the-batch-size-be-tuned-to-directly-improve-validation-set-performance)

#### Determining the feasible batch sizes and estimating training throughput

<details><summary><em>[Click to expand]</em></summary>

<br>

-   For a given model and optimizer, there will typically be a range of batch sizes supported by the available hardware. The limiting factor is usually accelerator memory.
-   Unfortunately, it can be difficult to calculate which batch sizes will fit in memory without running, or at least compiling, the full training program.
-   The easiest solution is usually to run training jobs at different batch sizes (e.g. increasing powers of 2) for a small number of steps until one of the jobs exceeds the available memory.
-   For each batch size, we should train for long enough to get a reliable estimate of the *training throughput*

<p align="center">training throughput = (# examples processed per second)</p>

<p align="center">or, equivalently, the <em>time per step</em>.</p>

<p align="center">time per step = (batch size) / (training throughput)</p>

-   When the accelerators aren't yet saturated, if the batch size doubles, the training throughput should also double (or at least nearly double). Equivalently, the time per step should be constant (or at least nearly constant) as the batch size increases.
-   If this is not the case then the training pipeline has a bottleneck such as I/O or synchronization between compute nodes. This may be worth diagnosing and correcting before proceeding.
-   If the training throughput increases only up to some maximum batch size, then we should only consider batch sizes up to that maximum batch size, even if a larger batch size is supported by the hardware.
    -   All benefits of using a larger batch size assume the training throughput increases. If it doesn't, fix the bottleneck or use the smaller batch size.
    -   **Gradient accumulation** simulates a larger batch size than the hardware can support and therefore does not provide any throughput benefits. It should generally be avoided in applied work.
-   These steps may need to be repeated every time the model or optimizer is changed (e.g. a different model architecture may allow a larger batch size to fit in memory).

</details>
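The sweep described above is easy to automate. Below is a minimal sketch, assuming a hypothetical `train_step(batch_size)` function and Python's standard timer; substitute your framework's out-of-memory exception for `MemoryError`:

```python
import time

def measure_time_per_step(train_step, batch_size, num_steps=50, warmup=10):
    """Times `train_step` at a fixed batch size after a short warmup."""
    for _ in range(warmup):  # exclude compilation and cache effects
        train_step(batch_size)
    start = time.perf_counter()
    for _ in range(num_steps):
        train_step(batch_size)
    return (time.perf_counter() - start) / num_steps

def sweep_batch_sizes(train_step, start=32):
    """Doubles the batch size until out-of-memory, reporting throughput."""
    results = {}
    batch_size = start
    while True:
        try:
            t = measure_time_per_step(train_step, batch_size)
        except MemoryError:  # substitute the framework's OOM exception
            break
        results[batch_size] = {
            "time_per_step_s": t,
            "throughput_examples_per_s": batch_size / t,
        }
        batch_size *= 2
    return results

if __name__ == "__main__":
    def fake_train_step(batch_size):
        if batch_size > 4096:   # pretend memory runs out here
            raise MemoryError
        time.sleep(1e-4)        # pretend work with constant time per step
    print(sweep_batch_sizes(fake_train_step))
```

If the reported time per step stays roughly constant as the batch size doubles, the accelerators are not yet saturated; if throughput plateaus early, look for the bottleneck before relying on larger batches.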
#### Choosing the batch size to minimize training time

<details><summary><em>[Click to expand]</em></summary>

<br>

<p align="center">Training time = (time per step) x (total number of steps)</p>

-   We can often consider the time per step to be approximately constant for all feasible batch sizes. This is true when there is no overhead from parallel computations and all training bottlenecks have been diagnosed and corrected (see the [previous section](#determining-the-feasible-batch-sizes-and-estimating-training-throughput) for how to identify training bottlenecks). In practice, there is usually at least some overhead from increasing the batch size.
-   As the batch size increases, the total number of steps needed to reach a fixed performance goal typically decreases (provided all relevant hyperparameters are re-tuned when the batch size is changed; [Shallue et al. 2018](https://arxiv.org/abs/1811.03600)).
    -   E.g. doubling the batch size might halve the total number of steps required. This is called **perfect scaling**.
    -   Perfect scaling holds for all batch sizes up to a critical batch size, beyond which one achieves diminishing returns.
    -   Eventually, increasing the batch size no longer reduces the number of training steps (but never increases it).
-   Therefore, the batch size that minimizes training time is usually the largest batch size that still provides a reduction in the number of training steps required.
    -   This batch size depends on the dataset, model, and optimizer, and it is an open problem how to calculate it other than finding it experimentally for every new problem. 🤖
    -   When comparing batch sizes, beware the distinction between an example budget/[epoch](https://developers.google.com/machine-learning/glossary#epoch) budget (running all experiments while fixing the number of training example presentations) and a step budget (running all experiments with the number of training steps fixed).
        -   Comparing batch sizes with an epoch budget only probes the perfect scaling regime, even when larger batch sizes might still provide a meaningful speedup by reducing the number of training steps required.
    -   Often, the largest batch size supported by the available hardware will be smaller than the critical batch size. Therefore, a good rule of thumb (without running any experiments) is to use the largest batch size possible.
-   There is no point in using a larger batch size if it ends up increasing the training time.

</details>
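A small worked example of the training-time formula, with made-up measurements for time per step and steps-to-target (note how the step counts stop halving past the critical batch size):

```python
# Hypothetical measurements: time per step (seconds) and the number of
# steps needed to reach a fixed validation target at each batch size.
time_per_step = {256: 0.10, 512: 0.11, 1024: 0.12, 2048: 0.13}
steps_to_target = {256: 400_000, 512: 200_000, 1024: 120_000, 2048: 110_000}

training_time = {b: time_per_step[b] * steps_to_target[b] for b in time_per_step}
best = min(training_time, key=training_time.get)
print(training_time)  # {256: 40000.0, 512: 22000.0, 1024: 14400.0, 2048: 14300.0}
print(best)           # 2048 in this made-up example
```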
#### Choosing the batch size to minimize resource consumption

<details><summary><em>[Click to expand]</em></summary>

<br>

-   There are two types of resource costs associated with increasing the batch size:
    1.  *Upfront costs*, e.g. purchasing new hardware or rewriting the training pipeline to implement multi-GPU / multi-TPU training.
    2.  *Usage costs*, e.g. billing against the team's resource budgets, billing from a cloud provider, electricity / maintenance costs.
-   If there are significant upfront costs to increasing the batch size, it might be better to defer increasing the batch size until the project has matured and it is easier to assess the cost-benefit tradeoff. Implementing multi-host parallel training programs can introduce [bugs](#considerations-for-multi-host-pipelines) and [subtle issues](#batch-normalization-implementation-details) so it is probably better to start off with a simpler pipeline anyway. (On the other hand, a large speedup in training time might be very beneficial early in the process when a lot of tuning experiments are needed).
-   We refer to the total usage cost (which may include multiple different kinds of costs) as the "resource consumption". We can break down the resource consumption into the following components:

<p align="center">Resource consumption = (resource consumption per step) x (total number of steps)</p>

-   Increasing the batch size usually allows us to [reduce the total number of steps](#choosing-the-batch-size-to-minimize-training-time). Whether the resource consumption increases or decreases will depend on how the consumption per step changes.
    -   Increasing the batch size might *decrease* the resource consumption. For example, if each step with the larger batch size can be run on the same hardware as the smaller batch size (with only a small increase in time per step), then any increase in the resource consumption per step might be outweighed by the decrease in the number of steps.
    -   Increasing the batch size might *not change* the resource consumption. For example, if doubling the batch size halves the number of steps required and doubles the number of GPUs used, the total consumption (in terms of GPU-hours) will not change.
    -   Increasing the batch size might *increase* the resource consumption. For example, if increasing the batch size requires upgraded hardware, the increase in consumption per step might outweigh the reduction in the number of steps.

</details>
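The "no change" case above is simple arithmetic; a hypothetical illustration:

```python
# Doubling the batch size doubles the GPUs (doubling per-step cost in
# GPU-seconds) but halves the number of steps, so GPU-hours are unchanged.
# All numbers here are made up for illustration.
def gpu_hours(gpus, seconds_per_step, steps):
    return gpus * seconds_per_step * steps / 3600

print(gpu_hours(gpus=8, seconds_per_step=0.5, steps=200_000))   # ~222.2
print(gpu_hours(gpus=16, seconds_per_step=0.5, steps=100_000))  # ~222.2
```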
#### Changing the batch size requires re-tuning most hyperparameters

<details><summary><em>[Click to expand]</em></summary>

<br>

-   The optimal values of most hyperparameters are sensitive to the batch size. Therefore, changing the batch size typically requires starting the tuning process all over again.
-   The hyperparameters that interact most strongly with the batch size, and therefore are most important to tune separately for each batch size, are the optimizer hyperparameters (e.g. learning rate, momentum) and the regularization hyperparameters.
-   Keep this in mind when choosing the batch size at the start of a project. If you need to switch to a different batch size later on, it might be difficult, time consuming, and expensive to re-tune everything for the new batch size.

</details>

#### How batch norm interacts with the batch size

<details><summary><em>[Click to expand]</em></summary>

<br>

-   Batch norm is complicated and, in general, should use a different batch size than the gradient computation to compute statistics. See the [batch norm section](#batch-normalization-implementation-details) for a detailed discussion.

</details>

### Choosing the initial configuration

-   Before beginning hyperparameter tuning we must determine the starting point. This includes specifying (1) the model configuration (e.g. number of layers), (2) the optimizer hyperparameters (e.g. learning rate), and (3) the number of training steps.
-   Determining this initial configuration will require some manually configured training runs and trial-and-error.
-   Our guiding principle is to find a simple, relatively fast, relatively low-resource-consumption configuration that obtains a "reasonable" result.
    -   "Simple" means avoiding bells and whistles wherever possible; these can always be added later. Even if bells and whistles prove helpful down the road, adding them in the initial configuration risks wasting time tuning unhelpful features and/or baking in unnecessary complications.
        -   For example, start with a constant learning rate before adding fancy decay schedules.
    -   Choosing an initial configuration that is fast and consumes minimal resources will make hyperparameter tuning much more efficient.
        -   For example, start with a smaller model.
    -   "Reasonable" performance depends on the problem, but at minimum means that the trained model performs much better than random chance on the validation set (although it might be bad enough to not be worth deploying).
-   Choosing the number of training steps involves balancing the following tension:
    -   On the one hand, training for more steps can improve performance and makes hyperparameter tuning easier (see [Shallue et al. 2018](https://arxiv.org/abs/1811.03600)).
    -   On the other hand, training for fewer steps means that each training run is faster and uses fewer resources, boosting tuning efficiency by reducing the time between cycles and allowing more experiments to be run in parallel. Moreover, if an unnecessarily large step budget is chosen initially, it might be hard to change it down the road, e.g. once the learning rate schedule is tuned for that number of steps.
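To make "simple, fast, low-resource" concrete, the shape such a starting point might take is sketched below; every name and value is a hypothetical placeholder to be revised through tuning, not a recommendation:

```python
# A hypothetical starting point in the spirit of this section: a small
# model, a constant learning rate, and a modest step budget.
initial_config = {
    # (1) model configuration: keep it small at first
    "num_layers": 4,
    "hidden_width": 256,
    # (2) optimizer hyperparameters: constant learning rate, no schedule yet
    "optimizer": "sgd_nesterov",
    "learning_rate": 0.1,
    "momentum": 0.9,
    # (3) number of training steps: small enough for fast iteration
    "train_steps": 10_000,
}
```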
## A scientific approach to improving model performance

For the purposes of this document, the ultimate goal of machine learning development is to maximize the utility of the deployed model. Even though many aspects of the development process differ between applications (e.g. length of time, available computing resources, type of model), we can typically use the same basic steps and principles on any problem.

Our guidance below makes the following assumptions:

-   There is already a fully-running training pipeline along with a configuration that obtains a reasonable result.
-   There are enough computational resources available to conduct meaningful tuning experiments and run at least several training jobs in parallel.

### The incremental tuning strategy

***Summary:*** *Start with a simple configuration and incrementally make improvements while building up insight into the problem. Make sure that any improvement is based on strong evidence to avoid adding unnecessary complexity.*

-   Our ultimate goal is to find a configuration that maximizes the performance of our model.
    -   In some cases, our goal will be to maximize how much we can improve the model by a fixed deadline (e.g. submitting to a competition).
    -   In other cases, we want to keep improving the model indefinitely (e.g. continually improving a model used in production).
-   In principle, we could maximize performance by using an algorithm to automatically search the entire space of possible configurations, but this is not a practical option.
    -   The space of possible configurations is extremely large and there are not yet any algorithms sophisticated enough to efficiently search this space without human guidance.
-   Most automated search algorithms rely on a hand-designed *search space* that defines the set of configurations to search in, and these search spaces can matter quite a bit.
-   The most effective way to maximize performance is to start with a simple configuration and incrementally add features and make improvements while building up insight into the problem.
    -   We use automated search algorithms in each round of tuning and continually update our search spaces as our understanding grows.
-   As we explore, we will naturally find better and better configurations and therefore our "best" model will continually improve.
    -   We call it a *launch* when we update our best configuration (which may or may not correspond to an actual launch of a production model).
    -   For each launch, we must make sure that the change is based on strong evidence – not just random chance based on a lucky configuration – so that we don't add unnecessary complexity to the training pipeline.

At a high level, our incremental tuning strategy involves repeating the following four steps:

1.  Identify an appropriately-scoped goal for the next round of experiments.
2.  Design and run a set of experiments that makes progress towards this goal.
3.  Learn what we can from the results.
4.  Consider whether to launch the new best configuration.

The remainder of this section will consider this strategy in much greater detail.
### Exploration vs exploitation

***Summary:*** *Most of the time, our primary goal is to gain insight into the problem.*

-   Although one might think we would spend most of our time trying to maximize performance on the validation set, in practice we spend the majority of our time trying to gain insight into the problem, and comparatively little time greedily focused on the validation error.
    -   In other words, we spend most of our time on "exploration" and only a small amount on "exploitation".
-   In the long run, understanding the problem is critical if we want to maximize our final performance. Prioritizing insight over short term gains can help us:
    -   Avoid launching unnecessary changes that happened to be present in well-performing runs merely through historical accident.
    -   Identify which hyperparameters the validation error is most sensitive to, which hyperparameters interact the most and therefore need to be re-tuned together, and which hyperparameters are relatively insensitive to other changes and can therefore be fixed in future experiments.
    -   Suggest potential new features to try, such as new regularizers if overfitting is an issue.
    -   Identify features that don't help and therefore can be removed, reducing the complexity of future experiments.
    -   Recognize when improvements from hyperparameter tuning have likely saturated.
    -   Narrow our search spaces around the optimal value to improve tuning efficiency.
-   When we are eventually ready to be greedy, we can focus purely on the validation error even if the experiments aren't maximally informative about the structure of the tuning problem.

### Choosing the goal for the next round of experiments

***Summary:*** *Each round of experiments should have a clear goal and be sufficiently narrow in scope that the experiments can actually make progress towards the goal.*

-   Each round of experiments should have a clear goal and be sufficiently narrow in scope that the experiments can actually make progress towards the goal: if we try to add multiple features or answer multiple questions at once, we may not be able to disentangle the separate effects on the results.
-   Example goals include:
    -   Try a potential improvement to the pipeline (e.g. a new regularizer, preprocessing choice, etc.).
    -   Understand the impact of a particular model hyperparameter (e.g. the activation function).
    -   Greedily minimize validation error.

### Designing the next round of experiments

***Summary:*** *Identify which hyperparameters are scientific, nuisance, and fixed hyperparameters for the experimental goal. Create a sequence of studies to compare different values of the scientific hyperparameters while optimizing over the nuisance hyperparameters. Choose the search space of nuisance hyperparameters to balance resource costs with scientific value.*

#### Identifying scientific, nuisance, and fixed hyperparameters

<details><summary><em>[Click to expand]</em></summary>

<br>

-   For a given goal, all hyperparameters will be either **scientific hyperparameters**, **nuisance hyperparameters**, or **fixed hyperparameters**.
    -   Scientific hyperparameters are those whose effect on the model's performance we're trying to measure.
    -   Nuisance hyperparameters are those that need to be optimized over in order to fairly compare different values of the scientific hyperparameters. This is similar to the statistical concept of [nuisance parameters](https://en.wikipedia.org/wiki/Nuisance_parameter).
    -   Fixed hyperparameters will have their values fixed in the current round of experiments. These are hyperparameters whose values do not need to (or we do not want them to) change when comparing different values of the scientific hyperparameters.
        -   By fixing certain hyperparameters for a set of experiments, we must accept that conclusions derived from the experiments might not be valid for other settings of the fixed hyperparameters. In other words, fixed hyperparameters create caveats for any conclusions we draw from the experiments.
-   For example, if our goal is to "determine whether a model with more hidden layers will reduce validation error", then the number of hidden layers is a scientific hyperparameter.
    -   The learning rate is a nuisance hyperparameter because we can only fairly compare models with different numbers of hidden layers if the learning rate is tuned separately for each number of layers (the optimal learning rate generally depends on the model architecture).
    -   The activation function could be a fixed hyperparameter if we have determined in prior experiments that the best choice of activation function is not sensitive to model depth, or if we are willing to limit our conclusions about the number of hidden layers to only cover this specific choice of activation function. Alternatively, it could be a nuisance parameter if we are prepared to tune it separately for each number of hidden layers.
-   Whether a particular hyperparameter is a scientific hyperparameter, nuisance hyperparameter, or fixed hyperparameter is not inherent to that hyperparameter, but changes depending on the experimental goal.
    -   For example, the choice of activation function could be a scientific hyperparameter (is ReLU or tanh a better choice for our problem?), a nuisance hyperparameter (is the best 5-layer model better than the best 6-layer model when we allow several different possible activation functions?), or a fixed hyperparameter (for ReLU nets, does adding batch normalization in a particular position help?).
-   When designing a new round of experiments, we first identify the scientific hyperparameters for our experimental goal.
    -   At this stage, we consider all other hyperparameters to be nuisance hyperparameters.
-   Next, we convert some of the nuisance hyperparameters into fixed hyperparameters.
    -   With limitless resources, we would leave all non-scientific hyperparameters as nuisance hyperparameters so that the conclusions we draw from our experiments are free from caveats about fixed hyperparameter values.
    -   However, the more nuisance hyperparameters we attempt to tune, the greater the risk we fail to tune them sufficiently well for each setting of the scientific hyperparameters and end up reaching the wrong conclusions from our experiments.
        -   As described [below](#striking-a-balance-between-informative-and-affordable-experiments), we could counter this risk by increasing the computational budget, but often our maximum resource budget is less than would be needed to tune over all non-scientific hyperparameters.
    -   We choose to convert a nuisance hyperparameter into a fixed hyperparameter when, in our judgment, the caveats introduced by fixing it are less burdensome than the cost of including it as a nuisance hyperparameter.
        -   The more a given nuisance hyperparameter interacts with the scientific hyperparameters, the more damaging it is to fix its value. For example, the best value of the weight decay strength typically depends on the model size, so comparing different model sizes assuming a single specific value of the weight decay would not be very insightful.
-   Although the type we assign to each hyperparameter depends on the experimental goal, we have the following rules of thumb for certain categories of hyperparameters:
    -   Of the various optimizer hyperparameters (e.g. the learning rate, momentum, learning rate schedule parameters, Adam betas etc.), at least some of them will be nuisance hyperparameters because they tend to interact the most with other changes.
        -   They are rarely scientific hyperparameters because a goal like "what is the best learning rate for the current pipeline?" doesn't give much insight – the best setting could easily change with the next pipeline change anyway.
        -   Although we might fix some of them occasionally due to resource constraints or when we have particularly strong evidence that they don't interact with the scientific parameters, we should generally assume that optimizer hyperparameters must be tuned separately to make fair comparisons between different settings of the scientific hyperparameters, and thus shouldn't be fixed.
            -   Furthermore, we have no *a priori* reason to prefer one optimizer hyperparameter value over another (e.g. they don't usually affect the computational cost of forward passes or gradients in any way).
    -   In contrast, the *choice* of optimizer is typically a scientific hyperparameter or fixed hyperparameter.
        -   It is a scientific hyperparameter if our experimental goal involves making fair comparisons between two or more different optimizers (e.g. "determine which optimizer produces the lowest validation error in a given number of steps").
\"determine which optimizer produces the lowest validation\n            error in a given number of steps\").\n        -   Alternatively, we might make it a fixed hyperparameter for a variety\n            of reasons, including (1) prior experiments make us believe that the\n            best optimizer for our problem is not sensitive to current\n            scientific hyperparameters; and\u002For (2) we prefer to compare values\n            of the scientific hyperparameters using this optimizer because its\n            training curves are easier to reason about; and\u002For (3) we prefer to\n            use this optimizer because it uses less memory than the\n            alternatives.\n    -   Hyperparameters introduced by a regularization technique are typically\n        nuisance hyperparameters, but whether or not we include the\n        regularization technique at all is a scientific or fixed hyperparameter.\n        -   For example, dropout adds code complexity, so when deciding whether\n            to include it we would make \"no dropout\" vs \"dropout\" a scientific\n            hyperparameter and the dropout rate a nuisance hyperparameter.\n            -   If we decide to add dropout to our pipeline based on this\n                experiment, then the dropout rate would be a nuisance\n                hyperparameter in future experiments.\n    -   Architectural hyperparameters are often scientific or fixed\n        hyperparameters because architecture changes can affect serving and\n        training costs, latency, and memory requirements.\n        -   For example, the number of layers is typically a scientific or fixed\n            hyperparameter since it tends to have dramatic consequences for\n            training speed and memory usage.\n-   In some cases, the sets of nuisance and fixed hyperparameters will depend on\n    the values of the scientific hyperparameters.\n    -   For example, suppose we are trying to determine which optimizer out of\n        Nesterov momentum and Adam results in the lowest validation error. The\n        scientific hyperparameter is the `optimizer`, which takes values\n        `{\"Nesterov_momentum\", \"Adam\"}`. The value\n        `optimizer=\"Nesterov_momentum\"` introduces the nuisance\u002Ffixed\n        hyperparameters `{learning_rate, momentum}`, but the value\n        `optimizer=\"Adam\"` introduces the nuisance\u002Ffixed hyperparameters\n        `{learning_rate, beta1, beta2, epsilon}`.\n    -   Hyperparameters that are only present for certain values of the\n        scientific hyperparameters are called **conditional hyperparameters**.\n    -   We should not assume two conditional hyperparameters are the same just\n        because they have the same name! In the above example, the conditional\n        hyperparameter called `learning_rate` is a *different* hyperparameter\n        for `optimizer=\"Nesterov_momentum\"` versus `optimizer=\"Adam\"`. 
#### Creating a set of studies

<details><summary><em>[Click to expand]</em></summary>

<br>

-   Once we have identified the scientific and nuisance hyperparameters, we design a "study" or sequence of studies to make progress towards the experimental goal.
    -   A study specifies a set of hyperparameter configurations to be run for subsequent analysis. Each configuration is called a "trial".
    -   Creating a study typically involves choosing the hyperparameters that will vary across trials, choosing what values those hyperparameters can take on (the "search space"), choosing the number of trials, and choosing an automated search algorithm to sample that many trials from the search space. Alternatively, we could create a study by specifying the set of hyperparameter configurations manually.
-   The purpose of the studies is to run the pipeline with different values of the scientific hyperparameters, while at the same time **"optimizing away"** (or "optimizing over") the nuisance hyperparameters so that comparisons between different values of the scientific hyperparameters are as fair as possible.
-   In the simplest case, we would make a separate study for each configuration of the scientific parameters, where each study tunes over the nuisance hyperparameters.
    -   For example, if our goal is to select the best optimizer out of Nesterov momentum and Adam, we could create one study in which `optimizer="Nesterov_momentum"` and the nuisance hyperparameters are `{learning_rate, momentum}`, and another study in which `optimizer="Adam"` and the nuisance hyperparameters are `{learning_rate, beta1, beta2, epsilon}`. We would compare the two optimizers by selecting the best performing trial from each study.
    -   We can use any gradient-free optimization algorithm, including methods such as Bayesian optimization or evolutionary algorithms, to optimize over the nuisance hyperparameters, although [we prefer](#why-use-quasi-random-search-instead-of-more-sophisticated-black-box-optimization-algorithms-during-the-exploration-phase-of-tuning) to use quasi-random search in the [exploration phase](#exploration-vs-exploitation) of tuning because of a variety of advantages it has in this setting. [After exploration concludes](#after-exploration-concludes), if state-of-the-art Bayesian optimization software is available, that is our preferred choice.
-   In the more complicated case where we want to compare a large number of values of the scientific hyperparameters and it is impractical to make that many independent studies, we can include the scientific parameters in the same search space as the nuisance hyperparameters and use a search algorithm to sample values of *both* the scientific and nuisance hyperparameters in a single study.
    -   When taking this approach, conditional hyperparameters can cause problems since it is hard to specify a search space unless the set of nuisance hyperparameters is the same for all values of the scientific hyperparameters.
    -   In this case, [our preference](#why-use-quasi-random-search-instead-of-more-sophisticated-black-box-optimization-algorithms-during-the-exploration-phase-of-tuning) for using quasi-random search over fancier black-box optimization tools is even stronger, since it ensures that we obtain a relatively uniform sampling of values of the scientific hyperparameters. Regardless of the search algorithm, we need to make sure somehow that it searches the scientific parameters uniformly.

</details>
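As a concrete illustration of the simplest case, the sketch below builds one quasi-random study per optimizer and compares the best trial from each. It assumes SciPy's `scipy.stats.qmc` module for the low-discrepancy sampler and uses a stub in place of real training; for simplicity every parameter is sampled log-uniformly:

```python
import numpy as np
from scipy.stats import qmc  # low-discrepancy (quasi-random) sequences

def run_trial(config):
    """Stand-in for real training; returns a fake validation error."""
    rng = np.random.default_rng(abs(hash(frozenset(config.items()))) % 2**32)
    return float(rng.uniform(0.1, 1.0))

def make_study(optimizer, space, num_trials, seed=0):
    """Samples quasi-random trials from per-parameter (low, high) ranges.

    For simplicity every parameter is sampled log-uniformly here.
    """
    names = list(space)
    lows = np.log10([space[n][0] for n in names])
    highs = np.log10([space[n][1] for n in names])
    unit = qmc.Halton(d=len(names), seed=seed).random(num_trials)
    points = 10.0 ** qmc.scale(unit, lows, highs)
    return [dict(zip(names, p), optimizer=optimizer) for p in points]

# One study per scientific configuration; ranges are illustrative.
studies = {
    "Nesterov_momentum": make_study(
        "Nesterov_momentum",
        {"learning_rate": (1e-3, 1.0), "momentum": (0.8, 0.99)},
        num_trials=64,
    ),
    "Adam": make_study(
        "Adam",
        {"learning_rate": (1e-5, 1e-2), "beta1": (0.8, 0.99),
         "beta2": (0.9, 0.9999), "epsilon": (1e-10, 1e-6)},
        num_trials=64,
        seed=1,
    ),
}

# Compare the two optimizers by the best trial from each study.
best = {name: min(map(run_trial, trials)) for name, trials in studies.items()}
print(best)
```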
#### Striking a balance between informative and affordable experiments

<details><summary><em>[Click to expand]</em></summary>

<br>

-   When designing a study or sequence of studies, we need to allocate a limited budget in order to adequately achieve the following three desiderata:
    1.  Comparing enough different values of the scientific hyperparameters.
    2.  Tuning the nuisance hyperparameters over a large enough search space.
    3.  Sampling the search space of nuisance hyperparameters densely enough.
-   The better we can achieve these three desiderata, the more insight we can extract from our experiment.
    -   Comparing as many values of the scientific hyperparameters as possible broadens the scope of the insights we gain from the experiment.
    -   Including as many nuisance hyperparameters as possible and allowing each nuisance hyperparameter to vary over as wide a range as possible increases our confidence that a "good" value of the nuisance hyperparameters **exists** in the search space for each configuration of the scientific hyperparameters.
        -   Otherwise, we might make unfair comparisons between values of the scientific hyperparameters by not searching possible regions of the nuisance parameter space where better values might lie for some values of the scientific parameters.
    -   Sampling the search space of nuisance hyperparameters as densely as possible increases our confidence that any good settings for the nuisance hyperparameters that happen to exist in our search space will be found by the search procedure.
        -   Otherwise, we might make unfair comparisons between values of the scientific parameters due to some values getting luckier with the sampling of the nuisance hyperparameters.
-   Unfortunately, improvements in *any* of these three dimensions require either increasing the number of trials, and therefore increasing the resource cost, or finding a way to save resources in one of the other dimensions.
    -   Every problem has its own idiosyncrasies and computational constraints, so how to allocate resources across these three desiderata requires some level of domain knowledge.
    -   After running a study, we always try to get a sense of whether the study tuned the nuisance hyperparameters well enough (i.e. searched a large enough space extensively enough) to fairly compare the scientific hyperparameters (as described in greater detail [below](#extracting-insight-from-experimental-results)).

</details>
### Extracting insight from experimental results

***Summary:*** *In addition to trying to achieve the original scientific goal of each group of experiments, go through a checklist of additional questions and, if issues are discovered, revise the experiments and rerun them.*

-   Ultimately, each group of experiments has a specific goal and we want to evaluate the evidence the experiments provide toward that goal.
    -   However, if we ask the right questions, we will often find issues that need to be corrected before a given set of experiments can make much progress towards their original goal.
        -   If we don't ask these questions, we may draw incorrect conclusions.
    -   Since running experiments can be expensive, we also want to take the opportunity to extract other useful insights from each group of experiments, even if these insights are not immediately relevant to the current goal.
-   Before analyzing a given set of experiments to make progress toward their original goal, we should ask ourselves the following additional questions:
    -   [Is the search space large enough?](#identifying-bad-search-space-boundaries)
        -   If the optimal point from a study is near the boundary of the search space in one or more dimensions, the search is probably not wide enough. In this case, we should run another study with an expanded search space.
    -   [Have we sampled enough points from the search space?](#not-sampling-enough-points-in-the-search-space)
        -   If not, run more points or be less ambitious in the tuning goals.
    -   What fraction of the trials in each study are **infeasible** (i.e. trials that diverge, get really bad loss values, or fail to run at all because they violate some implicit constraint)?
        -   When a very large fraction of points in a study are **infeasible** we should try to adjust the search space to avoid sampling such points, which sometimes requires reparameterizing the search space.
        -   In some cases, a large number of infeasible points can indicate a bug in the training code.
    -   [Does the model exhibit optimization issues?](#how-can-optimization-failures-be-debugged-and-mitigated)
    -   [What can we learn from the training curves of the best trials?](#examining-the-training-curves)
        -   For example, do the best trials have training curves consistent with problematic overfitting?
-   If necessary, based on the answers to the questions above, refine the most recent study (or group of studies) to improve the search space and/or sample more trials, or take some other corrective action.
-   Once we have answered the above questions, we can move on to evaluating the evidence the experiments provide towards our original goal (for example, [evaluating whether a change is useful](#detecting-whether-a-change-is-useful-with-isolation-plots)).
#### Identifying bad search space boundaries

<details><summary><em>[Click to expand]</em></summary>

<br>

-   A search space is suspicious if the best point sampled from it is close to its boundary. We might find an even better point if we expanded the search range in that direction.
-   To check search space boundaries, we like to plot completed trials on what we call **basic hyperparameter axis plots** where we plot the validation objective value versus one of the hyperparameters (e.g. learning rate). Each point on the plot corresponds to a single trial.
    -   The validation objective value for each trial should usually be the best value it achieved over the course of training.

<p align="center" id="figure-1">
    <img src="https://oss.gittoolsai.com/images/google-research_tuning_playbook_readme_125f971c0a71.png" width="49%" alt="Example of bad search space boundaries">
<img src="https://oss.gittoolsai.com/images/google-research_tuning_playbook_readme_b81cb982cf68.png" width="49%" alt="Example of good search space boundaries">
</p>

<p align="center"><b>Figure 1:</b> Examples of bad search space boundaries and acceptable search space boundaries.</p>

-   The plots in [Figure 1](#figure-1) show the error rate (lower is better) against the initial learning rate.
-   If the best points cluster towards the edge of a search space (in some dimension), then the search space boundaries might need to be expanded until the best observed point is no longer close to the boundary.
-   Often, a study will include "infeasible" trials that diverge or get very bad results (marked with red Xs in the above plots).
    -   If all trials are infeasible for learning rates greater than some threshold value, and if the best performing trials have learning rates at the edge of that region, the model [may suffer from stability issues preventing it from accessing higher learning rates](#how-can-optimization-failures-be-debugged-and-mitigated).

</details>
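A basic hyperparameter axis plot takes only a few lines; the sketch below uses matplotlib with made-up trial data, plus a crude check for a best point near the boundary (the half-decade threshold is an arbitrary illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical completed trials: (learning rate, best validation error).
lrs = np.array([1e-4, 3e-4, 1e-3, 3e-3, 1e-2, 3e-2, 1e-1])
errors = np.array([0.31, 0.27, 0.24, 0.22, 0.21, 0.23, 0.28])

# Basic hyperparameter axis plot: one point per trial.
plt.scatter(lrs, errors)
plt.xscale("log")
plt.xlabel("learning rate")
plt.ylabel("validation error (best over training)")
plt.title("Basic hyperparameter axis plot")
plt.show()

# Crude boundary check: is the best trial within half a decade of either
# edge of the searched range? If so, consider expanding the search space.
best_lr = lrs[np.argmin(errors)]
near_low = np.log10(best_lr) - np.log10(lrs.min()) < 0.5
near_high = np.log10(lrs.max()) - np.log10(best_lr) < 0.5
print("consider expanding the search space:", bool(near_low or near_high))
```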
</details>

#### Not sampling enough points in the search space

<details><summary><em>[Click to expand]</em></summary>

<br>

-   In general, [it can be very difficult to know](#how-many-trials-are-needed-to-get-good-results-with-quasi-random-search) if the search space has been sampled densely enough. 🤖
-   Running more trials is of course better, but comes at an obvious cost.
-   Since it is so hard to know when we have sampled enough, we usually sample what we can afford and try to calibrate our intuitive confidence from repeatedly looking at various hyperparameter axis plots and trying to get a sense of how many points are in the "good" region of the search space.

</details>

#### Examining the training curves

<details><summary><em>[Click to expand]</em></summary>

<br>

***Summary:*** *Examining the training curves is an easy way to identify common failure modes and can help us prioritize what actions to take next.*

-   Although in many cases the primary objective of our experiments only requires considering the validation error of each trial, we must be careful when reducing each trial to a single number because it can hide important details about what’s going on below the surface.
-   For every study, we always look at the **training curves** (training error and validation error plotted versus training step over the duration of training) of at least the best few trials.
-   Even if this is not necessary for addressing the primary experimental objective, examining the training curves is an easy way to identify common failure modes and can help us prioritize what actions to take next.
-   When examining the training curves, we are interested in the following questions.
-   Are any of the trials exhibiting **problematic overfitting?**
    -   Problematic overfitting occurs when the validation error starts *increasing* at some point during training.
    -   In experimental settings where we optimize away nuisance hyperparameters by selecting the "best" trial for each setting of the scientific hyperparameters, we should check for problematic overfitting in *at least* each of the best trials corresponding to the settings of the scientific hyperparameters that we’re comparing.
        -   If any of the best trials exhibits problematic overfitting, we usually want to re-run the experiment with additional regularization techniques and/or better tune the existing regularization parameters before comparing the values of the scientific hyperparameters.
            -   This may not apply if the scientific hyperparameters include regularization parameters, since then it would not be surprising if low-strength settings of those regularization parameters resulted in problematic overfitting.
        -   Reducing overfitting is often straightforward using common regularization techniques that add minimal code complexity or extra computation (e.g. dropout, label smoothing, weight decay), so it’s usually no big deal to add one or more of these to the next round of experiments.
        -   For example, if the scientific hyperparameter is "number of hidden layers" and the best trial that uses the largest number of hidden layers exhibited problematic overfitting, then we would usually prefer to try it again with additional regularization instead of immediately selecting the smaller number of hidden layers.
        -   Even if none of the "best" trials are exhibiting problematic overfitting, there might still be a problem if it occurs in *any* of the trials.
            -   Selecting the best trial suppresses configurations exhibiting problematic overfitting and favors those that do not. In other words, it will favor configurations with more regularization.
            -   However, anything that makes training worse can act as a regularizer, even if it wasn't intended that way. For example, choosing a smaller learning rate can regularize training by hobbling the optimization process, but we typically don't want to choose the learning rate this way.
            -   So we must be aware that the "best" trial for each setting of the scientific hyperparameters might be selected in such a way that favors "bad" values of some of the scientific or nuisance hyperparameters.
-   Is there high step-to-step variance in the training or validation error late in training?
    -   If so, this could interfere with our ability to compare different values of the scientific hyperparameters (since each trial randomly ends on a "lucky" or "unlucky" step) and our ability to reproduce the result of the best trial in production (since the production model might not end on the same "lucky" step as in the study).
    -   The most likely causes of step-to-step variance are batch variance (from randomly sampling examples from the training set for each batch), small validation sets, and using a learning rate that’s too high late in training.
    -   Possible remedies include increasing the batch size, obtaining more validation data, using learning rate decay, or using Polyak averaging.
-   Are the trials still improving at the end of training?
    -   If so, this indicates that we are in the ["compute bound" regime](#determining-the-number-of-steps-for-each-training-run) and we may benefit from [increasing the number of training steps](#deciding-how-long-to-train-when-training-is-compute-bound) or changing the learning rate schedule.
-   Has performance on the training and validation sets saturated long before the final training step?
    -   If so, this indicates that we are in the ["not compute-bound"](#determining-the-number-of-steps-for-each-training-run) regime and that we may be able to [decrease the number of training steps](#deciding-how-long-to-train-when-training-is-not-compute-bound).
-   Although we cannot enumerate them all, there are many other additional behaviors that can become evident from examining the training curves (e.g. training loss *increasing* during training usually indicates a bug in the training pipeline).
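Several of these checks are easy to automate once training curves are logged. Below is a small sketch, assuming each trial's validation error is available as an array indexed by evaluation step; the thresholds and window sizes are illustrative assumptions that would need tuning per workload:

```python
import numpy as np

def problematic_overfitting(valid_error, tolerance=0.005):
    """Flag a curve whose validation error rises noticeably after its minimum."""
    best_step = int(np.argmin(valid_error))
    return bool(valid_error[-1] > valid_error[best_step] + tolerance)

def still_improving(valid_error, window=5, tolerance=1e-3):
    """Flag a curve still improving at the end (possible compute-bound regime)."""
    recent = valid_error[-window:]
    return bool(recent[0] - recent[-1] > tolerance)

# Toy curve: the validation error bottoms out and then creeps back up.
curve = np.array([0.40, 0.31, 0.26, 0.24, 0.235, 0.238, 0.245, 0.252])
print(problematic_overfitting(curve))  # True: rises after its minimum
print(still_improving(curve))          # False: no longer improving
```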
</details>

#### Detecting whether a change is useful with isolation plots

<details><summary><em>[Click to expand]</em></summary>

<br>

<p align="center" id="figure-2">
<img src="https://oss.gittoolsai.com/images/google-research_tuning_playbook_readme_3e7ce3a4ed0d.png" width="49%" alt="Isolation plot that investigates the best value of weight decay for ResNet-50 trained on ImageNet.">
</p>

<p align="center"><b>Figure 2:</b> Isolation plot that investigates the best value of weight decay for ResNet-50 trained on ImageNet.</p>

-   Often, the goal of a set of experiments is to compare different values of a scientific hyperparameter.
    -   For example, we may want to determine the value of weight decay that results in the best validation error.
-   An **isolation plot** is a special case of the basic hyperparameter axis plot. Each point on an isolation plot corresponds to the performance of the *best* trial across some (or all) of the nuisance hyperparameters.
    -   In other words, we plot the model performance after "optimizing away" the nuisance hyperparameters.
-   An isolation plot makes it easier to perform an apples-to-apples comparison between different values of the scientific hyperparameter.
-   For example, [Figure 2](#figure-2) reveals the value of weight decay that produces the best validation performance for a particular configuration of ResNet-50 trained on ImageNet.
    -   If our goal is to determine whether to include weight decay at all, then we would compare the best point from this plot against the baseline of no weight decay. For a fair comparison, the baseline should also have its learning rate equally well tuned.
-   When we have data generated by (quasi)random search and are considering a continuous hyperparameter for an isolation plot, we can approximate the isolation plot by bucketing the x-axis values of the basic hyperparameter axis plot and taking the best trial in each vertical slice defined by the buckets.
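Here is a small sketch of that bucketing approximation, assuming trials are available as (weight decay, best validation error) pairs; the bucket edges are illustrative and log-spaced to match a log-uniform search space:

```python
import numpy as np

# Each row: (weight decay, best validation error achieved by that trial).
trials = np.array([
    (1e-5, 0.26), (4e-5, 0.25), (2e-4, 0.235), (5e-4, 0.232),
    (1e-3, 0.238), (4e-3, 0.25), (9e-3, 0.27), (3e-4, 0.236),
])

edges = np.logspace(-5, -2, num=7)  # vertical slices over weight decay
wd, err = trials[:, 0], trials[:, 1]
for lo, hi in zip(edges[:-1], edges[1:]):
    in_bucket = (wd >= lo) & (wd < hi)
    if in_bucket.any():
        # The best trial in each slice approximates "optimizing away" the
        # nuisance hyperparameters at this value of the scientific one.
        print(f"[{lo:.0e}, {hi:.0e}): best error {err[in_bucket].min():.3f}")
```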
</details>

#### Automate generically useful plots

<details><summary><em>[Click to expand]</em></summary>

<br>

-   The more effort it takes to generate plots, the less likely we are to look at them as much as we should, so it behooves us to set up our infrastructure to automatically produce as many of them as possible.
-   At a minimum, we automatically generate basic hyperparameter axis plots for all hyperparameters that we vary in an experiment.
-   Additionally, we automatically produce training curves for all trials and make it as easy as possible to find the best few trials of each study and examine their training curves.
-   There are many other potentially useful plots and visualizations we can add. Although the ones described above are a good starting point, to paraphrase Geoffrey Hinton, "Every time you plot something new, you learn something new."

</details>

### Determining whether to adopt a training pipeline change or hyperparameter configuration

***Summary:*** *When deciding whether to make a change to our model or training procedure or adopt a new hyperparameter configuration going forward, we need to be aware of the different sources of variation in our results.*

-   When we are trying to improve our model, we might observe that a particular candidate change initially achieves a better validation error compared to our incumbent configuration, but find that after repeating the experiment there is no consistent advantage. Informally, we can group the most important sources of variation that might cause such an inconsistent result into the following broad categories:
    -   **Training procedure variance**, **retrain variance**, or **trial variance**: the variation we see between training runs that use the same hyperparameters, but different random seeds.
        -   For example, different random initializations, training data shuffles, dropout masks, patterns of data augmentation operations, and orderings of parallel arithmetic operations are all potential sources of trial variance.
    -   **Hyperparameter search variance**, or **study variance**: the variation in results caused by our procedure to select the hyperparameters.
        -   For example, we might run the same experiment with a particular search space, but with two different seeds for quasi-random search, and end up selecting different hyperparameter values.
    -   **Data collection and sampling variance**: the variance from any sort of random split into training, validation, and test data, or variance due to the training data generation process more generally.
-   It is all well and good to make comparisons of validation error rates estimated on a finite validation set using fastidious statistical tests, but often the trial variance alone can produce statistically significant differences between two different trained models that use the same hyperparameter settings.
-   We are most concerned about study variance when trying to make conclusions that go beyond the level of an individual point in hyperparameter space.
    -   The study variance depends on the number of trials and the search space, and we have seen cases where it is larger than the trial variance as well as cases where it is much smaller.
-   Therefore, before adopting a candidate change, consider running the best trial N times to characterize the run-to-run trial variance.
    -   Usually, we can get away with only recharacterizing the trial variance after major changes to the pipeline, but in some applications we might need fresher estimates.
    -   In other applications, characterizing the trial variance is too costly to be worth it.
-   At the end of the day, although we only want to adopt changes (including new hyperparameter configurations) that produce real improvements, demanding complete certainty that something helps isn't the right answer either.
-   Therefore, if a new hyperparameter point (or other change) gets a better result than the baseline (taking into account the retrain variance of both the new point and the baseline as best we can), then we probably should adopt it as the new baseline for future comparisons.
    -   However, we should only adopt changes that produce improvements that outweigh any complexity they add.
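As one possible way to operationalize this, the sketch below reruns the candidate and the baseline with several seeds and requires the improvement to exceed the pooled run-to-run noise; `train_and_eval` is a hypothetical stand-in for the actual training pipeline, and the margin rule is a judgment call rather than a rigorous statistical test:

```python
import statistics

def retrain_variance(config, train_and_eval, n_seeds=5):
    """Rerun one configuration with several seeds; return (mean, std) error."""
    errors = [train_and_eval(config, seed=s) for s in range(n_seeds)]
    return statistics.mean(errors), statistics.stdev(errors)

def probably_better(candidate, baseline, train_and_eval, n_seeds=5):
    cand_mean, cand_std = retrain_variance(candidate, train_and_eval, n_seeds)
    base_mean, base_std = retrain_variance(baseline, train_and_eval, n_seeds)
    # Require the improvement to exceed the pooled run-to-run noise; the
    # exact margin is an illustrative choice, not a statistical guarantee.
    margin = (cand_std**2 + base_std**2) ** 0.5
    return (base_mean - cand_mean) > margin
```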
### After exploration concludes

***Summary:*** *Bayesian optimization tools are a compelling option once we’re done exploring for good search spaces and have decided which hyperparameters should be tuned at all.*

-   At some point, our priorities will shift from learning more about the tuning problem to producing a single best configuration to launch or otherwise use.
-   At this point, there should be a refined search space that comfortably contains the local region around the best observed trial and has been adequately sampled.
-   Our exploration work should have revealed the most essential hyperparameters to tune (as well as sensible ranges for them) that we can use to construct a search space for a final automated tuning study using as large a tuning budget as possible.
-   Since we no longer care about maximizing our insight into the tuning problem, many of [the advantages of quasi-random search](#why-use-quasi-random-search-instead-of-more-sophisticated-black-box-optimization-algorithms-during-the-exploration-phase-of-tuning) no longer apply and Bayesian optimization tools should be used to automatically find the best hyperparameter configuration.
    -   [Open-Source Vizier](https://github.com/google/vizier) implements a variety of sophisticated algorithms for tuning ML models, including Bayesian optimization algorithms.
    -   If the search space contains a non-trivial volume of divergent points (points that get NaN training loss or even training loss many standard deviations worse than the mean), it is important to use black box optimization tools that properly handle trials that diverge (see [Bayesian Optimization with Unknown Constraints](https://arxiv.org/abs/1403.5607) for an excellent way to deal with this issue). [Open-Source Vizier](https://github.com/google/vizier) has support for divergent points by marking trials as infeasible, although it may not use our preferred approach from [Gelbart et al.](https://arxiv.org/abs/1403.5607), depending on how it is configured.
-   At this point, we should also consider checking the performance on the test set.
    -   In principle, we could even fold the validation set into the training set and retrain the best configuration found with Bayesian optimization. However, this is only appropriate if there won't be future launches with this specific workload (e.g. a one-time Kaggle competition).
## Determining the number of steps for each training run

-   There are two types of workloads: those that are compute-bound and those that are not.
-   When training is **compute-bound**, training is limited by how long we are willing to wait and not by how much training data we have or some other factor.
    -   In this case, if we can somehow train longer or more efficiently, we should see a lower training loss and, with proper tuning, an improved validation loss.
    -   In other words, *speeding up* training is equivalent to *improving* training, and the "optimal" training time is always "as long as we can afford."
    -   That said, just because a workload is compute-limited doesn't mean training longer/faster is the only way to improve results.
-   When training is **not compute-bound**, we can afford to train as long as we would like to, and, at some point, training longer doesn't help much (or even causes problematic overfitting).
    -   In this case, we should expect to be able to train to very low training loss, to the point where training longer might slightly reduce the training loss but will not meaningfully reduce the validation loss.
    -   Particularly when training is not compute-bound, a more generous training time budget can make tuning easier, especially when tuning learning rate decay schedules, since they have a particularly strong interaction with the training budget.
        -   In other words, very stingy training time budgets might require a learning rate decay schedule tuned to perfection in order to achieve a good error rate.
-   Regardless of whether a given workload is compute-bound or not, methods that increase the variance of the gradients (across batches) will usually result in slower training progress, and thus may increase the number of training steps required to reach a particular validation loss. High gradient variance can be caused by:
    -   Using a smaller batch size
    -   Adding data augmentation
    -   Adding some types of regularization (e.g. dropout)

### Deciding how long to train when training is *not* compute-bound

-   Our main goal is to ensure we are training long enough for the model to reach the best possible result, while avoiding being overly wasteful in the number of training steps.
-   When in doubt, err on the side of training longer. Performance should never degrade when training longer, assuming retrospective (optimal) checkpoint selection is used properly and checkpoints are frequent enough.
-   Never tune the `max_train_steps` number in a study. Pick a value and use it for all trials. From these trials, plot the training step that retrospective checkpoint selection finds in order to refine the choice of `max_train_steps`.
    -   For example, if the best step is always during the first 10% of training, then the maximum number of steps is way too high.
    -   Alternatively, if the best step is consistently in the last 25% of training, we might benefit from training longer and re-tuning the decay schedule.
-   The ideal number of training steps can change when the architecture or data changes (e.g. adding data augmentation).
-   Below we describe how to pick an initial candidate value for `max_train_steps` based on the number of steps necessary to "perfectly fit" the training set using a constant learning rate.
    -   Note, we are not using the phrase "perfectly fit the training set" in a precise or mathematically well-defined way. It is merely meant as an informal descriptor to indicate a very low training loss.
        -   For example, when training with the log loss, absent regularization terms, we might see the training loss keep slowly improving until we reach floating point limits as the network weights grow without bound and the predictions of the model on the training set become increasingly confident. In this case, we might say the model "perfectly fit" the training set around the time the misclassification error reached zero on the training set.
    -   The starting value for `max_train_steps` we find may need to be increased if the amount of gradient noise in the training procedure increases.
        -   For example, if data augmentation or regularizers like dropout are introduced to the model.
    -   It may be possible to decrease `max_train_steps` if the training process improves somehow.
        -   For example, with a better tuned optimizer or a better tuned learning rate schedule.

#### Algorithm for picking an initial candidate for max_train_steps using a learning rate sweep

<details><summary><em>[Click to expand]</em></summary>

<br>

-   This procedure assumes it is possible to not only "perfectly" fit the training set, but to do so using a constant learning rate schedule.
-   If it is possible to perfectly fit the entire training set, then there must exist a configuration (with some value of `max_train_steps`) that perfectly fits the training set; find any such configuration and use its value of `max_train_steps` as a starting point `N`.
-   Run a constant learning rate sweep (i.e. grid search the learning rate) without data augmentation and without regularization where each trial trains for `N` steps.
-   The number of steps required for the fastest trial in the sweep to reach perfect training performance is our initial guess for `max_train_steps`.
-   **NOTE:** Bad search spaces can make it possible to engage in self-deception.
    -   For example, if all the learning rates in a study are too small, we might incorrectly conclude that a very large value of `max_train_steps` is necessary.
    -   At a minimum, we should check that the optimal learning rate in the study is not at the boundary of the search space.
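A sketch of this sweep, assuming a hypothetical `train(lr, max_steps)` helper that trains with a constant learning rate (no augmentation or regularization) and returns the step at which the training misclassification error first reached zero (or `None` if it never did):

```python
def initial_max_train_steps(train, learning_rates, n_steps):
    """Initial candidate for max_train_steps from a constant-LR sweep."""
    steps_to_fit = [train(lr, max_steps=n_steps) for lr in learning_rates]
    steps_to_fit = [s for s in steps_to_fit if s is not None]
    if not steps_to_fit:
        raise ValueError("No trial perfectly fit the training set; "
                         "increase n_steps or widen the sweep.")
    # The fastest trial to "perfectly fit" the training set gives the
    # initial candidate; check separately that the best learning rate is
    # not at the boundary of the sweep.
    return min(steps_to_fit)

# Toy stand-in: pretend higher learning rates fit faster, up to the budget.
def fake_train(lr, max_steps):
    needed = int(10 / lr)
    return needed if needed <= max_steps else None

print(initial_max_train_steps(fake_train, [1e-4, 1e-3, 1e-2], n_steps=100_000))
# -> 1000
```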
</details>

### Deciding how long to train when training is compute-bound

-   In some cases, training loss keeps improving indefinitely and our patience and computational resources become the limiting factors.
-   If training loss (or even validation loss) keeps improving indefinitely, should we always train as long as we can afford? Not necessarily.
    -   We might be able to tune more effectively by running a larger number of shorter experiments and reserving the longest "production length" runs for the models we hope to launch.
    -   As the training time for trials approaches our patience limit, tuning experiments become more relevant for our potential launch candidates, but we can complete fewer of them.
    -   There are probably many questions we can answer while only training for ~10% of the production length, but there is always a risk that our conclusions at this time limit will not apply to experiments at 20% of the production length, let alone 100%.
-   Tuning in multiple rounds with increasing per-trial training step limits is a sensible approach.
    -   We can do as many rounds as we want, but usually 1-3 are the most practical.
    -   Essentially, try to obtain as much understanding of the problem as possible using trials with a very quick turnaround time, trading off tuning thoroughness with relevance to the final, longest runs.
    -   Once a given per-trial time limit has generated useful insights, we can increase the training time and continue tuning, double-checking our conclusions from the shorter runs as needed.
-   As a starting point, we recommend two rounds of tuning:
    -   Round 1: Shorter runs to find good model and optimizer hyperparameters.
    -   Round 2: Very few long runs on good hyperparameter points to get the final model.
-   The biggest question going from `Round i` &rarr; `Round i+1` is how to adjust learning rate decay schedules.
    -   One common pitfall when adjusting learning rate schedules between rounds is using all the extra training steps with too small of a learning rate.

#### Round 1

<details><summary><em>[Click to expand]</em></summary>

<br>

-   Unfortunately, there is no guarantee that good hyperparameters found in short, incomplete training are still good choices when training length is significantly increased. However, for some kinds of hyperparameters, they are often correlated enough for Round 1 to be useful.
-   What hyperparameter values found in shorter runs do we expect to transfer to longer training runs? For all of this, we need more research. But based on what we know so far, here are the authors’ suspicions in order of decreasing probability of transferring:
    -   Very likely to transfer
        -   Early training instability can be resolved in the first round of tuning using a smaller number of training steps. Perhaps these hyperparameters are the closest thing to a sure bet for transfer that we have.
            -   Warmup length
            -   Initialization
    -   Likely to transfer
        -   Model architecture - A dramatic win in the model architecture will usually transfer, but there are probably many counterexamples.
    -   Might transfer
        -   Optimization algorithm/optimizer hyperparameters - We think this would "loosely" transfer. It’s definitely weaker than the things above it.
        -   Data augmentation
        -   Regularization
            -   If it isn't possible to perfectly fit the training set, the model might be in a regime where regularization is unlikely to help very much.
    -   Unlikely to transfer
        -   Learning rate schedule: unlikely to transfer perfectly.
            -   [This paper](https://arxiv.org/abs/2203.15556) suggests that even the decay schedule transfers, but we don't believe this is true in general. Example: tuning a sqrt decay schedule on a small number of training steps and then extending it to a large number of steps will result in the majority of training occurring at overly small step sizes.
                -   Most schedules can likely do "well enough" in the limit of an extreme training budget, but noticeable performance improvements are likely if the schedule is tuned.
            -   [Understanding Short-Horizon Bias in Stochastic Meta-Optimization](https://arxiv.org/abs/1803.02021) describes the dangers of trying to pick learning rates myopically.

</details>

#### Round 2

<details><summary><em>[Click to expand]</em></summary>

<br>

-   Run the best hyperparameter configuration from Round 1.
-   **(Speculation)** 🤖 Use the extra steps to extend the period of training at a high learning rate (see the sketch at the end of this section).
    -   E.g. for a linear schedule, keep the length of the decay phase fixed from Round 1 and extend the period of constant learning rate at the beginning.
    -   For cosine decay, just keep the base learning rate from Round 1 and extend `max_train_steps` as in the [Chinchilla paper](https://arxiv.org/abs/2203.15556).
-   More rounds might make sense for teams with very mature modeling and tuning pipelines and very long and expensive production training runs, but they will often be overkill.
    -   We've described how to transfer from Round 1 &rarr; Round 2. If we didn't care about analysis time and if making efficient use of compute was the overriding concern, then the ideal would be to exponentially increase the length of training runs (and thus the end-to-end time to complete a study) over many different rounds of tuning.
        -   At each round we systematically ensure our choices continue to hold up.
        -   New ideas go through a pipeline that progressively derisks them using increasingly long-running experiments from Round i to Round i+1.
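Since the bullet above is explicitly speculation, the sketch below should be read as one plausible way to extend a linear schedule between rounds; all names, defaults, and the exact schedule shape are illustrative assumptions:

```python
def extended_linear_schedule(step, base_lr, round1_steps, round2_steps,
                             decay_fraction=0.5, final_lr_fraction=0.01):
    """Extend a Round-1 linear schedule to a longer Round-2 run.

    The decay phase keeps its Round-1 length; all extra steps extend the
    constant high-learning-rate period at the beginning.
    """
    decay_steps = int(round1_steps * decay_fraction)  # fixed from Round 1
    constant_steps = round2_steps - decay_steps       # absorbs extra budget
    if step < constant_steps:
        return base_lr
    frac = (step - constant_steps) / decay_steps      # linear decay at the end
    return base_lr * (1 - (1 - final_lr_fraction) * frac)
```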
</details>

## Additional guidance for the training pipeline

### Optimizing the input pipeline

***Summary:*** *The causes and interventions of input-bound pipelines are highly task-dependent; use a profiler and look out for common issues.*

-   Use an appropriate profiler to diagnose input-bound pipelines. For example, [Perfetto](https://jax.readthedocs.io/en/latest/profiling.html) for JAX or [TensorFlow profiler](https://www.tensorflow.org/guide/profiler) for TensorFlow.
-   Ultimately, the specific causes and interventions will be highly task-dependent. Broader engineering considerations (e.g. minimizing disk footprint) may warrant worse input pipeline performance.
-   Common causes:
    -   Data are not colocated with the training process, causing I/O latency (this might happen when reading training data over a network).
    -   Expensive online data preprocessing (consider doing this once offline and saving).
    -   Unintentional synchronization barriers that interfere with data pipeline prefetching. For example, when synchronizing metrics between the device and host in CommonLoopUtils ([link](https://github.com/google/CommonLoopUtils/blob/fea2518ada8814a78e1492023fd9f00edb0b0568/clu/metrics.py#L291)).
-   Common tips:
    -   Instrument the input pipeline to prefetch examples (e.g. [tf.data.Dataset.prefetch](https://www.tensorflow.org/guide/data_performance#prefetching)).
    -   Remove unused features/metadata from each example as early in the pipeline as possible.
    -   Increase the number of jobs generating examples for the input pipeline, for example by using the [tf.data service](https://www.tensorflow.org/api_docs/python/tf/data/experimental/service).
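The tips above combine naturally in a `tf.data` pipeline; in this minimal sketch the file pattern and parsing logic are placeholders:

```python
import tensorflow as tf

def parse_example(serialized):
    # Placeholder parser: keep only the fields training needs, dropping
    # unused features/metadata as early in the pipeline as possible.
    features = tf.io.parse_single_example(
        serialized, {"image": tf.io.FixedLenFeature([], tf.string),
                     "label": tf.io.FixedLenFeature([], tf.int64)})
    image = tf.io.decode_jpeg(features["image"])
    return image, features["label"]

ds = (tf.data.TFRecordDataset(tf.data.Dataset.list_files("train-*.tfrecord"))
      .map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)
      .shuffle(10_000)
      .batch(256)
      .prefetch(tf.data.AUTOTUNE))  # overlap input production with training
```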
### Evaluating model performance

***Summary:*** *Run evaluation at larger batch sizes than training. Run evaluations at regular step intervals, not regular time intervals.*

#### Evaluation settings

<details><summary><em>[Click to expand]</em></summary>

<br>

-   There are several settings in which we can evaluate the performance of our models.
    -   **Online evaluation** - metrics are collected when the model is serving predictions in a production environment.
    -   **Offline evaluation** - metrics are collected when the model is run on offline train/validation/test sets that are representative of the production environment.
    -   **Periodic evaluations** - metrics are collected during model training that might either be a proxy for the offline evaluation, and/or be computed on a subset of the data used in offline evaluation.
-   Online evaluation is the gold standard, but is often impractical during the model development phase.
-   Depending on the problem, offline evaluation can be fairly involved and computationally expensive.
-   Periodic evaluations are the most practical and economical choice, but may not fully represent the production environment.
    -   Our goal during periodic evaluation is to use an expedient proxy of the offline evaluation, without sacrificing the reliability of the signal we get during training.

</details>

#### Setting up periodic evaluations

<details><summary><em>[Click to expand]</em></summary>

<br>

-   We run periodic evaluations during training to monitor its progress in real time, to [facilitate retrospective model checkpoint selection](#saving-checkpoints-and-retrospectively-selecting-the-best-checkpoint), and so that we can [examine the training curves at the end of training](#examining-the-training-curves).
-   The simplest configuration is to perform both training and periodic evaluations within the same compute instance, periodically alternating between training and evaluation.
    -   In this case, the batch size used to perform evaluations should be *at least* as large as the batch size used for training, because model activations don't need to be maintained during evaluation, lowering the computational requirements per example.
-   Periodic evaluations should be done at regular step intervals, not time intervals.
    -   Evaluating based on time intervals can make it harder to interpret the training curves, especially when training may suffer from preemptions of the training jobs, network latency issues, etc.
-   Periodicity in validation/test metrics (when using a shuffled train/validation/test split) can indicate implementation bugs such as test data having overlap with training data, or training data not being properly shuffled. Evaluating at regular step intervals can make these issues easier to catch.
-   Partial batches can occur when the size of the evaluation set is not divisible by the batch size. Ensure that the padded examples are correctly weighted to prevent the loss function from being biased by them. Often, these padded examples can be given a weight of zero (see the sketch at the end of this section).
-   Save sufficient information per evaluation to support offline analysis. Ideally, we would save predictions on a selection of individual examples, since they can be invaluable for debugging.
    -   Generating artifacts like [SavedModels](https://www.tensorflow.org/guide/saved_model) makes it easy to do ad-hoc model inspection after evaluation jobs finish.
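Referring to the padded-batch bullet above, here is a minimal sketch of zero-weighting padded examples so they do not bias the average loss; the shapes and loss values are illustrative:

```python
import numpy as np

def padded_mean_loss(per_example_loss, weights):
    """Weighted mean loss where padded examples carry weight 0."""
    return np.sum(per_example_loss * weights) / np.sum(weights)

# Final partial batch: 3 real examples padded out to a batch of 4.
loss = np.array([0.7, 0.2, 0.4, 0.0])   # last entry is a padded example
weights = np.array([1.0, 1.0, 1.0, 0.0])
print(padded_mean_loss(loss, weights))  # 0.4333..., ignoring the padding
```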
</details>

#### Choosing a sample for periodic evaluation

<details><summary><em>[Click to expand]</em></summary>

<br>

-   The periodic evaluation job might not run fast enough to compute metrics on the full offline evaluation set in a reasonable amount of time. This often necessitates sampling data for periodic evaluation.
-   We consider the following factors when constructing a sampled dataset:
    -   <ins>Sample size</ins>
        -   Check that the performance computed on the sampled dataset used by the periodic job matches the performance on the whole offline evaluation set, i.e. that there is no skew between the sampled set and the full dataset.
        -   The dataset used for periodic evaluation should be small enough that it’s easy to generate model predictions over its entirety, but large enough that improvements to the model can be accurately measured (i.e. not overwhelmed by label noise).
        -   It should also be large enough to accommodate multiple such evaluations across trials in sequence and still produce accurate estimates, that is, to avoid adaptively "fitting" the validation set over time in a way that doesn't generalize to a held-out test set. However, this consideration is rarely a practical concern.
    -   <ins>Imbalanced datasets</ins>
        -   For imbalanced datasets, performance on rare classes of examples will often be noisy.
        -   For datasets with a small number of examples in a class label, log the number of examples predicted correctly to get more insight into accuracy improvements (a 0.05 sensitivity improvement sounds exciting, but was it just one more example classified correctly?).

</details>

### Saving checkpoints and retrospectively selecting the best checkpoint

***Summary:*** *Run training for a fixed number of steps and retrospectively choose the best checkpoint from the run.*

-   Most deep learning frameworks support [model checkpointing](https://flax.readthedocs.io/en/latest/api_reference/flax.training.html). That is, the current state of the model is periodically preserved on disk. This allows the training job to be resilient to compute instance interruptions.
-   The best checkpoint is often not the last checkpoint, particularly when the validation set performance does not continue to increase over time but rather fluctuates about a particular value.
-   Set up the pipeline to keep track of the N best checkpoints seen so far during training. At the end of training, model selection is then a matter of choosing the best checkpoint seen during training. We call this **retrospective optimal checkpoint selection**.
-   Supporting prospective early stopping is usually not necessary, since we’re pre-specifying a trial budget and are preserving the N best checkpoints seen so far.
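A minimal sketch of tracking the N best checkpoints with a heap; the save/delete calls are placeholders for whatever checkpointing mechanism the framework provides:

```python
import heapq

class BestCheckpoints:
    """Keep the N best checkpoints seen so far (lower validation error wins)."""

    def __init__(self, n=5):
        self.n = n
        self._heap = []  # max-heap on error via negation: worst stays on top

    def update(self, step, valid_error):
        # save_checkpoint(step)  # placeholder for framework checkpointing
        heapq.heappush(self._heap, (-valid_error, step))
        if len(self._heap) > self.n:
            _, evicted = heapq.heappop(self._heap)  # drop the worst kept one
            # delete_checkpoint(evicted)  # placeholder

    def best(self):
        neg_err, step = max(self._heap)
        return -neg_err, step

tracker = BestCheckpoints(n=3)
for step, err in [(100, 0.30), (200, 0.25), (300, 0.27), (400, 0.24)]:
    tracker.update(step, err)
print(tracker.best())  # (0.24, 400)
```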
Untracked experiments might as\n    well not exist.\n\n### Batch normalization implementation details\n\n***Summary:*** *Nowadays batch norm can often be replaced with LayerNorm, but in\ncases where it cannot, there are tricky details when changing the batch size or\nnumber of hosts.*\n\n-   Batch norm normalizes activations using their mean and variance over the\n    current batch, but in the multi-device setting these statistics are\n    different on each device unless explicitly synchronized.\n-   Anecdotal reports (mostly on ImageNet) say calculating these normalizing\n    statistics using only ~64 examples actually works better in practice (see\n    Ghost Batch Norm from [this paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.08741)).\n-   Decoupling the total batch size and the number of examples used to calculate\n    batch norm statistics is particularly useful for batch size comparisons.\n-   Ghost batch norm implementations do not always correctly handle the case\n    where the per-device batch size > virtual batch size. In this case we'd\n    actually need to subsample the batch on each device in order to get the\n    proper number of batch norm statistic examples.\n-   Exponential moving averages used in test mode batch norm are just a linear\n    combination of training statistics, so these EMAs only need to be\n    synchronized before saving them in checkpoints. However, some common\n    implementations of batch norm do not synchronize these EMAs and only save\n    the EMA from the first device.\n\n### Considerations for multi-host pipelines\n\n***Summary:*** *for logging, evals, RNGs, checkpointing, and data sharding,\nmulti-host training can make it very easy to introduce bugs!*\n\n-   Ensure the pipeline is only logging and checkpointing on one host.\n-   Make sure before evaluation or checkpointing is run, the batch norm\n    statistics are synchronized across hosts.\n-   It is critical to have RNG seeds that are the same across hosts (for model\n    initialization), and seeds that are different across hosts (for data\n    shuffling\u002Fpreprocessing), so make sure to mark them appropriately.\n-   Sharding data files across hosts is usually recommended for improved\n    performance.\n\n## FAQs\n\n### What is the best learning rate decay schedule family?\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n-   It’s an open problem. It’s not clear how to construct a set of rigorous\n    experiments to confidently answer what the \"best\" LR decay schedule is.\n-   Although we don't know the best schedule family, we're confident that it’s\n    important to have some (non-constant) schedule and that tuning it matters.\n-   Different learning rates work best at different times during the\n    optimization process. 
</details>

### Which learning rate decay should I use as a default?

<details><summary><em>[Click to expand]</em></summary>
<br>

-   Our preference is either linear decay or cosine decay, and a bunch of other schedule families are probably good too.

</details>

### Why do some papers have complicated learning rate schedules?

<details><summary><em>[Click to expand]</em></summary>
<br>

-   It’s not uncommon to see papers with complicated piecewise learning rate (LR) decay schedules.
-   Readers often wonder how the authors arrived at such a complicated schedule.
-   Many complicated LR decay schedules are the result of tuning the schedule as a function of the validation set performance in an ad hoc way:
    1.  Start a single training run with some simple LR decay (or a constant learning rate).
    2.  Keep training until the performance seems to stagnate. If this happens, pause training and resume it with a (perhaps steeper) LR decay schedule or a smaller constant learning rate from this point. Repeat this process until the conference/launch deadline.
-   Blithely copying the resulting *schedule* is generally not a good idea since the best particular schedule will be sensitive to a host of other hyperparameter choices.
    -   Better to copy the *algorithm* that produced the schedule, although this is rarely possible when arbitrary human judgment produced the schedule.
-   This type of validation-error-sensitive schedule is fine to use if it can be fully automated, but human-in-the-loop schedules that are a function of validation error are brittle and not easily reproducible, so we recommend avoiding them.
    -   Before publishing results that used such a schedule, please try to make it fully reproducible.

</details>

### How should Adam’s hyperparameters be tuned?

<details><summary><em>[Click to expand]</em></summary>
<br>

-   As discussed above, making general statements about search spaces and how many points one should sample from the search space is very difficult. Note that not all the hyperparameters in Adam are equally important. The following rules of thumb correspond to different "budgets" for the number of trials in a study.
    -   If < 10 trials in a study, only tune the (base) learning rate.
    -   If 10-25 trials, tune the learning rate and $\beta_1$.
    -   If 25+ trials, tune the learning rate, $\beta_1$, and $\epsilon$.
    -   If one can run substantially more than 25 trials, additionally tune $\beta_2$.

</details>

### Why use quasi-random search instead of more sophisticated black box optimization algorithms during the exploration phase of tuning?

<details><summary><em>[Click to expand]</em></summary>

-   Quasi-random search (based on [low-discrepancy sequences](https://en.wikipedia.org/wiki/Low-discrepancy_sequence)) is our preference over fancier black box optimization tools when used as part of an iterative tuning process intended to maximize insight into the tuning problem (what we refer to as the "exploration phase"). Bayesian optimization and similar tools are more appropriate for the exploitation phase.
-   Quasi-random search based on randomly shifted low-discrepancy sequences can be thought of as "jittered, shuffled grid search", since it uniformly, but randomly, explores a given search space and spreads out the search points more than random search.
-   The advantages of quasi-random search over more sophisticated black box optimization tools (e.g. Bayesian optimization, evolutionary algorithms) include:
    1.  Sampling the search space non-adaptively makes it possible to change the tuning objective in post hoc analysis without rerunning experiments.
        -   For example, we usually want to find the best trial in terms of validation error achieved at any point in training. But the non-adaptive nature of quasi-random search makes it possible to find the best trial based on final validation error, training error, or some alternative evaluation metric without rerunning any experiments.
    2.  Quasi-random search behaves in a consistent and statistically reproducible way.
        -   It should be possible to reproduce a study from six months ago even if the implementation of the search algorithm changes, as long as it maintains the same uniformity properties. If using sophisticated Bayesian optimization software, the implementation might change in an important way between versions, making it much harder to reproduce an old search. It isn’t always possible to roll back to an old implementation (e.g. if the optimization tool is run as a service).
    3.  Its uniform exploration of the search space makes it easier to reason about the results and what they might suggest about the search space.
        -   For example, if the best point in the traversal of quasi-random search is at the boundary of the search space, this is a good (but not foolproof) signal that the search space bounds should be changed. [This section](#identifying-bad-search-space-boundaries) goes into more depth. However, an adaptive black box optimization algorithm might have neglected the middle of the search space because of some unlucky early trials, even if it happens to contain equally good points, since it is this exact sort of non-uniformity that a good optimization algorithm needs to employ to speed up the search.
    4.  Running different numbers of trials in parallel versus sequentially will not produce statistically different results when using quasi-random search (or other non-adaptive search algorithms), unlike with adaptive algorithms.
    5.  More sophisticated search algorithms may not always handle infeasible points correctly, especially if they aren't designed with neural network hyperparameter tuning in mind.
    6.  Quasi-random search is simple and works especially well when many tuning trials will be running in parallel.
        -   Anecdotally[^3], it is very hard for an adaptive algorithm to beat a quasi-random search that has 2X its budget, especially when many trials need to be run in parallel (and thus there are very few chances to make use of previous trial results when launching new trials).
        -   Without expertise in Bayesian optimization and other advanced black box optimization methods, we might not achieve the benefits they are, in principle, capable of providing. It is hard to benchmark advanced black box optimization algorithms in realistic deep learning tuning conditions. They are a very active area of current research, and the more sophisticated algorithms come with their own pitfalls for inexperienced users. Experts in these methods are able to get good results, but in high-parallelism conditions the search space and budget tend to matter a lot more.
-   That said, if our computational resources only allow a small number of trials to run in parallel and we can afford to run many trials in sequence, Bayesian optimization becomes much more attractive despite making our tuning results harder to interpret.

[^3]: Ben Recht and Kevin Jamieson [pointed out](http://www.argmin.net/2016/06/20/hypertuning/) how strong 2X-budget random search is as a baseline (the [Hyperband paper](https://jmlr.org/papers/volume18/16-558/16-558.pdf) makes similar arguments), but it is certainly possible to find search spaces and problems where state-of-the-art Bayesian optimization techniques crush random search that has 2X the budget. However, in our experience beating 2X-budget random search gets much harder in the high-parallelism regime since Bayesian optimization has no opportunity to observe the results of previous trials.

</details>

### Where can I find an implementation of quasi-random search?

<details><summary><em>[Click to expand]</em></summary>
<br>

-   [Open-Source Vizier](https://github.com/google/vizier) has an [implementation of quasi-random search](https://github.com/google/vizier/blob/main/vizier/_src/algorithms/designers/quasi_random.py). Set `algorithm="QUASI_RANDOM_SEARCH"` in [this usage example](https://oss-vizier.readthedocs.io/en/latest/guides/user/running_vizier.html).
-   An alternative implementation exists [here](https://github.com/mlcommons/algorithmic-efficiency/blob/main/algorithmic_efficiency/halton.py).
-   Both implementations above generate a Halton sequence for a given search space (intended to implement a shifted, scrambled Halton sequence as recommended in https://arxiv.org/abs/1706.03200).
-   If a quasi-random search algorithm based on a low-discrepancy sequence is not available, it is possible to substitute pseudo-random uniform search instead, although this is likely to be slightly less efficient.
    -   In 1-2 dimensions, grid search is also acceptable, although not in higher dimensions (see [Bergstra & Bengio, 2012](https://www.jmlr.org/papers/v13/bergstra12a.html)).
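As a further illustration (not the Vizier implementation linked above), SciPy's `scipy.stats.qmc` module provides a scrambled Halton sequence that can be mapped onto a log-uniform search space; the two-dimensional space below is an illustrative assumption:

```python
import numpy as np
from scipy.stats import qmc

sampler = qmc.Halton(d=2, scramble=True, seed=0)
unit_points = sampler.random(n=20)  # 20 trials in the unit square

# Map to log10 bounds: learning rate in [1e-5, 1e-1], weight decay in
# [1e-6, 1e-2], both searched on a log scale.
log_points = qmc.scale(unit_points, [-5, -6], [-1, -2])
trials = 10.0 ** log_points  # each row: (learning_rate, weight_decay)
print(trials[:3])
```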
</details>

### How many trials are needed to get good results with quasi-random search?

<details><summary><em>[Click to expand]</em></summary>
<br>

<p align="center">
<img src="https://raw.githubusercontent.com/google-research/tuning_playbook/main/assets/have_we_sampled_enough.png" width="49%" alt="A box plot showing the importance of sampling enough">
</p>

<p align="center"><b>Figure 3:</b> A ResNet-50 was tuned on ImageNet with 100 trials. Via bootstrapping, different amounts of tuning budget were simulated. Box plots of the best performances for each trial budget are plotted above.</p>

-   There is no way to answer this question in general, but we can look at specific examples.
-   As Figure 3 shows, the number of trials in a study can have a substantial impact on the results.
    -   Notice how large the interquartile ranges are when 6 trials were sampled, versus when 20 trials were sampled.
    -   Even with 20 trials, it is likely that the difference between especially lucky and unlucky studies will be larger than the typical variation between re-trains of this model on different random seeds, with fixed hyperparameters, which for this workload might be around +/- 0.1% on a validation error rate of ~23%.
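The bootstrapping procedure behind Figure 3 can be sketched as follows: resample trials with replacement from one large completed study to simulate many smaller studies at each budget. The synthetic errors below stand in for real study results:

```python
import numpy as np

rng = np.random.default_rng(0)
# Placeholder for the validation errors of a real 100-trial study.
study_errors = rng.normal(loc=0.25, scale=0.02, size=100)

for budget in (6, 20, 50):
    # Each simulated study draws `budget` trials and keeps the best error.
    best = [rng.choice(study_errors, size=budget, replace=True).min()
            for _ in range(1000)]
    q25, q75 = np.percentile(best, [25, 75])
    print(f"budget {budget:3d}: IQR of best error [{q25:.4f}, {q75:.4f}]")
```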
</details>

### How can optimization failures be debugged and mitigated?

<details><summary><em>[Click to expand]</em></summary>
<br>

***Summary:*** *If the model is experiencing optimization difficulties, it’s important to fix them before trying other things. Diagnosing and correcting training failures is an active area of research.*

<p align="center">
<img src="https://oss.gittoolsai.com/images/google-research_tuning_playbook_readme_2d62ecfebc75.png" width="80%" alt="Changing the strides in a single residual block in a WideResnet results in training instability.">
</p>

<p align="center"><b>Figure 4:</b> Changing the strides in a single residual block (2x2 -> 1x1) in a WideResnet results in training instability. This does not degrade performance at low learning rates, but high learning rates no longer train well due to the instability. Applying 1000 steps of learning rate warmup resolves this particular instance of instability, allowing stable training at a max learning rate of 0.1.</p>

#### Identifying unstable workloads

-   Any workload will become unstable if the learning rate is too large. Instability is only an issue when it forces us to use a learning rate that’s too small.
-   There are at least two types of training instability worth distinguishing:
    1.  Instability at initialization/early in training.
    2.  Sudden instability in the middle of training.
-   We can take a systematic approach to identifying stability issues in our workload:
    1.  Do a learning rate sweep and find the best learning rate lr*.
    2.  Plot training loss curves for learning rates just above lr*.
    3.  If the learning rates > lr* show loss instability (the loss goes up, not down, during periods of training), then it is likely that fixing the instability will result in better training.
-   Log the L2 norm of the full loss gradient during training; outlier values can result in spurious instability in the middle of training, and can inform how to pick a gradient/update clipping threshold.

**NOTE:** Some models show very early instability followed by a recovery that results in slow but stable training. **Common evaluation schedules can miss these issues by not evaluating frequently enough!**

To check for this, we can train for an abbreviated run of just ~500 steps using `lr = 2 * current best`, but evaluate every step.

<p align="center">
<img src="https://oss.gittoolsai.com/images/google-research_tuning_playbook_readme_0ec4845b9f33.png" width="80%" alt="Illustration of the value of more frequent evaluations at the start of training.">
</p>

<p align="center"><b>Figure 5:</b> Illustration of the value of more frequent evaluations at the start of training. Useful if there’s a suspicion that the model suffers from early training instability.</p>

#### Potential fixes for common instability patterns

-   Apply learning rate warmup
    -   Best for early training instability.
-   Apply gradient clipping
    -   Good for both early and mid training instability; may fix some bad inits that warmup cannot.
-   Try a new optimizer
    -   Sometimes Adam can handle instabilities that Momentum can’t. This is an active area of research.
-   Ensure that we’re using best practices/initializations for our model architecture (examples below).
    -   Add residual connections and normalization if the model doesn't contain them already.
    -   Normalization should be inside the residual branch, e.g. x + f(Norm(x)). Norm(x + f(x)) is known to cause issues.
    -   Try initializing residual branches to 0 (e.g. [ReZero init](https://arxiv.org/abs/2003.04887)).
-   Lower the learning rate
    -   This is a last resort.

#### Learning rate warmup

<p align="center">
<img src="https://oss.gittoolsai.com/images/google-research_tuning_playbook_readme_b597576e155f.png" width="80%" alt="An example of instability during a warmup period (note the horizontal axis log scale).">
</p>

<p align="center"><b>Figure 6:</b> An example of instability during a warmup period (note the horizontal axis log scale). In this case, 40k steps of warmup were needed for successful training.</p>
\n**NOTE:** Some models show very early instability followed by a recovery that\nresults in slow but stable training. **Common evaluation schedules can miss\nthese issues by not evaluating frequently enough!**\n\nTo check for this, we can train for an abbreviated run of just \\~500 steps using\n`lr = 2 * current best`, but evaluate every step.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_tuning_playbook_readme_0ec4845b9f33.png\" width=\"80%\" alt=\"Illustration of the value of more frequent evaluations at the start of\ntraining.\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cb>Figure 5:\u003C\u002Fb> Illustration of the value of more frequent evaluations at the start of training. Useful if there’s a suspicion that the model suffers from early training instability.\u003C\u002Fp>\n\n#### Potential fixes for common instability patterns\n\n-   Apply learning rate warmup\n    -   Best for early training instability.\n-   Apply gradient clipping\n    -   Good for both early and mid training instability; may fix some bad inits\n        that warmup cannot.\n-   Try a new optimizer\n    -   Sometimes Adam can handle instabilities that Momentum can’t. This is an\n        active area of research.\n-   We can ensure that we’re using best practices\u002Finitializations for our model\n    architecture (examples below).\n    -   Add residual connections and normalization if the model doesn't contain\n        them already.\n    -   Normalization should be inside the residual branch, e.g. x + f(Norm(x));\n        Norm(x + f(x)) is known to cause issues.\n    -   Try initializing residual branches to 0 (e.g.\n        [ReZero init](https:\u002F\u002Farxiv.org\u002Fabs\u002F2003.04887)).\n-   Lower the learning rate\n    -   This is a last resort.\n\n#### Learning rate warmup\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_tuning_playbook_readme_b597576e155f.png\" width=\"80%\" alt=\"An example of instability during a warmup period (note the horizontal axis log\nscale).\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cb>Figure 6:\u003C\u002Fb> An example of instability during a warmup period (note the horizontal axis log scale). 40k steps of warmup were needed for successful training in this case.\u003C\u002Fp>\n\n##### When to apply learning rate warmup\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_tuning_playbook_readme_903b5b57a5a2.png\" width=\"49%\" alt=\"Axis plot for model with instability\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cb>Figure 7a:\u003C\u002Fb> An example of a hyperparameter axis plot for a model exhibiting training instability. The best learning rate is at the edge of what is feasible. An \"infeasible\" trial is defined as one that either produces NaNs or uncharacteristically high values of the loss.\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_tuning_playbook_readme_b8f315d09bbb.png\" width=\"49%\" alt=\"Loss curve for model with instability\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cb>Figure 7b:\u003C\u002Fb> The training loss of a model trained with a learning rate where we see instability.\u003C\u002Fp>\n\n-   Figure 7a shows a hyperparameter axis plot that indicates a model\n    experiencing optimization instabilities, because the best learning rate is\n    right at the edge of instability.\n-   Figure 7b shows how this can be double-checked by examining the training\n    loss of a model trained with a learning rate either 5x or 10x larger than\n    this peak. If that plot shows a sudden rise in the loss after a steady\n    decline (e.g. at step \\~10k in the figure above), then the model likely\n    suffers from optimization instability.\n\n##### How to apply learning rate warmup\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_tuning_playbook_readme_a7fd3b70ea1f.png\" width=\"80%\" alt=\"Beneficial effect of warmup on training instabilities\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cb>Figure 8:\u003C\u002Fb> Beneficial effect of learning rate warmup on addressing training instabilities.\u003C\u002Fp>\n\n-   Per the section immediately above, we assume that the practitioner has\n    already identified the learning rate at which the model becomes unstable.\n    This is the `unstable_base_learning_rate`.\n-   Warmup involves prepending a learning rate schedule that ramps up the\n    learning rate from 0 to some stable `base_learning_rate` that is at least\n    one order of magnitude larger than `unstable_base_learning_rate`. The\n    default would be to try a `base_learning_rate` that’s 10x\n    `unstable_base_learning_rate`, although note that it’d be possible to run\n    this entire procedure again for something like 100x\n    `unstable_base_learning_rate`. The specific schedule (sketched in code after\n    this list) is:\n    -   Ramp up from 0 to `base_learning_rate` over `warmup_steps`.\n    -   Train at a constant rate for `post_warmup_steps`.\n-   Our goal is to find the shortest number of `warmup_steps` that allows us to\n    access peak learning rates that are much higher than\n    `unstable_base_learning_rate`.\n-   So for each `base_learning_rate`, we need to tune `warmup_steps` and\n    `post_warmup_steps`. It’s usually fine to set `post_warmup_steps` to be\n    `2*warmup_steps`.\n-   Warmup can be tuned independently of an existing decay schedule.\n    `warmup_steps` should be swept at a few different orders of magnitude. For\n    example, a study could try [10, 10\u003Csup>3\u003C\u002Fsup>, 10\u003Csup>4\u003C\u002Fsup>,\n    10\u003Csup>5\u003C\u002Fsup>]. The largest feasible point shouldn't be more than 10% of\n    `max_train_steps`.\n-   Once a `warmup_steps` that doesn't blow up training at `base_learning_rate`\n    has been established, it should be applied to the baseline model.\n    Essentially, we prepend this schedule onto the existing schedule, and use\n    the optimal checkpoint selection discussed above to compare this experiment\n    to the baseline. For example, if we originally had 10,000 `max_train_steps`\n    and did `warmup_steps` for 1000 steps, the new training procedure should run\n    for 11,000 steps total.\n-   If long `warmup_steps` are required for stable training (>5% of\n    `max_train_steps`), `max_train_steps` may need to be increased to account\n    for this.\n-   There isn't really a \"typical\" value across the full range of workloads.\n    Some models only need 100 steps, while others (particularly transformers)\n    may need 40k+.\n
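\nA minimal sketch of this schedule, assuming a linear ramp (the endpoints and durations follow the text above; the exact ramp shape is our assumption, and any similarly paced monotone ramp should behave comparably):\n\n```python\n# Sketch: ramp linearly from 0 to base_learning_rate over warmup_steps,\n# then hold the learning rate constant (for post_warmup_steps, typically\n# 2 * warmup_steps, before any existing decay schedule resumes).\ndef warmup_schedule(step, base_learning_rate, warmup_steps):\n    ramp = min(1.0, step \u002F warmup_steps)\n    return base_learning_rate * ramp\n\n# Example: try base_learning_rate = 10 * unstable_base_learning_rate,\n# with warmup_steps swept over a few orders of magnitude.\n```\n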
\n#### Gradient clipping\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_tuning_playbook_readme_6b07e7c74517.png\" width=\"80%\" alt=\"Gradient clipping on early training instabilities\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cb>Figure 9:\u003C\u002Fb> Illustration of gradient clipping correcting early training instability.\u003C\u002Fp>\n\n-   Gradient clipping is most useful when large or outlier gradient issues\n    occur.\n-   Clipping can fix either early training instability (large gradient norm\n    early) or mid training instability (sudden gradient spikes mid training).\n-   Sometimes longer warmup periods can correct instabilities that clipping does\n    not: see [this section above](#How-to-apply-learning-rate-warmup).\n    -   🤖 What about clipping during warmup?\n-   The ideal clip threshold is just above the \"typical\" gradient norm.\n-   Here’s an example of how gradient clipping could be done:\n    -   If the norm of the gradient $\\left | g \\right |$ is greater than the\n        gradient clipping threshold $\\lambda$, then do ${g}'= \\lambda \\times \\frac{g}{\\left | g \\right |}$ where ${g}'$ is the new gradient.\n-   Log the unclipped gradient norm during training. By default, generate:\n    -   A plot of gradient norm vs step\n    -   A histogram of gradient norms aggregated over all steps\n-   Choose a gradient clipping threshold based on the 90th percentile of\n    gradient norms.\n    -   The threshold will be workload dependent, but the 90th percentile is a\n        good starting point. If it doesn't work, this threshold can be tuned.\n    -   🤖 What about some sort of adaptive strategy?\n-   If we try gradient clipping and the instability issues remain, we can clip\n    harder (i.e. make the threshold smaller).\n-   Extremely aggressive gradient clipping (say, with >50% of the updates\n    getting clipped) is in essence a strange way of reducing the learning rate.\n    If we find ourselves clipping that aggressively, we should probably just cut\n    the learning rate instead.\n
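\nThe rule above, as code (a sketch assuming numpy-array gradients; the percentile-based threshold choice follows the logging bullets):\n\n```python\n# Sketch: clip the gradient by its global L2 norm. If the norm exceeds\n# clip_threshold, rescale the gradient to have norm exactly clip_threshold.\nimport numpy as np\n\ndef clip_by_global_norm(grads, clip_threshold):\n    global_norm = np.sqrt(sum(float(np.vdot(g, g)) for g in grads))\n    # min(1, threshold \u002F norm) leaves small gradients untouched and\n    # rescales large ones; max() guards against division by zero.\n    scale = min(1.0, clip_threshold \u002F max(global_norm, 1e-12))\n    return [g * scale for g in grads]\n\n# clip_threshold = np.percentile(grad_norm_history, 90) is one way to\n# pick the threshold from logged unclipped norms.\n```\n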
\n\u003C\u002Fdetails>\n\n### Why do you call the learning rate and other optimization parameters hyperparameters? They are not parameters of any prior distribution.\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\u003Cbr>\n\n-   It is true that the term \"hyperparameter\" has a precise\n    [meaning](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHyperparameter) in Bayesian machine\n    learning, and referring to the learning rate and most of the other parameters\n    we tune in deep learning as \"hyperparameters\" is an abuse of terminology.\n-   We would prefer to use the term \"metaparameter\" for learning rates,\n    architectural parameters, and all the other things we tune in deep learning,\n    since it avoids the potential for confusion that comes from misusing the\n    word \"hyperparameter\" (confusion that is especially likely when discussing\n    Bayesian optimization, where the probabilistic response surface models have\n    their own true hyperparameters).\n-   Unfortunately, although potentially confusing, the term \"hyperparameter\" has\n    become extremely common in the deep learning community.\n-   Therefore, for a document such as this one, intended for a wide audience\n    that includes many people who are unlikely to be aware of this technicality,\n    we made the choice to contribute to one source of confusion in the\n    field in hopes of avoiding another.\n-   That said, we might make a different choice when publishing a research\n    paper, and we would encourage others to use \"metaparameter\" instead in most\n    contexts.\n\n\u003C\u002Fdetails>\n\n### Why shouldn't the batch size be tuned to directly improve validation set performance?\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\u003Cbr>\n\n-   Changing the batch size *without changing any other details of the training pipeline* will often affect the validation set performance.\n-   However, the difference in validation set performance between two batch sizes typically goes away if the training pipeline is optimized independently for each batch size.\n-   The hyperparameters that interact most strongly with the batch size, and therefore are most important to tune separately for each batch size, are the optimizer hyperparameters (e.g. learning rate, momentum) and the regularization hyperparameters.\n    -   Smaller batch sizes introduce more noise into the training algorithm due to sample variance, and this noise can have a regularizing effect. Thus, larger batch sizes can be more prone to overfitting and may require stronger regularization and\u002For additional regularization techniques.\n-   In addition, [the number of training steps may need to be adjusted](#choosing-the-batch-size-to-minimize-training-time) when changing the batch size.\n-   Once all these effects are taken into account, there is currently no convincing evidence that the batch size affects the maximum achievable\n
    validation performance (see [Shallue et al., 2018](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.03600)).\n\n\u003C\u002Fdetails>\n\n### What are the update rules for all the popular optimization algorithms?\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n#### Stochastic gradient descent (SGD)\n\n$$\\theta_{t+1} = \\theta_{t} - \\eta_t \\nabla \\mathcal{l}(\\theta_t)$$\n\n#### Momentum\n\n$$v_0 = 0$$\n\n$$v_{t+1} = \\gamma v_{t} + \\nabla \\mathcal{l}(\\theta_t)$$\n\n$$\\theta_{t+1} = \\theta_{t} - \\eta_t v_{t+1}$$\n\n#### Nesterov\n\n$$v_0 = 0$$\n\n$$v_{t+1} = \\gamma v_{t} + \\nabla \\mathcal{l}(\\theta_t)$$\n\n$$\\theta_{t+1} = \\theta_{t} - \\eta_t( \\gamma v_{t+1} + \\nabla \\mathcal{l}(\\theta_{t}))$$\n\n#### RMSProp\n\n$$v_0 = 1 \\text{,} m_0 = 0$$\n\n$$v_{t+1} = \\rho v_{t} + (1 - \\rho) \\nabla \\mathcal{l}(\\theta_t)^2$$\n\n$$m_{t+1} = \\gamma m_{t} + \\frac{\\eta_t}{\\sqrt{v_{t+1} + \\epsilon}}\\nabla \\mathcal{l}(\\theta_t)$$\n\n$$\\theta_{t+1} = \\theta_{t} - m_{t+1}$$\n\n#### ADAM\n\n$$m_0 = 0 \\text{,} v_0 = 0$$\n\n$$m_{t+1} = \\beta_1 m_{t} + (1 - \\beta_1) \\nabla \\mathcal{l} (\\theta_t)$$\n\n$$v_{t+1} = \\beta_2 v_{t} + (1 - \\beta_2) \\nabla \\mathcal{l}(\\theta_t)^2$$\n\n$$b_{t+1} = \\frac{\\sqrt{1 - \\beta_2^{t+1}}}{1 - \\beta_1^{t+1}}$$\n\n$$\\theta_{t+1} = \\theta_{t} - \\alpha_t \\frac{m_{t+1}}{\\sqrt{v_{t+1}} + \\epsilon} b_{t+1}$$\n\n#### NADAM\n\n$$m_0 = 0 \\text{,} v_0 = 0$$\n\n$$m_{t+1} = \\beta_1 m_{t} + (1 - \\beta_1) \\nabla \\mathcal{l} (\\theta_t)$$\n\n$$v_{t+1} = \\beta_2 v_{t} + (1 - \\beta_2) \\nabla \\mathcal{l} (\\theta_t)^2$$\n\n$$b_{t+1} = \\frac{\\sqrt{1 - \\beta_2^{t+1}}}{1 - \\beta_1^{t+1}}$$\n\n$$\\theta_{t+1} = \\theta_{t} - \\alpha_t \\frac{\\beta_1 m_{t+1} + (1 - \\beta_1) \\nabla \\mathcal{l} (\\theta_t)}{\\sqrt{v_{t+1}} + \\epsilon} b_{t+1}$$\n
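\nFor concreteness, the ADAM update above translates directly into code. The following is a didactic numpy sketch of those exact formulas, not a substitute for an optimizer library:\n\n```python\n# Sketch: one ADAM step for a parameter vector theta with gradient grad\n# at step t (0-indexed), following the update rules written above.\n# Initialize m and v to zeros of the same shape as theta.\nimport numpy as np\n\ndef adam_update(theta, grad, m, v, t, alpha,\n                beta1=0.9, beta2=0.999, epsilon=1e-8):\n    m = beta1 * m + (1 - beta1) * grad\n    v = beta2 * v + (1 - beta2) * grad ** 2\n    # Bias-correction factor b_{t+1} from the formulas above.\n    b = np.sqrt(1 - beta2 ** (t + 1)) \u002F (1 - beta1 ** (t + 1))\n    theta = theta - alpha * m \u002F (np.sqrt(v) + epsilon) * b\n    return theta, m, v\n```\n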
\n\u003C\u002Fdetails>\n\n## Acknowledgments\n\n-   We owe a debt of gratitude to Max Bileschi, Roy Frostig, Zelda Mariet, Stan\n    Bileschi, Mohammad Norouzi, Chris DuBois and Charles Sutton for reading the\n    manuscript and providing valuable feedback.\n-   We reused some experimental data for several plots that were originally\n    produced by Naman Agarwal for other joint research.\n-   We would like to thank Will Chen for invaluable advice on the presentation of the document.\n-   We would also like to thank Rohan Anil for useful discussions.\n\n## Citing\n\n```\n@misc{tuningplaybookgithub,\n  author = {Varun Godbole and George E. Dahl and Justin Gilmer and Christopher J. Shallue and Zachary Nado},\n  title = {Deep Learning Tuning Playbook},\n  url = {http:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ftuning_playbook},\n  year = {2023},\n  note = {Version 1.0}\n}\n```\n\n## Contributing\n\n-   This is not an officially supported Google product.\n\n-   We'd love to hear your feedback!\n\n    -   If you like the playbook, please [leave a star](https:\u002F\u002Fdocs.github.com\u002Fen\u002Fget-started\u002Fexploring-projects-on-github\u002Fsaving-repositories-with-stars#starring-a-repository)! Or email\n        deep-learning-tuning-playbook \\[at\\] googlegroups.com. Testimonials help\n        us justify creating more resources like this.\n    -   If anything seems incorrect, please file an issue to start a discussion.\n        For questions or other messages where an issue isn't appropriate, please\n        open a new discussion topic on GitHub.\n\n-   As discussed in the preamble, this is a living document. We anticipate\n    making periodic improvements, both small and large. If you’d like to be\n    notified, please watch our repository (see [instructions](https:\u002F\u002Fdocs.github.com\u002Fen\u002Faccount-and-profile\u002Fmanaging-subscriptions-and-notifications-on-github\u002Fsetting-up-notifications\u002Fconfiguring-notifications#configuring-your-watch-settings-for-an-individual-repository)).\n\n-   Please don't file a pull request without first coordinating with the authors\n    via the issue tracking system.\n\n### Contributor License Agreement\n\nContributions to this project must be accompanied by a Contributor License\nAgreement (CLA). You (or your employer) retain the copyright to your\ncontribution; this simply gives us permission to use and redistribute your\ncontributions as part of the project. Head over to\n\u003Chttps:\u002F\u002Fcla.developers.google.com\u002F> to see your current agreements on file or\nto sign a new one.\n\nYou generally only need to submit a CLA once, so if you've already submitted one\n(even if it was for a different project), you probably don't need to do it\nagain.\n\n### Code Reviews\n\nAll submissions, including submissions by project members, require review. We\nuse GitHub pull requests for this purpose. Consult\n[GitHub Help](https:\u002F\u002Fhelp.github.com\u002Farticles\u002Fabout-pull-requests\u002F) for more\ninformation on using pull requests.\n\n### Community Guidelines\n\nThis project follows\n[Google's Open Source Community Guidelines](https:\u002F\u002Fopensource.google\u002Fconduct\u002F).\n","# 深度学习调参手册\n\n*本文档并非 Google 官方支持的产品。*\n\n**Varun Godbole\u003Csup>&dagger;\u003C\u002Fsup>, George E. Dahl\u003Csup>&dagger;\u003C\u002Fsup>, Justin Gilmer\u003Csup>&dagger;\u003C\u002Fsup>, Christopher J. 
Shallue\u003Csup>&Dagger;\u003C\u002Fsup>, Zachary Nado\u003Csup>&dagger;\u003C\u002Fsup>**\n\n\n&dagger; Google Research，Brain 团队\n\n&Dagger; 哈佛大学\n\n## 目录\n\n-   [本文档面向哪些读者？](#who-is-this-document-for)\n-   [为何需要一份调参手册？](#why-a-tuning-playbook)\n-   [新项目启动指南](#guide-for-starting-a-new-project)\n    -   [模型架构的选择](#choosing-the-model-architecture)\n    -   [优化器的选择](#choosing-the-optimizer)\n    -   [批量大小的选择](#choosing-the-batch-size)\n    -   [初始配置的选择](#choosing-the-initial-configuration)\n-   [提升模型性能的科学方法](#a-scientific-approach-to-improving-model-performance)\n    -   [增量式调参策略](#the-incremental-tuning-strategy)\n    -   [探索与利用](#exploration-vs-exploitation)\n    -   [确定下一轮实验的目标](#choosing-the-goal-for-the-next-round-of-experiments)\n    -   [设计下一轮实验](#Designing-the-next-round-of-experiments)\n    -   [判断是否采纳训练流水线的改动或超参数配置](#Determining-whether-to-adopt-a-training-pipeline-change-or-hyperparameter-configuration)\n    -   [探索阶段结束后](#After-exploration-concludes)\n-   [确定每次训练的步数](#Determining-the-number-of-steps-for-each-training-run)\n    -   [非计算瓶颈时的训练时长决策](#Deciding-how-long-to-train-when-training-is-not-compute-bound)\n    -   [计算瓶颈时的训练时长决策](#Deciding-how-long-to-train-when-training-is-compute-bound)\n-   [训练流水线的其他指导](#Additional-guidance-for-the-training-pipeline)\n    -   [输入流水线的优化](#Optimizing-the-input-pipeline)\n    -   [模型性能评估](#evaluating-model-performance)\n    -   [保存检查点并事后选择最佳检查点](#Saving-checkpoints-and-retrospectively-selecting-the-best-checkpoint)\n    -   [设置实验跟踪](#Setting-up-experiment-tracking)\n    -   [批归一化实现细节](#Batch-normalization-implementation-details)\n    -   [多主机流水线的注意事项](#Considerations-for-multi-host-pipelines)\n-   [常见问题解答](#faqs)\n-   [致谢](#acknowledgments)\n-   [引用](#citing)\n-   [贡献](#contributing)\n\n## 本文档面向哪些读者？\n\n本文档面向对**最大化深度学习模型性能**感兴趣的工程师和研究人员（个人及团队）。我们假定读者具备机器学习和深度学习的基本知识。\n\n我们的重点在于**超参数调优的过程**。虽然我们也涉及深度学习训练的其他方面，例如流水线的实现与优化，但这些内容并非全面覆盖。\n\n我们假设所处理的机器学习问题属于监督学习，或者与之非常相似的问题（如自监督学习）。不过，本文档中的一些建议也可能适用于其他类型的问题。\n\n## 为什么需要一份调参指南？\n\n目前，在实践中让深度神经网络真正发挥出色效果，仍然需要耗费大量精力，并且很大程度上依赖于猜测和试错。更糟糕的是，人们用来获得良好深度学习结果的具体方法往往很少被记录下来。论文为了呈现更简洁的故事，通常会略过导致最终结果的过程；而从事商业项目的技术人员则很少有时间退后一步，总结并提炼出通用的流程。教科书倾向于回避实用指导，转而强调基础原理，即便作者具备丰富的应用经验，能够提供有价值的建议。在着手编写这份文档时，我们未能找到任何全面阐述“如何通过深度学习获得良好结果”的尝试。相反，我们只在博客文章和社交媒体上零星地看到一些建议，在研究论文的附录中瞥见一些技巧，偶尔还能找到关于某个特定项目或工作流的案例研究，但整体上仍充满困惑。深度学习专家与技能较弱的从业者之间，即使使用看似相似的方法，所取得的结果也存在巨大差距。与此同时，这些专家也坦承，他们的一些做法可能并没有充分的理论依据。随着深度学习逐渐成熟并对世界产生越来越大的影响，社区亟需更多涵盖实用方法的资源，尤其是那些对获得良好结果至关重要的细节。\n\n我们是一支由五位研究人员和工程师组成的团队，多年来一直从事深度学习相关工作，其中一些人早在2006年就开始涉足这一领域。我们曾将深度学习应用于从语音识别到天文学等各个领域的实际问题，并在此过程中积累了丰富的经验。本文正是基于我们在训练神经网络、培训新入职的机器学习工程师以及为同事提供深度学习实践指导方面的亲身经历而撰写而成。尽管看到深度学习从少数学术实验室中的研究方法，发展成为支撑数十亿人日常使用的产品的核心技术，令人倍感欣慰，但作为一门工程学科，深度学习仍然处于起步阶段。我们希望这份文档能够激励更多人共同推动该领域实验协议的系统化。\n\n本文源于我们试图梳理自身在深度学习实践中的方法论，因此它反映的是作者写作时的观点，而非某种客观真理。我们在超参数调优方面遇到的困难使这部分内容成为我们的重点，但我们也涵盖了工作中遇到的其他重要问题（或目睹他人犯下的错误）。我们的初衷是让这份文档成为一个不断生长和演进的活文档，随着我们认知的变化而更新。例如，关于调试和应对训练失败的内容，两年前我们还无法撰写，因为它基于近期的研究成果和持续的探索。不可避免地，部分建议需要根据新的研究成果和改进的工作流程进行更新。我们并不知道“最优”的深度学习配方，但在社区开始系统性地记录和讨论不同实践方案之前，我们也无法期望找到它。为此，我们鼓励读者在发现我们建议存在问题时，提出替代性建议，并附上充分的证据，以便我们能够不断完善这份指南。同时，我们也期待出现更多具有不同建议的替代性指南和手册，从而推动整个社区朝着最佳实践的方向迈进。最后，所有标有🤖表情符号的部分，都代表着我们希望进一步深入研究的领域。只有在尝试编写这份指南之后，我们才彻底意识到，深度学习从业者的日常工作流程中其实隐藏着大量有趣却长期被忽视的研究课题。\n\n## 新项目启动指南\n\n在调参过程中所做的许多决策，其实只需在项目初期做出一次，之后仅在情况发生变化时才需重新审视。\n\n以下指导基于以下假设：\n\n-   关键的问题定义、数据清洗等工作已经完成，此时投入时间优化模型架构和训练配置是合理的。\n-   已经搭建好用于训练和评估的工作流，能够方便地运行不同模型的训练和预测任务。\n-   已经选择了合适的评估指标，并将其落实到位。这些指标应尽可能贴近实际部署环境中的衡量标准。\n\n### 模型架构的选择\n\n***要点：*** *在启动新项目时，尽量复用已有的成熟模型。*\n\n-   首先选择一个经过验证、广泛使用的模型架构来快速落地。自定义模型可以在后续阶段再行构建。\n-   
模型架构通常包含多种超参数，用于决定模型的规模及其他细节（如层数、每层宽度、激活函数类型等）。\n    -   因此，选择架构实际上是在选择一系列具有不同超参数设置的模型。\n    -   模型超参数的选择问题将在[初始配置的选择](#choosing-the-initial-configuration)和[提升模型性能的科学方法](#a-scientific-approach-to-improving-model-performance)中详细讨论。\n-   尽量寻找一篇与当前问题尽可能接近的研究论文，并以该论文中的模型作为起点进行复现。\n\n### 选择优化器\n\n***总结：*** *从当前问题类型中最流行的优化器开始。*\n\n-   没有一种优化器在所有类型的机器学习问题和模型架构中都是“最佳”的。即使只是\n    [比较不同优化器的性能也是一项艰巨的任务](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.05446)。\n    🤖\n-   我们建议坚持使用经过验证的、流行的优化器，尤其是在启动新项目时。\n    -   理想情况下，选择与相同类型问题最常用的优化器。\n-   要准备好关注所选优化器的**\\*所有\\***超参数。\n    -   具有更多超参数的优化器可能需要更多的调优工作才能找到最佳配置。\n    -   这在项目的初期阶段尤为重要，因为我们正试图在将优化器超参数视为\n        [干扰参数](#identifying-scientific-nuisance-and-fixed-hyperparameters) 的同时，寻找其他各种超参数（例如架构超参数）的最佳值。\n    -   在项目的初始阶段，可能更适合先使用较简单的优化器（例如带有固定动量的SGD，或$\\epsilon$、$\\beta_{1}$和$\\beta_{2}$均固定的Adam），然后在后期再切换到更通用的优化器。\n-   我们喜欢的一些经过验证的优化器包括（但不限于）：\n    -   [带有动量的SGD](#what-are-the-update-rules-for-all-the-popular-optimization-algorithms)\n        （我们偏爱Nesterov变体）\n    -   [Adam和NAdam](#what-are-the-update-rules-for-all-the-popular-optimization-algorithms)，它们比带有动量的SGD更为通用。请注意，Adam有4个可调超参数\n        [且它们都可能很重要](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.05446)！\n        -   参见\n            [如何调整Adam的超参数？](#how-should-adams-hyperparameters-be-tuned)\n\n### 选择批量大小\n\n***总结：*** *批量大小决定了训练速度，不应被用来直接优化验证集性能。通常，理想的批量大小是现有硬件支持的最大批量大小。*\n\n-   批量大小是决定*训练时间*和*计算资源消耗*的关键因素。\n-   增大批量大小通常会缩短训练时间。这非常有益，因为它可以：\n    -   在固定时间内更充分地调优超参数，从而可能得到更好的最终模型。\n    -   缩短开发周期的延迟，使新想法能够更频繁地得到验证。\n-   增大批量大小可能会减少、增加或不改变资源消耗。\n-   批量大小*不应*被视为用于优化验证集性能的可调超参数。\n    -   只要所有超参数都已调优到位（尤其是学习率和正则化超参数），并且训练步数足够，那么无论使用何种批量大小，都应该能达到相同的最终性能（参见\n        [Shallue等，2018](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.03600)）。\n    -   请参阅[为什么不应该通过调整批量大小来直接提升验证集性能？](#why-shouldnt-the-batch-size-be-tuned-to-directly-improve-validation-set-performance)\n\n#### 确定可行的批量大小并估算训练吞吐量\n\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n-   对于给定的模型和优化器，通常会有一系列由现有硬件支持的批量大小。限制因素通常是加速器的显存。\n-   不幸的是，在不运行或至少编译完整训练程序的情况下，很难计算出哪些批量大小能够装入显存。\n-   最简单的方法通常是用不同的批量大小（例如2的幂次方）运行训练任务，每次只进行少量步骤，直到某个任务超出可用显存。\n-   对于每个批量大小，我们都应训练足够长的时间，以获得对*训练吞吐量*的可靠估计。\n\n\u003Cp align=\"center\">训练吞吐量 = （每秒处理的样本数）\u003C\u002Fp>\n\n\u003Cp align=\"center\">或者，等价地，就是\u003Cem>每步耗时\u003C\u002Fem>。\u003C\u002Fp>\n\n\u003Cp align=\"center\">每步耗时 = （批量大小） \u002F （训练吞吐量）\u003C\u002Fp>\n\n-   当加速器尚未饱和时，如果批量大小翻倍，训练吞吐量也应该翻倍（至少接近翻倍）。等价地，随着批量大小的增加，每步耗时应该保持恒定（至少接近恒定）。\n-   如果情况并非如此，则说明训练流水线存在瓶颈，例如I\u002FO或计算节点之间的同步问题。在继续之前，值得诊断并解决这些问题。\n-   如果训练吞吐量仅在达到某个最大批量大小时才停止增长，那么即使硬件支持更大的批量大小，我们也只需考虑不超过该最大批量大小的设置。\n    -   使用更大批量大小的所有优势都建立在训练吞吐量会随之增加的基础上。如果吞吐量没有增加，请先解决瓶颈问题，或者改用较小的批量大小。\n    -   **梯度累积**虽然可以模拟超过硬件支持的更大批量，但并不会带来任何吞吐量上的好处。因此，在实际应用中通常应避免使用。\n-   每当模型或优化器发生变化时（例如，不同的模型架构可能允许更大的批量大小装入显存），这些步骤可能都需要重复进行。\n\n\u003C\u002Fdetails>\n\n#### 选择最小化训练时间的批量大小\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n\u003Cp align=\"center\">训练时间 = （每步耗时） × （总步数）\u003C\u002Fp>\n\n-   对于所有可行的批量大小，我们通常可以认为每一步所需的时间大致恒定。这种情况发生在没有并行计算开销、且所有训练瓶颈已被诊断并解决时（有关如何识别训练瓶颈，请参阅\n    [前一节](#determining-the-feasible-batch-sizes-and-estimating-training-throughput)）。但在实践中，增加批量大小通常至少会带来一定的开销。\n-   随着批量大小的增加，达到固定性能目标所需的总步数通常会减少（前提是批量大小改变时所有相关超参数都经过重新调整；\n    [Shallue 等人, 2018](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.03600)）。\n    -   例如，将批量大小翻倍可能会使所需总步数减半。这被称为**完美缩放**。\n    -   完美缩放适用于所有小于临界批量大小的批量，超过该值则会进入收益递减阶段。\n    -   最终，继续增大批量大小将不再减少训练步数（但也不会增加）。\n-   
因此，能够最小化训练时间的批量大小通常是仍能减少所需训练步数的最大批量大小。\n    -   该批量大小取决于数据集、模型和优化器，而如何在不通过实验逐一确定的情况下计算出这一值，目前仍是一个开放问题。🤖\n    -   在比较不同批量大小时，需注意区分“样本预算”与“轮次预算”（即固定训练样本遍历次数的所有实验）和“步数预算”（即固定训练步数的所有实验）之间的区别。\n        -   使用轮次预算比较批量大小时，只能考察完美缩放区间，即便更大的批量大小仍可通过减少所需训练步数带来显著加速。\n    -   常常，现有硬件支持的最大批量大小会小于临界批量大小。因此，在未进行任何实验的情况下，一个较好的经验法则是尽可能使用最大的批量大小。\n-   如果增大批量大小反而导致训练时间延长，则没有必要再进一步增大。\n\n\u003C\u002Fdetails>\n\n#### 选择最小化资源消耗的批量大小\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n-   增大批量大小会带来两类资源成本：\n    1.  *前期成本*，例如购买新硬件或为实现多 GPU\u002F多 TPU 训练而重写训练流水线。\n    2.  *使用成本*，例如团队资源预算的支出、云服务提供商的计费、电力及维护费用等。\n-   若增大批量大小需要较高的前期成本，则可能最好将其推迟到项目成熟后再实施，以便更准确地评估成本效益。实施多主机并行训练程序可能会引入\n    [错误](#considerations-for-multi-host-pipelines) 和\n    [细微问题](#batch-normalization-implementation-details)，因此从一开始就采用较简单的流水线或许更为妥当。（另一方面，在早期阶段，如果需要进行大量调参实验，大幅缩短训练时间可能会非常有益）。\n-   我们将总使用成本（可能包含多种不同类型的成本）称为“资源消耗”。资源消耗可分解为以下组成部分：\n\n\u003Cp align=\"center\">资源消耗 = （每步资源消耗）×（总步数）\u003C\u002Fp>\n\n-   增大批量大小通常可以\n    [减少总步数](#choosing-the-batch-size-to-minimize-training-time)。至于资源消耗是增加还是减少，则取决于每步资源消耗的变化情况。\n    -   增大批量大小可能会*降低*资源消耗。例如，如果使用较大批量的每一步可以在与较小批量相同的硬件上运行（仅每步耗时略有增加），那么每步资源消耗的任何增加都可能被步数减少所抵消。\n    -   增大批量大小也可能*不改变*资源消耗。例如，若批量大小翻倍使所需步数减半、同时使用的 GPU 数量也翻倍，则以 GPU 小时计的总消耗将保持不变。\n    -   增大批量大小有时也会*增加*资源消耗。例如，若增大批量大小需要升级硬件设备，则每步资源消耗的增加可能会超过步数减少带来的节省。\n\n\u003C\u002Fdetails>\n\n#### 更改批量大小需要重新调整大多数超参数\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n-   大多数超参数的最优值对批量大小非常敏感。因此，更改批量大小通常意味着需要重新开始调参过程。\n-   与批量大小关联最紧密、因而需要针对每个批量大小单独调整的重要超参数包括优化器超参数（如学习率、动量）以及正则化超参数。\n-   在项目初期选择批量大小时请务必考虑到这一点。如果后续需要切换到不同的批量大小，为新批量重新调整所有超参数可能会非常困难、耗时且昂贵。\n\n\u003C\u002Fdetails>\n\n#### 批量归一化与批量大小的关系\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n-   批量归一化较为复杂，一般而言，其用于计算统计量的批量大小应与梯度计算所用的批量大小不同。有关详细讨论，请参阅\n    [批量归一化章节](#batch-normalization-implementation-details)。\n\n\u003C\u002Fdetails>\n\n### 选择初始配置\n\n-   在开始超参数调优之前，我们必须确定起点。这包括指定 (1) 模型配置（例如层数）、(2) 优化器的超参数（例如学习率），以及 (3) 训练步数。\n-   确定这一初始配置需要进行一些手动配置的训练运行，并通过试错来调整。\n-   我们的指导原则是找到一个简单、相对快速、资源消耗较低的配置，能够获得“合理”的结果。\n    -   “简单”意味着尽可能避免使用复杂的功能；这些功能可以随时在后期添加。即使这些复杂功能在未来被证明有用，在初始配置中加入它们也可能导致浪费时间去调优无益的特性，或者将不必要的复杂性固化下来。\n        -   例如，在引入复杂的衰减策略之前，先从固定的学习率开始。\n    -   选择一个快速且资源消耗少的初始配置，将使超参数调优更加高效。\n        -   例如，可以从较小的模型开始。\n    -   “合理”的性能取决于具体问题，但至少意味着训练后的模型在验证集上的表现要远优于随机猜测水平（尽管其性能可能仍然不足以部署）。\n-   选择训练步数时需要权衡以下矛盾：\n    -   一方面，增加训练步数可以提升性能，并使超参数调优更容易（参见 [Shallue et al. 
2018](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.03600)）。\n    -   另一方面，减少训练步数可以使每次训练运行更快、资源消耗更少，从而提高调优效率——缩短迭代周期，并允许更多实验并行进行。此外，如果最初选择了过大的步数预算，后续可能难以更改，例如当学习率调度已经针对该步数进行了调优时。\n\n## 提升模型性能的科学方法\n\n就本文而言，机器学习开发的最终目标是最大化已部署模型的效用。尽管不同应用场景的开发过程在许多方面存在差异（如所需时间、可用计算资源、模型类型等），但我们通常可以在任何问题上采用相同的基本步骤和原则。\n\n以下指导假设：\n\n-   已经存在一个完全运行的训练流水线，以及一个能够获得合理结果的配置。\n-   具有足够的计算资源来进行有意义的调优实验，并至少能够并行运行多个训练任务。\n\n### 增量式调参策略\n\n***摘要：*** *从简单配置开始，逐步改进，同时加深对问题的理解。确保每一次改进都有充分的证据支持，以避免引入不必要的复杂性。*\n\n-   我们的最终目标是找到能够最大化模型性能的配置。\n    -   在某些情况下，我们的目标是在固定截止日期前尽可能地提升模型性能（例如参加竞赛）。\n    -   在另一些情况下，则希望持续不断地改进模型（例如持续优化生产环境中使用的模型）。\n-   从理论上讲，我们可以通过算法自动搜索所有可能的配置空间来实现性能最大化，但这并不现实。\n    -   可能的配置空间极其庞大，目前尚不存在足够智能的算法能够在没有人工指导的情况下高效地搜索这一空间。\n-   大多数自动化搜索算法依赖于人工设计的“搜索空间”，该空间定义了要搜索的配置集合，而不同的搜索空间可能会产生显著影响。\n-   最有效的性能提升方式是从一个简单的配置开始，逐步添加功能并进行改进，同时不断加深对问题的理解。\n    -   我们在每一轮调优中都会使用自动化搜索算法，并随着理解的加深不断更新搜索空间。\n-   随着探索的深入，我们会自然而然地发现越来越好的配置，因此我们的“最佳”模型也会不断进步。\n    -   当我们更新最佳配置时，就称之为一次“发布”（这不一定对应于生产模型的实际上线）。\n    -   对于每一次发布，我们都必须确保变更有充分的证据支持——而不仅仅是基于幸运配置的偶然结果——以避免向训练流水线中引入不必要的复杂性。\n\n从宏观层面来看，我们的增量式调参策略包括重复以下四个步骤：\n\n1.  确定下一轮实验的适当目标范围。\n2.  设计并运行一组能够朝着该目标取得进展的实验。\n3.  从实验结果中学习有价值的信息。\n4.  考虑是否发布新的最佳配置。\n\n本节的剩余部分将更详细地探讨这一策略。\n\n### 探索与利用\n\n***摘要：*** *大多数情况下，我们的首要目标是深入了解问题。*\n\n-   尽管人们可能会认为我们应将大部分时间用于最大化验证集上的性能，但在实践中，我们通常会花更多的时间去理解问题的本质，而较少地单纯追求验证误差的最小化。\n    -   换句话说，我们主要进行“探索”，只有少量时间用于“利用”。\n-   从长远来看，若要最大化最终性能，深入理解问题至关重要。优先考虑洞察力而非短期收益，可以帮助我们：\n    -   避免盲目引入那些仅因偶然因素而在表现较好的实验中出现的改动。\n    -   确定哪些超参数对验证误差最为敏感，哪些超参数之间存在强烈的交互作用需要同时调整，以及哪些超参数相对不敏感，可以在后续实验中固定下来。\n    -   提出可尝试的新特性建议，例如在过拟合问题严重时引入新的正则化方法。\n    -   找出对模型无益的特征并将其移除，从而降低未来实验的复杂度。\n    -   判断超参数调优带来的改进是否已趋于饱和。\n    -   缩小搜索空间至最优值附近，以提高调优效率。\n-   当我们准备好专注于优化验证误差时，即使实验本身无法充分揭示调优问题的结构，也可以直接聚焦于验证误差的最小化。\n\n### 选择下一轮实验的目标\n\n***摘要：*** *每一轮实验都应有明确的目标，并且范围足够狭窄，以便实验能够切实朝着目标推进。*\n\n-   每一轮实验都应设定清晰的目标，且实验范围不宜过于宽泛，否则难以区分不同因素对结果的具体影响。\n-   常见的目标示例包括：\n    -   尝试对流水线进行潜在改进（如引入新的正则化方法、预处理方式等）。\n    -   探究某一特定模型超参数的影响（如激活函数的选择）。\n    -   盲目地最小化验证误差。\n\n### 设计下一轮实验\n\n***摘要：*** *针对实验目标，明确哪些超参数是研究对象（科学超参数），哪些是干扰因素（干扰超参数），哪些则是固定不变的超参数。设计一系列实验，在优化干扰超参数的同时，比较研究对象超参数的不同取值。合理选择干扰超参数的搜索空间，以在资源成本与科学价值之间取得平衡。*\n\n#### 确定科学超参数、干扰超参数和固定超参数\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n- 对于给定的目标，所有超参数都将属于以下三类之一：**科学超参数**、**干扰超参数**或**固定超参数**。\n  - 科学超参数是指我们试图衡量其对模型性能影响的超参数。\n  - 干扰超参数是指为了公平比较科学超参数的不同取值而需要进行优化的超参数。这类似于统计学中的“干扰参数”概念（[nuisance parameter](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FNuisance_parameter)）。\n  - 固定超参数则会在当前一轮实验中保持不变。这些超参数在比较科学超参数的不同取值时，其值不需要（或我们不希望其）改变。\n    - 通过为一组实验固定某些超参数，我们必须接受这样一个事实：从这些实验得出的结论可能不适用于固定超参数的其他设置。换句话说，固定超参数会为我们从实验中得出的任何结论带来一定的限制条件。\n- 例如，如果我们的目标是“确定增加隐藏层的数量是否能降低验证误差”，那么隐藏层的数量就是科学超参数。\n  - 学习率则是干扰超参数，因为只有为每种不同的隐藏层数量分别调优学习率，我们才能公平地比较不同隐藏层数量的模型（最优的学习率通常取决于模型架构）。\n  - 激活函数可以作为固定超参数，如果我们先前的实验已经表明最佳激活函数的选择与模型深度无关，或者我们愿意将关于隐藏层数量的结论仅限于这一特定的激活函数选择。当然，如果我们要为每种隐藏层数量单独调优激活函数，它也可以被视为干扰超参数。\n- 某个特定超参数究竟是科学超参数、干扰超参数还是固定超参数，并不是该超参数本身的固有属性，而是取决于实验目标。\n  - 例如，激活函数的选择可以是科学超参数（对于我们的问题，ReLU和tanh哪个更好？），也可以是干扰超参数（当我们允许使用几种不同的激活函数时，最好的5层模型是否优于最好的6层模型？），或者固定超参数（对于ReLU网络，在某个特定位置添加批归一化是否有帮助？）。\n- 在设计新一轮实验时，我们首先需要明确针对实验目标的科学超参数。\n  - 在这一阶段，我们将所有其他超参数视为干扰超参数。\n- 接着，我们会将部分干扰超参数转化为固定超参数。\n  - 如果资源无限，我们可以将所有非科学超参数都保留为干扰超参数，这样我们从实验中得出的结论就不会受到固定超参数取值的限制。\n  - 然而，我们尝试调优的干扰超参数越多，就越有可能无法为科学超参数的每个设置充分调优，从而导致实验结论出现偏差。\n    - 如下文所述（[如何在信息丰富与经济实惠之间取得平衡](#striking-a-balance-between-informative-and-affordable-experiments)），我们可以通过增加计算预算来缓解这一风险，但通常我们的资源上限往往不足以对所有非科学超参数进行全面调优。\n  - 当我们认为固定某个干扰超参数所带来的限制小于将其作为干扰超参数的成本时，就会选择将其固定。\n    - 
某个干扰超参数与科学超参数的交互越强烈，固定它的值就越不利。例如，权重衰减强度的最佳取值通常依赖于模型大小，因此假设权重衰减采用单一固定值来比较不同模型大小的意义并不大。\n- 尽管我们为每个超参数分配的类型取决于实验目标，但对于某些类别的超参数，我们仍有一些经验法则：\n  - 在各种优化器超参数中（如学习率、动量、学习率调度参数、Adam的beta参数等），至少有一部分会是干扰超参数，因为它们往往与其他变化的交互最为密切。\n    - 它们很少成为科学超参数，因为像“当前流水线的最佳学习率是多少？”这样的目标并不能提供太多洞见——最佳设置很可能随着下一次流水线的变更而改变。\n    - 虽然有时由于资源限制，或者当我们有强有力的证据表明它们与科学超参数无关时，可能会将其中一些固定，但我们通常应假定优化器超参数必须单独调优，以实现科学超参数不同设置之间的公平比较，因此不应将其固定。\n    - 此外，我们并没有先验理由偏好某一组优化器超参数的取值（例如，它们通常不会以任何方式影响前向传播或梯度计算的开销）。\n  - 相反，优化器的选择通常是科学超参数或固定超参数。\n    - 如果实验目标涉及对两种或多种不同优化器进行公平比较（例如，“确定在给定步数内哪种优化器能产生最低的验证误差”），那么优化器的选择就是科学超参数。\n    - 另一方面，出于多种原因，我们也可以将其设为固定超参数，包括：(1) 先前的实验表明，我们问题的最佳优化器对当前的科学超参数不敏感；以及\u002F或者 (2) 我们更倾向于使用这种优化器来比较科学超参数的不同取值，因为它产生的训练曲线更容易分析；以及\u002F或者 (3) 我们更喜欢使用这种优化器，因为它比其他优化器占用的内存更少。\n  - 正则化技术引入的超参数通常是干扰超参数，但是否采用正则化技术本身则是科学超参数或固定超参数。\n    - 例如，Dropout会增加代码复杂度，因此在决定是否使用它时，我们会将“不使用Dropout”与“使用Dropout”设为科学超参数，而Dropout率则为干扰超参数。\n    - 如果我们根据这次实验决定在流水线中加入Dropout，那么在未来的实验中，Dropout率就将成为干扰超参数。\n  - 架构相关的超参数往往是科学超参数或固定超参数，因为架构的变化会影响推理和训练成本、延迟以及内存需求。\n    - 例如，层的数量通常是科学超参数或固定超参数，因为它往往会对训练速度和内存使用产生显著影响。\n- 在某些情况下，干扰超参数和固定超参数的集合会依赖于科学超参数的取值。\n  - 例如，假设我们要确定Nesterov动量和Adam这两种优化器中哪一种能产生最低的验证误差。这里的科学超参数是`optimizer`，其取值为`{\"Nesterov_momentum\", \"Adam\"}`。当`optimizer=\"Nesterov_momentum\"`时，会引入干扰\u002F固定超参数`{learning_rate, momentum}`；而当`optimizer=\"Adam\"`时，则会引入干扰\u002F固定超参数`{learning_rate, beta1, beta2, epsilon}`。\n  - 那些仅在科学超参数取特定值时才存在的超参数被称为**条件超参数**。\n  - 我们不能仅仅因为两个条件超参数名称相同就认为它们是一样的！在上述例子中，名为`learning_rate`的条件超参数在`optimizer=\"Nesterov_momentum\"`和`optimizer=\"Adam\"`时是完全不同的超参数。尽管它们在这两种算法中的作用相似（但并不完全相同），但每种优化器中表现良好的取值范围通常相差几个数量级。\n\n\u003C\u002Fdetails>\n\n#### 创建一组研究\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n-   一旦我们确定了科学超参数和干扰超参数，就会设计一项“研究”或一系列研究，以逐步推进实验目标。\n    -   研究会指定一组需要运行的超参数配置，用于后续分析。每个配置称为一个“试验”。\n    -   创建研究通常涉及选择在不同试验中变化的超参数、这些超参数可以取哪些值（即“搜索空间”）、试验的数量，以及选择一种自动搜索算法来从搜索空间中采样相应数量的试验。或者，我们也可以手动指定一组超参数配置来创建研究。\n-   这些研究的目的是使用科学超参数的不同值来运行流水线，同时**“优化掉”**（或“优化过”）干扰超参数，从而使对科学超参数不同值之间的比较尽可能公平。\n-   在最简单的情况下，我们会为科学参数的每种配置单独进行一项研究，其中每项研究都会针对干扰超参数进行调优。\n    -   例如，如果我们的目标是从Nesterov动量和Adam中选择最佳优化器，我们可以创建一项研究，其中`optimizer=\"Nesterov_momentum\"`，干扰超参数为`{learning_rate, momentum}`；再创建另一项研究，其中`optimizer=\"Adam\"`，干扰超参数为`{learning_rate, beta1, beta2, epsilon}`。然后通过分别从两项研究中选出表现最好的试验来比较这两种优化器。\n    -   我们可以使用任何无梯度优化算法，包括贝叶斯优化或进化算法等，来优化干扰超参数，尽管[我们更倾向于](#why-use-quasi-random-search-instead-of-more-sophisticated-black-box-optimization-algorithms-during-the-exploration-phase-of-tuning)在调优的[探索阶段](#exploration-vs-exploitation)使用准随机搜索，因为它在这种情况下具有多种优势。[探索结束后](#after-exploration-concludes)，如果有最先进的贝叶斯优化软件可用，那将是我们的首选。\n-   在更复杂的情况下，如果我们希望比较大量科学超参数的取值，而逐一进行那么多独立研究并不现实，那么可以将科学参数与干扰超参数放在同一个搜索空间中，使用一种搜索算法在一个研究中同时采样科学和干扰超参数的值。\n    -   采用这种方法时，条件超参数可能会带来问题，因为除非所有科学超参数取值对应的干扰超参数集合都相同，否则很难定义一个统一的搜索空间。\n    -   在这种情况下，[我们更倾向于](#why-use-quasi-random-search-instead-of-more-sophisticated-black-box-optimization-algorithms-during-the-exploration-phase-of-tuning)使用准随机搜索而非更复杂的黑箱优化工具，因为这样能够确保对科学超参数的取值进行相对均匀的采样。无论使用哪种搜索算法，我们都必须确保它能均匀地搜索科学超参数。\n\n\u003C\u002Fdetails>\n\n#### 平衡信息性和经济性的实验\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n-   在设计一项或一系列研究时，我们需要在有限的预算内合理分配资源，以充分满足以下三个要求：\n    1.  比较足够多的科学超参数取值。\n    2.  在足够大的搜索空间内调优干扰超参数。\n    3.  
对干扰超参数的搜索空间进行足够密集的采样。\n-   我们在这三个要求上做得越好，就能从实验中获得越多的洞见。\n    -   尽可能多地比较科学超参数的取值，可以拓宽我们从实验中获得的洞察范围。\n    -   包含尽可能多的干扰超参数，并让每个干扰超参数在其尽可能宽的范围内变化，可以增加我们对这样一个事实的信心：对于科学超参数的每种配置，干扰超参数的“良好”取值**确实存在于**其搜索空间中。\n        -   否则，由于没有搜索到某些科学参数取值下可能存在更好干扰超参数值的区域，我们可能会对科学超参数的取值做出不公平的比较。\n    -   尽可能密集地采样干扰超参数的搜索空间，可以增加我们对这样一个事实的信心：只要干扰超参数的良好设置确实存在于我们的搜索空间中，搜索过程就一定能找到它们。\n        -   否则，由于某些取值在干扰超参数的采样过程中运气更好，我们可能会对科学超参数的取值做出不公平的比较。\n-   不幸的是，要在这三个维度中的任何一个方面取得改进，要么需要增加试验次数，从而提高资源成本，要么需要在其他某个维度上找到节省资源的方法。\n    -   每个问题都有其独特性及计算约束，因此如何在这三个要求之间分配资源，需要一定的领域知识。\n    -   在完成一项研究后，我们总是会评估该研究是否充分调优了干扰超参数（即是否在足够大的空间内进行了足够广泛的搜索），以便公平地比较科学超参数（具体细节请参阅[下方内容](#extracting-insight-from-experimental-results)）。\n\n\u003C\u002Fdetails>\n\n\n\n### 从实验结果中提取洞见\n\n***摘要：*** *除了努力实现每组实验的原始科学目标外，还应对照一份附加问题清单逐一检查；若发现问题，则需修改实验并重新运行。*\n\n-   归根结底，每组实验都有其特定目标，我们希望评估这些实验为此目标所提供的证据。\n    -   然而，如果我们提出正确的问题，往往会发现一些需要在实验取得实质性进展之前加以修正的问题。\n        -   如果不提出这些问题，我们可能会得出错误的结论。\n    -   由于运行实验可能成本较高，我们也希望借此机会从每组实验中挖掘其他有用的信息，即便这些信息与当前目标并非直接相关。\n-   在分析某组实验以推进其原始目标之前，我们应先问自己以下几个问题：\n    -   [搜索空间是否足够大？](#识别不良搜索空间边界)\n        -   如果某次研究中的最优点在一个或多个维度上接近搜索空间的边界，则说明搜索范围可能不够宽。在这种情况下，我们应该开展另一项扩大搜索空间的研究。\n    -   [我们是否已从搜索空间中采样了足够多的点？](#搜索空间采样不足)\n        -   如果没有，可以增加采样点数，或者适当降低调参目标的雄心程度。\n    -   每次研究中有多少比例的试验是**不可行的**（即发散、损失值极差，或因违反某些隐含约束而根本无法运行）？\n        -   当一项研究中“不可行”的点占很大比例时，我们应尝试调整搜索空间，避免采样到这类点；有时这甚至需要重新参数化搜索空间。\n        -   在某些情况下，大量不可行的点可能表明训练代码中存在缺陷。\n    -   [模型是否存在优化问题？](#如何调试和缓解优化失败)\n    -   [我们能从最佳试验的训练曲线中获得哪些启示？](#检查训练曲线)\n        -   例如，最佳试验的训练曲线是否显示出有问题的过拟合现象？\n-   如有必要，根据上述问题的答案，我们可以对最近一次研究（或一组研究）进行调整，以改进搜索空间和\u002F或增加试验次数，或采取其他纠正措施。\n-   一旦回答了上述问题，我们就可以继续评估实验为我们原始目标所提供的证据（例如，[评估某项改动是否有用](#通过隔离图检测改动是否有效)）。\n\n#### 识别不良搜索空间边界\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n-   如果从某个搜索空间中采样的最佳点非常接近其边界，那么这个搜索空间就值得怀疑。如果我们朝那个方向扩展搜索范围，很可能会找到更好的点。\n-   为了检查搜索空间的边界，我们通常会绘制所谓的**基础超参数轴图**，即以验证目标值为纵轴、某一超参数（如学习率）为横轴的图表。图表中的每个点都对应一次试验。\n    -   每次试验的验证目标值通常应为其在整个训练过程中达到的最优值。\n\n\u003Cp align=\"center\" id=\"figure-1\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_tuning_playbook_readme_125f971c0a71.png\" width=\"49%\" alt=\"不良搜索空间边界的示例\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_tuning_playbook_readme_b81cb982cf68.png\" width=\"49%\" alt=\"良好搜索空间边界的示例\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cb>图1：\u003C\u002Fb> 不良搜索空间边界与可接受搜索空间边界的示例。\u003C\u002Fp>\n\n-   [图1](#figure-1) 中的图表展示了误差率（越低越好）与初始学习率之间的关系。\n-   如果最佳点集中在搜索空间的边缘（在某个维度上），则可能需要扩大搜索空间的边界，直到观测到的最佳点不再靠近边界。\n-   通常，一项研究中会包含一些“不可行”的试验，这些试验要么发散，要么得到非常糟糕的结果（如上方图表中用红色叉号标记的点）。\n    -   如果所有学习率高于某个阈值的试验都不可行，且表现最好的试验的学习率恰好位于该区域的边缘，那么模型[可能因稳定性问题而无法使用更高的学习率](#如何调试和缓解优化失败)。\n\n\u003C\u002Fdetails>\n\n#### 搜索空间采样不足\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n-   总体而言，\n    [很难判断](#使用准随机搜索获得良好结果需要多少次试验)\n    搜索空间是否已被充分采样。🤖\n-   当然，运行更多试验总是更好，但这也意味着更高的成本。\n-   由于难以确定何时才算采样充分，我们通常会根据预算尽可能多地采样，并通过反复查看各种超参数轴图来校准自己对搜索空间中“良好”区域所含点数的直觉。\n\n\u003C\u002Fdetails>\n\n#### 检查训练曲线\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n***摘要：*** *检查训练曲线是一种简便的方法，可以帮助我们识别常见的失败模式，并据此优先安排下一步行动。*\n\n-   尽管在许多情况下，我们的实验主要目标只需关注每次试验的验证误差，但在将每次试验简化为一个单一数值时仍需谨慎，因为这可能会掩盖底层正在发生的重要细节。\n-   对于每项研究，我们始终会查看至少前几轮表现最佳试验的**训练曲线**（即在整个训练过程中，训练误差和验证误差随训练步数的变化趋势）。\n-   即使这并非解决主要实验目标所必需，检查训练曲线也是一种简便的方法，可以帮助我们识别常见的失败模式，并优先确定下一步应采取的行动。\n-   在检查训练曲线时，我们主要关注以下几个问题：\n-   
是否有试验表现出**严重过拟合**？\n    -   严重过拟合是指在训练过程中，验证误差开始*上升*的现象。\n    -   在通过选择每组科学超参数设置下的“最佳”试验来优化并消除干扰超参数的实验环境中，我们应至少检查与所比较的科学超参数设置对应的每一组最佳试验是否存在严重过拟合。\n        -   如果任何一组最佳试验出现严重过拟合，我们通常会在比较科学超参数值之前，重新运行实验，加入额外的正则化技术，或进一步调优现有的正则化参数。\n            -   如果科学超参数本身包含正则化参数，则此规则可能不适用，因为在这种情况下，较低强度的正则化参数设置导致严重过拟合并不令人意外。\n        -   使用常见的正则化技术（如丢弃法、标签平滑、权重衰减等）来减少过拟合通常较为简单，且不会显著增加代码复杂度或计算开销，因此在下一轮实验中加入一两种此类技术通常并无太大困难。\n        -   例如，若科学超参数是“隐藏层数量”，而使用最多隐藏层的最佳试验出现了严重过拟合，则我们通常会选择在加入更多正则化措施后再次尝试，而不是立即选择较少的隐藏层数量。\n        -   即使所有“最佳”试验均未表现出严重过拟合，如果其他试验中存在这种情况，仍然可能存在潜在问题。\n            -   选择最佳试验的过程会倾向于排除那些表现出严重过拟合的配置，而偏向于没有过拟合的配置。换句话说，它会倾向于采用更强的正则化。\n            -   然而，任何使训练效果变差的因素都可能起到正则化的作用，即便这些因素并非设计初衷。例如，降低学习率可以通过拖慢优化进程来实现正则化，但通常我们并不希望以这种方式调整学习率。\n            -   因此，我们必须意识到，针对每组科学超参数设置选出的“最佳”试验，其结果可能受到某些科学或干扰超参数不良取值的影响。\n-   在训练后期，训练误差或验证误差是否表现出较大的步间波动？\n    -   如果存在这种波动，可能会干扰我们比较不同科学超参数值的能力（因为每次试验最终停留在某个“幸运”或“不幸”的步数上），同时也会影响我们在生产环境中复现最佳试验结果的能力（因为生产模型可能不会恰好停在与研究中相同的“幸运”步数上）。\n    -   步间波动最常见的原因包括：批次间的随机性（由于每个批次从训练集中随机采样样本）、验证集规模较小，以及在训练后期使用了过高的学习率。\n    -   可能的解决方法包括：增大批次大小、获取更多的验证数据、使用学习率衰减策略，或采用Polyak平均法。\n-   在训练结束时，各试验的表现是否仍在持续改善？\n    -   如果是，这表明我们正处于[计算受限](#确定每次训练的步数)状态，此时可以通过[增加训练步数](#当训练处于计算受限状态时如何决定训练时长)或调整学习率调度来进一步提升性能。\n-   训练集和验证集上的性能是否在最终训练步数之前就已饱和？\n    -   如果是，这表明我们处于[非计算受限](#确定每次训练的步数)状态，此时可以考虑[减少训练步数](#当训练非计算受限时如何决定训练时长)。\n-   虽然无法一一列举，但从训练曲线中还可以观察到许多其他行为特征（例如，训练损失在训练过程中*增加*通常意味着训练流水线中存在错误）。\n\n\u003C\u002Fdetails>\n\n#### 使用隔离图检测变更是否有效\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n\n\u003Cp align=\"center\" id=\"figure-2\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_tuning_playbook_readme_3e7ce3a4ed0d.png\" width=\"49%\" alt=\"用于探究ResNet-50在ImageNet数据集上训练时最佳权重衰减值的隔离图。\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cb>图2：\u003C\u002Fb>用于探究ResNet-50在ImageNet数据集上训练时最佳权重衰减值的隔离图。\u003C\u002Fp>\n\n-   通常，一组实验的目标是比较科学超参数的不同取值。\n    -   例如，我们可能希望确定使验证误差最小的权重衰减值。\n-   **隔离图**是基本超参数轴图的一种特殊情况。隔离图上的每个点对应于在某些（或全部）干扰超参数上表现*最佳*的试验结果。\n    -   换句话说，我们在“优化掉”干扰超参数后绘制模型性能。\n-   隔离图使得对科学超参数的不同取值进行更直接的比较成为可能。\n-   例如，[图2](#figure-2)展示了在ImageNet数据集上训练的特定ResNet-50配置下，能够产生最佳验证性能的权重衰减值。\n    -   如果我们的目标是决定是否使用权重衰减，那么就需要将该图中的最佳点与不使用权重衰减的基线进行比较。为了公平比较，基线的学习率也应经过同样良好的调优。\n-   当我们拥有通过（准）随机搜索生成的数据，并且考虑为隔离图选择一个连续型超参数时，可以通过对基本超参数轴图的x轴值进行分桶，并取每个桶所定义的垂直切片中的最佳试验来近似得到隔离图。\n\n\u003C\u002Fdetails>\n\n#### 自动化生成通用性较强的图表\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n-   绘制图表所需的工作量越大，我们就越少去查看它们，因此有必要建立基础设施，尽可能自动地生成这些图表。\n-   至少，我们会自动为实验中变化的所有超参数生成基本超参数轴图。\n-   此外，我们还会自动为所有试验生成训练曲线，并尽可能简化查找每次研究中表现最好的几个试验及其训练曲线的过程。\n-   还有许多其他潜在的图表和可视化方式可以添加，以提供帮助。尽管上述方法是一个不错的起点，但正如杰弗里·辛顿所说：“每当你绘制一个新的图表时，你就会学到新的东西。”\n\n\u003C\u002Fdetails>\n\n\n\n### 判断是否采用训练流程的改动或超参数配置\n\n***摘要：*** *在决定是否对模型或训练流程进行更改，或者在未来采用新的超参数配置时，我们需要意识到结果中存在的不同来源的变异性。*\n\n-   当我们试图改进模型时，可能会发现某个候选方案最初比现有配置获得了更低的验证误差，但重复实验后却发现并无持续优势。非正式地讲，我们可以将可能导致这种不一致结果的重要变异性来源归纳为以下几类：\n    -   **训练过程方差**、**重训练方差**或**试验方差**：指使用相同超参数但不同随机种子进行的多次训练之间观察到的差异。\n        -   例如，不同的随机初始化、训练数据打乱顺序、dropout掩码、数据增强操作模式以及并行计算操作的执行顺序等，都可能是试验方差的来源。\n    -   **超参数搜索方差**或**研究方差**：由我们选择超参数的过程所导致的结果差异。\n        -   例如，我们可能使用相同的搜索空间运行同一实验，但分别采用两个不同的准随机搜索种子，最终选出的超参数值却不同。\n    -   **数据收集与采样方差**：由于训练、验证和测试数据的随机划分，或更一般意义上的训练数据生成过程所带来的变异性。\n-   使用严谨的统计检验来比较基于有限验证集估算的验证误差固然重要，但往往仅凭试验方差就足以在使用相同超参数设置的两个不同模型之间产生统计显著差异。\n-   我们最关心研究方差的情况，是在试图得出超出单个超参数点之外的结论时。\n    -   研究方差取决于试验次数和搜索空间，我们既见过其大于试验方差的情形，也见过其远小于试验方差的情形。\n-   
因此，在采纳某个候选方案之前，建议对最佳试验重复运行N次，以评估其试验间的方差。\n    -   通常情况下，只有在流程发生重大变化时才需要重新评估试验方差；但在某些应用中，则可能需要更频繁地更新这一评估。\n    -   而在另一些应用中，评估试验方差的成本过高，不值得投入。\n-   归根结底，虽然我们只希望采纳那些确实能带来改进的变更（包括新的超参数配置），但一味追求绝对确定性也并非明智之举。\n-   因此，如果一个新的超参数点（或其他变更）相较于基线取得了更好的结果（在尽可能考虑到新点和基线各自的重训练方差之后），那么我们很可能应该将其作为未来比较的新基线。\n    -   不过，我们应当只采纳那些带来的改进能够超过其增加的复杂性的变更。\n\n### 探索阶段结束后\n\n***总结：*** *在我们完成对良好搜索空间的探索，并确定究竟需要调优哪些超参数之后，贝叶斯优化工具便成为极具吸引力的选择。*\n\n-   到某一时刻，我们的优先级将从进一步了解调优问题，转变为生成一个最佳配置用于部署或实际应用。\n-   此时，应已形成一个经过精炼的搜索空间，该空间能够合理地覆盖当前观测到的最佳试验点附近区域，并且已被充分采样。\n-   通过前期的探索工作，我们应当已经识别出最需要调优的超参数及其合理的取值范围，从而构建出最终自动化调优研究所需的搜索空间，并尽可能分配较大的调优预算。\n-   由于此时我们不再关注最大化对调优问题的理解，[准随机搜索的优势](#为什么在调优的探索阶段使用准随机搜索而非更复杂的黑盒优化算法)中的许多优势已不再适用，因此应采用贝叶斯优化工具来自动寻找最优的超参数配置。\n    -   [开源 Vizier](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fvizier) 实现了多种用于机器学习模型调优的先进算法，其中包括贝叶斯优化算法。\n    -   如果搜索空间中存在大量发散点（即训练损失为 NaN，或显著偏离均值的点），则必须使用能够妥善处理发散试验的黑盒优化工具（参见 [带有未知约束的贝叶斯优化](https:\u002F\u002Farxiv.org\u002Fabs\u002F1403.5607) 一文，其中介绍了一种优秀的解决方案）。[开源 Vizier](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fvizier) 支持通过将发散试验标记为不可行的方式来处理此类情况，但其具体实现方式可能并不完全符合 [Gelbart 等人](https:\u002F\u002Farxiv.org\u002Fabs\u002F1403.5607) 提出的优选方案，这取决于配置的具体设置。\n-   在此阶段，我们也应考虑评估模型在测试集上的性能。\n    -   原则上，甚至可以将验证集并入训练集，并基于贝叶斯优化找到的最佳配置重新训练模型。然而，这种做法仅适用于不会再次针对相同任务进行部署的情况（例如一次性 Kaggle 比赛）。\n\n## 确定每次训练的步数\n\n-   工作负载可分为两类：计算受限型和非计算受限型。\n-   当训练处于 **计算受限** 状态时，训练过程主要受制于我们愿意等待的时间，而非训练数据量或其他因素。\n    -   在这种情况下，若能以更长的时间或更高的效率进行训练，通常会观察到更低的训练损失；而在适当调优后，验证损失也会相应改善。\n    -   换言之，*加快* 训练速度等同于*提升* 训练效果，“最优”的训练时长始终是“在资源允许范围内尽可能长”。\n    -   尽管如此，即使工作负载受计算能力限制，延长或加速训练也并非提升效果的唯一途径。\n-   当训练 **非计算受限** 时，我们可以根据需求自由选择训练时长；但在某一点之后，继续延长训练时间不仅难以带来明显收益，还可能导致严重的过拟合问题。\n    -   在这种情况下，我们有望将训练损失降至极低水平，即便进一步延长训练时间，也只能小幅降低训练损失，而对验证损失的影响则微乎其微。\n    -   特别是在非计算受限的情况下，更为充裕的训练时间预算往往能使调优过程更加容易，尤其是在调整学习率衰减策略时，因为这类策略与训练预算之间存在强烈的交互作用。\n        -   换言之，如果训练时间预算非常有限，可能就需要将学习率衰减策略调整至近乎完美的程度，才能获得理想的误差率。\n-   不论工作负载是否属于计算受限类型，任何增加梯度方差（跨批次）的方法通常都会导致训练进度放缓，从而可能需要更多的训练步数才能达到特定的验证损失目标。造成梯度方差增大的常见原因包括：\n    -   使用较小的批量大小\n    -   应用数据增强技术\n    -   添加某些类型的正则化方法（如 Dropout）\n\n### 当训练并非计算受限时，如何确定训练时长\n\n- 我们的主要目标是确保模型经过足够长时间的训练以达到最佳效果，同时避免在训练步数上过度浪费。\n- 如有疑问，宁可多训练一些。只要正确使用事后（最优）检查点选择机制，并且检查点设置得足够频繁，延长训练时间通常不会导致性能下降。\n- 在超参数调优实验中，切勿调整 `max_train_steps` 参数。应选定一个固定值，并在整个实验中保持一致。然后根据这些实验的结果，绘制出事后检查点选择机制所找到的最佳训练步数，以此来优化 `max_train_steps` 的取值。\n  - 例如，如果最佳步数总是出现在前10%的训练阶段，那么最大训练步数就设定得过高了。\n  - 相反，如果最佳步数持续位于最后25%的训练阶段，则可能需要进一步延长训练时间，并重新调整学习率衰减策略。\n- 当模型架构或数据发生变化时（如引入数据增强），理想的训练步数也可能随之改变。\n- 下面我们将介绍如何基于使用固定学习率“完美拟合”训练集所需的步数，来初步确定 `max_train_steps` 的候选值。\n  - 需要注意的是，“完美拟合训练集”在这里并不是一个精确或数学上严格定义的概念，而仅仅是一个非正式的描述，用来表示训练损失非常低的状态。\n    - 例如，在使用对数损失函数进行训练时，如果没有正则化项，训练损失可能会持续缓慢下降，直到网络权重无限增长、模型对训练集的预测越来越自信，最终达到浮点数精度的极限。在这种情况下，我们可能会认为当训练集上的分类错误率为零时，模型就已经“完美拟合”了训练集。\n  - 如果训练过程中梯度噪声增加，我们可能需要提高 `max_train_steps` 的初始值。\n    - 比如，当模型引入了数据增强或诸如 Dropout 等正则化方法时。\n  - 另一方面，如果训练过程有所改进，也有可能减少 `max_train_steps`。\n    - 例如，通过更精细地调整优化器或学习率调度策略。\n\n#### 使用学习率扫描法确定 `max_train_steps` 初始候选值的算法\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n- 本流程假设不仅能够“完美拟合”训练集，而且可以使用固定的学习率调度来实现这一点。\n- 如果确实能够完美拟合整个训练集，则必然存在某个配置（即某个特定的 `max_train_steps` 值）能够做到这一点。找到任意这样的配置，并将其 `max_train_steps` 值作为起点 `N`。\n- 接着，在不使用数据增强和正则化的情况下，运行一次固定学习率扫描实验（即对学习率进行网格搜索），每一轮实验都训练 `N` 步。\n- 扫描实验中，速度最快的一轮达到完美训练表现所需的步数，即为我们对 `max_train_steps` 的初始猜测。\n- **注意：** 不恰当的搜索空间可能导致自我欺骗。\n  - 例如，如果所有学习率都过小，我们可能会错误地得出结论，认为需要一个非常大的 `max_train_steps` 值。\n  - 至少应检查实验中的最优学习率是否位于搜索空间的边界上。\n\n\u003C\u002Fdetails>\n\n### 在计算资源受限的训练中决定训练时长\n\n-   
在某些情况下，训练损失会持续改进，直到我们的耐心和计算资源成为限制因素。\n-   如果训练损失（甚至验证损失）持续改进，我们是否应该尽可能长时间地训练？不一定。\n    -   我们可以通过运行更多较短的实验来更有效地调参，并将最长的“生产级”训练留给那些我们希望上线的模型。\n    -   当每次试验的训练时间接近我们的耐心极限时，调参实验对潜在的上线候选模型变得更加重要，但我们可以完成的实验数量也会减少。\n    -   或许我们可以在只训练到生产长度约10%的情况下回答许多问题，但始终存在这样的风险：在这一时间限制下得出的结论可能并不适用于训练到20%或100%的情况。\n-   采用多轮调参、逐步增加每次试验的训练步数上限是一种合理的方法。\n    -   我们可以根据需要进行任意多轮调参，但通常1到3轮最为实用。\n    -   基本思路是利用周转时间极短的试验，尽可能多地理解问题，同时在调参的全面性与最终最长时间运行的相关性之间做出权衡。\n    -   一旦某个训练时长上限产生了有用的见解，就可以延长训练时间并继续调参，必要时再对之前短时间运行的结论进行复核。\n-   作为起点，我们建议进行两轮调参：\n    -   第一轮：通过较短的运行找到较好的模型和优化器超参数。\n    -   第二轮：在表现良好的超参数基础上，进行少量长时间运行，以获得最终模型。\n-   从第`i`轮过渡到第`i+1`轮时，最大的问题是如何调整学习率衰减策略。\n    -   在不同轮次之间调整学习率策略时，常见的陷阱是用过多的额外训练步数却使用过小的学习率。\n\n#### 第一轮\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n-   不幸的是，无法保证在短暂、不完整的训练中找到的优秀超参数，在显著延长训练时长后仍然适用。不过，对于某些类型的超参数，它们之间往往存在足够的相关性，因此第一轮调参仍然具有参考价值。\n-   那么，我们在较短运行中发现的哪些超参数值有望迁移到较长的训练中呢？这仍需进一步研究。但根据目前所知，作者按迁移可能性由高到低列出以下猜测：\n    -   很有可能迁移\n        -   初期训练中的不稳定现象可以通过较少的训练步数在第一轮调参中解决。这些超参数可能是我们目前认为最有可能成功迁移的选择。\n            -   热身步数\n            -   初始化方法\n    -   可能迁移\n        -   模型架构——如果在模型架构上取得显著提升，通常可以迁移，但也可能存在许多反例。\n    -   有可能迁移\n        -   优化算法\u002F优化器超参数——我们认为这些超参数可能会“松散地”迁移。其迁移的可能性显然弱于上述内容。\n        -   数据增强\n        -   正则化\n            -   如果无法完美拟合训练集，模型可能处于正则化效果有限的区间。\n    -   不太可能迁移\n        -   学习率调度策略：几乎不可能完全迁移。\n            -   [这篇论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.15556)指出，即使是衰减策略也可能迁移，但我们并不认同这一观点。例如，在少量训练步数上调整平方根衰减策略，然后扩展到大量步数时，大部分训练都将在过小的学习率下进行。\n            -   在极端的计算预算下，大多数学习率调度策略都能达到“足够好”的效果，但如果针对具体情况进行调优，则很可能带来明显的性能提升。\n            -   [关于随机元优化中短期视野偏差的理解](https:\u002F\u002Farxiv.org\u002Fabs\u002F1803.02021)描述了盲目选择学习率所带来的风险。\n\n\u003C\u002Fdetails>\n\n#### 第二轮\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n-   使用第一轮中表现最好的超参数配置进行训练。\n-   **（推测）** 🤖 利用额外的步数延长高学习率阶段的持续时间。\n    -   例如，如果是线性调度，可以保持第一轮中衰减阶段的长度不变，同时延长初始阶段的恒定学习率部分。\n    -   对于余弦衰减策略，只需沿用第一轮的基础学习率，并按照[Chinchilla论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.15556)的做法延长`max_train_steps`。\n-   对于拥有非常成熟建模和调参流程，且生产训练耗时又昂贵的团队来说，增加调参轮次或许有意义，但通常会显得过于复杂。\n    -   我们已经描述了如何从第一步过渡到第二步。如果完全不考虑分析时间，而首要关注点是高效利用计算资源，那么理想的做法是在多轮调参中呈指数级地增加训练时长（从而延长整个研究的端到端时间）。\n        -   在每一轮中，我们都系统地验证之前的决策是否仍然有效。\n        -   新的想法会经过一个逐步降低风险的流程，通过从第`i`步到第`i+1`步不断延长实验时长来实现。\n\n\u003C\u002Fdetails>\n\n## 训练流水线的其他指导建议\n\n### 优化输入流水线\n\n***摘要：*** 输入受限型流水线的原因及应对措施高度依赖具体任务；应使用性能分析工具，并留意常见问题。\n\n- 使用合适的性能分析工具来诊断输入受限型流水线。例如，对于 JAX 可使用 [Perfetto](https:\u002F\u002Fjax.readthedocs.io\u002Fen\u002Flatest\u002Fprofiling.html)，对于 TensorFlow 则可使用 [TensorFlow 性能分析器](https:\u002F\u002Fwww.tensorflow.org\u002Fguide\u002Fprofiler)。\n- 最终，具体原因和解决方案将高度依赖于任务特性。在某些情况下，出于更广泛的工程考量（如尽量减少磁盘占用），可能需要接受较低的输入流水线性能。\n- 常见原因：\n    - 数据未与训练过程部署在同一位置，导致 I\u002FO 延迟（例如通过网络读取训练数据时）。\n    - 在线数据预处理开销过大（建议改为离线一次性完成并保存）。\n    - 意外的同步屏障干扰了数据流水线的预取。例如，在 CommonLoopUtils 中同步设备端与主机端的指标时（[链接](https:\u002F\u002Fgithub.com\u002Fgoogle\u002FCommonLoopUtils\u002Fblob\u002Ffea2518ada8814a78e1492023fd9f00edb0b0568\u002Fclu\u002Fmetrics.py#L291)）。\n- 常用技巧：\n    - 对输入流水线进行插桩以实现示例预取（例如使用 [tf.data.Dataset.prefetch](https:\u002F\u002Fwww.tensorflow.org\u002Fguide\u002Fdata_performance#prefetching)）。\n    - 尽早从流水线中移除每个样本中未使用的特征或元数据。\n    - 增加用于生成输入流水线示例的工作进程数量。例如，可以使用 [tf.data service](https:\u002F\u002Fwww.tensorflow.org\u002Fapi_docs\u002Fpython\u002Ftf\u002Fdata\u002Fexperimental\u002Fservice)。\n\n### 评估模型性能\n\n***摘要：*** *在比训练时更大的批处理大小下运行评估。按固定的步数间隔而非固定的时间间隔进行评估。*\n\n#### 
评估设置\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n-   我们可以从多个角度来评估模型的性能。\n    -   **在线评估** - 在模型处于生产环境并提供预测服务时收集指标。\n    -   **离线评估** - 在模型对代表生产环境的离线训练\u002F验证\u002F测试数据集上运行时收集指标。\n    -   **定期评估** - 在模型训练过程中收集指标，这些指标可能是离线评估的代理，或者是在离线评估所用数据的一个子集上进行的。\n-   在线评估是黄金标准，但在模型开发阶段通常不切实际。\n-   根据具体问题，离线评估可能相当复杂且计算成本高昂。\n-   定期评估是最实用且经济的选择，但可能无法完全反映生产环境。\n    -   我们在定期评估中的目标是使用一种便捷的离线评估替代方案，同时不牺牲训练过程中信号的可靠性。\n\n\u003C\u002Fdetails>\n\n#### 设置定期评估\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n-   我们在训练过程中进行定期评估，以实时监控训练进度，以便\n    [便于事后选择最佳检查点](#保存检查点和事后选择最佳检查点)，并且能够在训练结束时\n    [检查训练曲线](#检查训练曲线)。\n-   最简单的配置是在同一台计算实例中同时进行训练和定期评估，定期在训练和评估之间切换。\n    -   在这种情况下，用于评估的批处理大小应*至少*与训练时使用的批处理大小相同，因为在评估过程中无需保留模型激活值，从而降低了每个样本的计算需求。\n-   定期评估应按照固定的步数间隔进行，而不是按时间间隔。\n    -   如果按时间间隔进行评估，可能会使训练曲线更难解释，尤其是在训练任务可能因抢占、网络延迟等问题而中断的情况下。\n-   当使用随机打乱的训练\u002F验证\u002F测试划分时，验证\u002F测试指标的周期性波动可能表明存在实现错误，例如测试数据与训练数据有重叠，或训练数据未正确打乱。按固定步数间隔进行评估可以更容易地发现这些问题。\n-   当评估数据集不能被批处理大小整除时，可能会出现部分批次。请确保对填充的示例进行正确的加权，以防止损失函数因此产生偏差。通常，可以将这些填充示例的权重设为零。\n-   每次评估时保存足够的信息，以支持离线分析。理想情况下，应保存部分独立样本的预测结果，因为它们对于调试非常有价值。\n    -   生成诸如[SavedModels](https:\u002F\u002Fwww.tensorflow.org\u002Fguide\u002Fsaved_model)之类的文件，可以方便在评估作业完成后进行临时性的模型检查。\n\n\u003C\u002Fdetails>\n\n#### 选择定期评估的样本\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n-   定期评估作业可能无法在合理时间内完成对整个离线评估数据集的指标计算。这通常需要对数据进行采样以供定期评估使用。\n-   构建采样数据集时，我们需考虑以下因素：\n    -   \u003Cins>样本大小\u003C\u002Fins>\n        -   确保定期评估任务所用的采样数据集上的性能与整个离线评估数据集上的性能一致，即采样数据与完整数据之间不存在偏差。\n        -   定期评估所用的数据集应足够小，以便能够轻松地对其全部样本生成模型预测；同时又应足够大，以准确衡量模型的改进（即不会被标签噪声所掩盖）。\n        -   数据集还应足够大，能够在多次试验中连续进行多次此类评估，并仍能产生准确的估计。也就是说，要避免随着时间推移逐渐“适应”验证集，从而导致模型在保留的测试集上表现不佳的情况。不过，这一顾虑在实践中很少成为问题。\n    -   \u003Cins>类别不平衡的数据集\u003C\u002Fins>\n        -   对于类别不平衡的数据集，稀有类别的性能往往较为不稳定。\n        -   对于某一类别样本数量较少的数据集，建议记录正确预测的样本数量，以便更深入地了解准确率的提升情况（例如，灵敏度提高了0.05听起来很令人振奋，但这是否仅仅是因为多预测对了一个样本？）。\n\n\u003C\u002Fdetails>\n\n### 保存检查点和事后选择最佳检查点\n\n***摘要：*** *按固定步数进行训练，并在训练结束后事后选择最佳检查点。*\n\n-   大多数深度学习框架都支持\n    [模型检查点功能](https:\u002F\u002Fflax.readthedocs.io\u002Fen\u002Flatest\u002Fapi_reference\u002Fflax.training.html)。\n    即定期将模型的当前状态保存到磁盘上。这样可以使训练作业在计算实例中断时仍能恢复。\n-   最佳检查点往往不是最后一个检查点，特别是在验证集性能并未持续提升，而是围绕某个值上下波动的情况下。\n-   应设置一个管道来跟踪训练过程中迄今为止看到的前N个最佳检查点。训练结束后，只需从这些检查点中选择最佳的一个即可。我们称这种方法为\n    **事后最优检查点选择**。\n-   通常不需要支持提前停止训练的功能，因为我们已经预先设定了每次试验的预算，并且会保存迄今为止最好的N个检查点。\n\n### 实验跟踪设置\n\n***摘要：*** *在跟踪不同实验时，务必记录一些关键信息，例如该研究中最佳检查点的性能，以及对该研究的简要描述。*\n\n-   我们发现，使用电子表格来记录实验结果，对于我们处理的各类建模问题非常有帮助。通常，表格会包含以下列：\n    -   研究名称\n    -   该研究配置文件的存储链接\n    -   备注或研究的简要说明\n    -   运行的试验次数\n    -   该研究中最佳检查点在验证集上的性能\n    -   具体的复现命令，或关于启动训练所需但尚未提交的修改的说明\n-   寻找一个能够至少记录上述信息、且对执行人员来说方便易用的跟踪系统。未被跟踪的实验，几乎可以视为不存在。\n\n### 批归一化实现细节\n\n***摘要：*** *如今，批归一化常常可以被层归一化替代；但在无法替代的情况下，调整批量大小或主机数量时，仍存在一些棘手的细节。*\n\n-   批归一化会基于当前批次内的均值和方差对激活值进行归一化，然而在多设备环境下，除非显式同步，否则每个设备上的统计量都会有所不同。\n-   根据一些经验性报告（主要来自 ImageNet 数据集），仅使用约 64 个样本计算这些归一化统计量，实际上在实践中效果更好（参见 [这篇论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.08741) 中提出的 Ghost Batch Norm）。\n-   将总批量大小与用于计算批归一化统计量的样本数解耦，对于比较不同批量大小的情况尤为有用。\n-   Ghost batch norm 的实现并不总是能正确处理每台设备的批量 > 虚拟批量的情况。在这种情况下，实际上需要在每台设备上对批次进行子采样，以获得足够数量用于计算批归一化统计量的样本。\n-   测试模式下批归一化的指数移动平均值只是训练统计量的线性组合，因此只需在将它们保存到检查点之前进行同步即可。然而，一些常见的批归一化实现并不会同步这些指数移动平均值，而只保存第一台设备上的 EMA。\n\n### 多主机流水线的注意事项\n\n***摘要：*** *在日志记录、评估、随机数生成器、检查点保存以及数据分片等方面，多主机训练很容易引入错误！*\n\n-   确保流水线仅在一个主机上进行日志记录和检查点保存。\n-   
在执行评估或检查点保存之前，务必确保各主机之间的批归一化统计量已同步。\n-   各主机之间必须保持一致的随机种子（用于模型初始化），同时又要有不同的随机种子（用于数据打乱\u002F预处理），因此请务必妥善标记这些种子。\n-   通常建议将数据文件跨主机分片，以提升性能。\n\n## 常见问题解答\n\n### 最佳的学习率衰减策略家族是什么？\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n-   这仍然是一个开放性问题。目前尚不清楚如何设计一套严谨的实验，以确信地回答“最佳”的学习率衰减策略究竟是什么。\n-   尽管我们不知道最佳的策略家族，但我们相信，采用某种非恒定的学习率衰减策略非常重要，且对其调优也确实有意义。\n-   不同的学习率在优化过程的不同阶段表现最佳。采用某种形式的衰减策略，能够提高模型找到合适学习率的可能性。\n\n\u003C\u002Fdetails>\n\n### 默认情况下应该使用哪种学习率衰减策略？\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\u003Cbr>\n\n-   我们倾向于选择线性衰减或余弦衰减，当然，其他许多类型的衰减策略也可能同样有效。\n\n\u003C\u002Fdetails>\n\n### 为什么有些论文会采用复杂的学习率衰减策略呢？\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\u003Cbr>\n\n-   论文中出现复杂的分段式学习率（LR）衰减策略并不少见。\n-   读者常常会好奇，作者是如何得出这样复杂的策略的。\n-   很多复杂的 LR 衰减策略，其实是通过根据验证集性能临时调整学习率得来的：\n    1.  首先以某种简单的 LR 衰减策略（或恒定学习率）开始一次训练。\n    2.  持续训练，直到性能似乎停滞不前。此时暂停训练，然后以更陡峭的 LR 衰减策略（或更低的恒定学习率）继续训练。重复这一过程，直到会议或发布截止日期到来。\n-   盲目照搬最终得到的*策略*通常并不是一个好主意，因为最佳的具体策略往往对其他超参数的选择非常敏感。\n    -   更好的做法是复制产生该策略的*算法*，不过当策略是由人为判断随意制定时，这往往难以实现。\n-   这种依赖于验证误差的策略，如果能够完全自动化，倒也无妨；但那些需要人工干预、并根据验证误差动态调整的策略，往往不够稳定且难以复现，因此我们建议尽量避免使用此类策略。\n    -   在发表使用了这类策略的结果之前，请尽量使其完全可复现。\n\n\u003C\u002Fdetails>\n\n### 应该如何调优 Adam 优化器的超参数？\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\u003Cbr>\n\n-   如上文所述，很难对搜索空间及其采样点的数量做出普遍性的结论。需要注意的是，Adam 优化器中的各个超参数重要性并不相同。以下是针对不同试验次数预算的参考规则：\n    -   如果一项研究中少于 10 次试验，只需调优基础学习率。\n    -   如果有 10 到 25 次试验，则需调优学习率和 $\\beta_1$。\n    -   如果超过 25 次试验，则应调优学习率、$\\beta_1$ 和 $\\epsilon$。\n    -   若试验次数远超 25 次，还可进一步调优 $\\beta_2$。\n\n\u003C\u002Fdetails>\n\n### 为什么在调参的探索阶段要使用拟随机搜索，而不是更复杂的黑箱优化算法？\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\n-   拟随机搜索（基于\n    [低差异序列](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLow-discrepancy_sequence)）\n    是我们的首选，相较于更为复杂的黑箱优化工具，尤其是在作为迭代调参流程的一部分、旨在深入理解调参问题（即我们所说的“探索阶段”）时。贝叶斯优化及类似工具则更适合用于利用阶段。\n-   基于随机平移的低差异序列的拟随机搜索可以被视为“抖动、打乱后的网格搜索”，因为它以均匀但随机的方式遍历给定的搜索空间，并且比随机搜索更能分散搜索点。\n-   拟随机搜索相较于更复杂的黑箱优化工具（如贝叶斯优化、进化算法等）的优势包括：\n    1.  由于是非适应性的采样方式，可以在事后分析中更改调参目标，而无需重新运行实验。\n        -   例如，我们通常希望找到在训练过程中任意时刻验证误差最低的试验。然而，拟随机搜索的非适应性特性使得我们能够在不重新运行任何实验的情况下，根据最终验证误差、训练误差或其他评估指标来确定最佳试验。\n    2.  拟随机搜索的行为具有一致性和统计上的可重复性。\n        -   即便搜索算法的实现发生变化，只要其保持相同的均匀性特征，就应当能够复现六个月前的研究结果。而如果使用复杂的贝叶斯优化软件，不同版本之间实现可能会发生重大变化，从而大大增加复现旧有搜索结果的难度。此外，有时也无法回滚到旧版本的实现（例如当优化工具以服务形式运行时）。\n    3.  它对搜索空间的均匀探索使得我们更容易对结果进行推理，并推断这些结果可能揭示的关于搜索空间的信息。\n        -   例如，如果拟随机搜索遍历过程中找到的最佳点位于搜索空间的边界上，这通常是一个很好的信号（尽管并非绝对可靠），表明可能需要调整搜索空间的边界。[本节](#identifying-bad-search-space-boundaries)将对此进行更详细的讨论。然而，自适应的黑箱优化算法可能会因为早期几次不太理想的试验结果而忽略搜索空间的中部区域，即使那里同样存在表现优异的点。这是因为优秀的优化算法往往需要利用这种非均匀性来加速搜索过程。\n    4.  使用拟随机搜索（或其他非适应性搜索算法）时，无论并行还是串行运行不同数量的试验，其统计结果都不会有显著差异；而对于适应性算法则不然。\n    5.  更复杂的搜索算法并不总是能正确处理不可行的点，尤其是那些并非专门为神经网络超参数调优设计的算法。\n    6.  
拟随机搜索简单易用，在需要大量试验并行运行的情况下尤为有效。\n        -   根据经验[^3]，自适应算法很难在预算仅为拟随机搜索两倍的情况下超越后者，特别是在需要大量试验并行执行的情形下（此时几乎无法利用先前试验的结果来指导新试验的开展）。\n        -   如果缺乏贝叶斯优化及其他高级黑箱优化方法的专业知识，我们可能无法充分发挥这些方法理论上所能带来的优势。在真实的深度学习调参场景中，很难对高级黑箱优化算法进行基准测试。这类方法目前仍是研究热点，而越复杂的算法对于缺乏经验的用户来说往往伴随着更多的陷阱。虽然相关领域的专家能够取得不错的效果，但在高并行度条件下，搜索空间和预算的影响往往会更加显著。\n-   当然，如果计算资源仅允许少量试验并行运行，而我们可以负担得起顺序执行大量试验的话，尽管贝叶斯优化会使调参结果更难解释，但它会变得更具吸引力。\n\n[^3]: Ben Recht 和 Kevin Jamieson\n    [指出](http:\u002F\u002Fwww.argmin.net\u002F2016\u002F06\u002F20\u002Fhypertuning\u002F)，预算为两倍的随机搜索作为基线非常强大（[Hyperband 论文](https:\u002F\u002Fjmlr.org\u002Fpapers\u002Fvolume18\u002F16-558\u002F16-558.pdf)也提出了类似观点），不过确实存在一些搜索空间和问题，其中最先进的贝叶斯优化技术能够轻松击败预算为两倍的随机搜索。然而，根据我们的经验，在高并行度环境下，要超越预算为两倍的随机搜索则要困难得多，因为贝叶斯优化在这种情况下几乎没有机会观察到先前试验的结果。\n\n\u003C\u002Fdetails>\n\n### 在哪里可以找到拟随机搜索的实现？\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\u003Cbr>\n\n-   [开源 Vizier](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fvizier) 提供了[拟随机搜索的实现](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fvizier\u002Fblob\u002Fmain\u002Fvizier\u002F_src\u002Falgorithms\u002Fdesigners\u002Fquasi_random.py)。只需在[使用示例](https:\u002F\u002Foss-vizier.readthedocs.io\u002Fen\u002Flatest\u002Fguides\u002Fuser\u002Frunning_vizier.html)中设置 `algorithm=\"QUASI_RANDOM_SEARCH\"` 即可。\n-   另一种实现可以在[这里](https:\u002F\u002Fgithub.com\u002Fmlcommons\u002Falgorithmic-efficiency\u002Fblob\u002Fmain\u002Falgorithmic_efficiency\u002Fhalton.py)找到。\n-   上述两种实现都会为给定的搜索空间生成 Halton 序列（旨在实现 https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.03200 中推荐的平移、打乱后的 Halton 序列）。\n-   如果没有基于低差异序列的拟随机搜索算法可用，也可以用伪随机均匀搜索代替，尽管效率可能会略低。\n    -   在 1–2 维情况下，网格搜索也是可以接受的，但在更高维情况下则不建议使用（参见\n        [Bergstra & Bengio, 2012](https:\u002F\u002Fwww.jmlr.org\u002Fpapers\u002Fv13\u002Fbergstra12a.html))。\n\n\u003C\u002Fdetails>\n\n### 进行拟随机搜索需要多少次试验才能获得良好结果？\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\u003Cbr>\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Fgoogle-research\u002Ftuning_playbook\u002Fmain\u002Fassets\u002Fhave_we_sampled_enough.png\" width=\"49%\" alt=\"箱线图展示了充分采样的重要性\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cb>图3：\u003C\u002Fb>在ImageNet数据集上使用ResNet-50模型进行了100次试验的调优。通过自助法模拟了不同规模的调优预算。上方绘制了对应不同试验次数的最佳性能的箱线图。\n\n-   一般情况下无法给出一个通用的答案，但我们可以通过具体例子来分析。\n-   如图3所示，试验次数会对结果产生显著影响。\n    -   注意当只采样6次试验时，四分位距非常大；而当采样20次时，四分位距明显缩小。\n    -   即便进行了20次试验，运气特别好或特别差的几次试验之间的差异，仍可能大于在固定超参数下、使用不同随机种子对同一模型进行多次重新训练时的典型波动。对于该任务而言，这种波动可能表现为验证误差率约23%时上下浮动0.1%左右。\n\n\u003C\u002Fdetails>\n\n### 如何调试并缓解优化失败问题？\n\n\u003Cdetails>\u003Csummary>\u003Cem>[点击展开]\u003C\u002Fem>\u003C\u002Fsummary>\n\u003Cbr>\n\n\n***摘要：*** *如果模型在优化过程中遇到困难，在尝试其他方法之前，务必先解决这些问题。诊断和修复训练失败是当前研究的热点领域。*\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_tuning_playbook_readme_2d62ecfebc75.png\" width=\"80%\" alt=\"更改WideResnet中单个残差块的步幅会导致训练不稳定。\">\n\u003C\u002Fp>\n\n\n\u003Cp align=\"center\">\u003Cb>图4：\u003C\u002Fb>更改WideResnet中单个残差块的步幅（从2×2改为1×1）会导致训练不稳定。在低学习率下，这并不会降低性能，但在高学习率下，由于不稳定现象，模型将无法正常训练。通过应用1000步的学习率预热，可以解决这一特定的不稳定问题，从而允许在最大学习率为0.1的情况下稳定训练。\u003C\u002Fp>\n\n#### 识别不稳定的训练工作负载\n\n-   如果学习率过大，任何工作负载都可能变得不稳定。只有当不稳定迫使我们使用过小的学习率时，这个问题才会真正成为瓶颈。\n-   至少有两种类型的训练不稳定值得区分：\n    1.  初始化阶段或训练初期的不稳定。\n    2.  训练中期突然出现的不稳定。\n-   我们可以采用系统化的步骤来识别工作负载中的稳定性问题：\n    1.  进行学习率扫描，找到最佳学习率lr*。\n    2.  绘制略高于lr*的学习率下的训练损失曲线。\n    3.  
如果这些高于lr*的学习率显示出损失不稳定（即在训练过程中损失不是下降而是上升），那么很可能通过修复这种不稳定现象能够提升训练效果。\n-   在训练过程中记录完整损失梯度的L2范数，异常值可能导致训练中途出现虚假的不稳定现象。这可以帮助我们决定如何设置梯度或更新的裁剪阈值。\n\n**注意：** 有些模型会在早期表现出明显的不稳定，随后又自行恢复，最终以缓慢但稳定的节奏完成训练。**然而，常见的评估频率可能不足以捕捉到这些问题！**\n\n为了检查这种情况，我们可以使用“当前最佳学习率的两倍”进行一次简短的训练，仅运行约500步，但每一步都进行评估。\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_tuning_playbook_readme_0ec4845b9f33.png\" width=\"80%\" alt=\"说明在训练初期更频繁评估的重要性。\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cb>图5：\u003C\u002Fb>说明在训练初期更频繁评估的重要性。如果怀疑模型存在早期训练不稳定，这种方法尤为有用。\u003C\u002Fp>\n\n#### 常见不稳定模式的潜在解决方案\n\n-   应用学习率预热\n    -   最适合解决训练初期的不稳定问题。\n-   应用梯度裁剪\n    -   对于训练初期和中期的不稳定都有效，有时甚至能解决预热无法处理的不良初始化问题。\n-   尝试更换优化器\n    -   有时Adam优化器能够处理Momentum优化器无法应对的不稳定情况。这仍然是一个活跃的研究方向。\n-   确保我们的模型架构采用了最佳实践和初始化方法（示例如下）。\n    -   如果模型尚未包含残差连接和归一化层，应予以添加。\n-   归一化层应置于残差结构内部，例如x + f(Norm(x))。\n    -   而Norm(x + f(x))则已知容易引发问题。\n-   尝试将残差分支初始化为0（例如[ReZero初始化](https:\u002F\u002Farxiv.org\u002Fabs\u002F2003.04887)）。\n-   降低学习率\n    -   这是最后的手段。\n\n#### 学习率预热\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_tuning_playbook_readme_b597576e155f.png\" width=\"80%\" alt=\"预热期间出现不稳定现象的例子（注意横轴为对数尺度）。\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cb>图6：\u003C\u002Fb>预热期间出现不稳定现象的例子（注意横轴为对数尺度）。在此案例中，成功训练需要4万步的预热过程。\u003C\u002Fp>\n\n##### 何时应用学习率预热\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_tuning_playbook_readme_903b5b57a5a2.png\" width=\"49%\" alt=\"带有不稳定现象的模型的超参数轴图\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cb>图7a：\u003C\u002Fb>展示了一种出现训练不稳定现象的模型的超参数轴图。最佳学习率位于可行范围的边缘。所谓“不可行”的试验是指那些产生NaN值或损失值异常高的试验。\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_tuning_playbook_readme_b8f315d09bbb.png\" width=\"49%\" alt=\"带有不稳定现象的模型的损失曲线\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cb>图7b：\u003C\u002Fb>使用导致不稳定现象的学习率训练的模型的训练损失曲线。\u003C\u002Fp>\n\n-   图7a显示了一个超参数轴图，表明该模型存在优化方面的不稳定问题，因为最佳学习率正好处于不稳定区域的边缘。\n-   图7b则通过观察使用比峰值学习率高5倍或10倍的学习率训练的模型的训练损失曲线，进一步证实了这一点。如果该曲线在平稳下降后突然出现急剧上升（如上图中约第10,000步处），则可以确定该模型确实存在优化方面的不稳定问题。\n\n##### 如何应用学习率预热\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_tuning_playbook_readme_a7fd3b70ea1f.png\" width=\"80%\" alt=\"预热对训练稳定性有益的效果\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cb>图8：\u003C\u002Fb>学习率预热在解决训练不稳定问题上的有益效果。\u003C\u002Fp>\n\n-   借助上一节的内容，我们假设实践者已经确定了模型开始变得不稳定的那一个学习率。这个值就是 `unstable_base_learning_rate`。\n-   预热是指在训练开始前添加一个学习率调度策略，将学习率从0逐步提升到某个稳定的 `base_learning_rate`，而这个 `base_learning_rate` 至少要比 `unstable_base_learning_rate` 大一个数量级。默认情况下，可以尝试将 `base_learning_rate` 设置为 `unstable_base_learning_rate` 的10倍。不过需要注意的是，也可以针对例如 `unstable_base_learning_rate` 的100倍再次执行整个流程。具体的学习率调度如下：\n    -   在 `warmup_steps` 步骤内，将学习率从0线性提升至 `base_learning_rate`。\n    -   然后以恒定的学习率继续训练 `post_warmup_steps` 步骤。\n-   我们的目的是找到最短的 `warmup_steps` 数量，使得我们能够使用远高于 `unstable_base_learning_rate` 的峰值学习率。\n-   因此，对于每一个 `base_learning_rate`，都需要调整 `warmup_steps` 和 `post_warmup_steps`。通常可以将 `post_warmup_steps` 设置为 `2*warmup_steps`。\n-   预热可以独立于现有的学习率衰减计划进行调优。`warmup_steps` 应该在几个不同的数量级范围内进行扫描。例如，可以尝试 [10, 10\u003Csup>3\u003C\u002Fsup>, 10\u003Csup>4\u003C\u002Fsup>, 10\u003Csup>5\u003C\u002Fsup>]。最大可行的步数不应超过 `max_train_steps` 的10%。\n-   一旦确定了在 `base_learning_rate` 下不会导致训练崩溃的 
\n\n#### Gradient clipping\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_tuning_playbook_readme_6b07e7c74517.png\" width=\"80%\" alt=\"Gradient clipping on early training instabilities\">\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\u003Cb>Figure 9:\u003C\u002Fb> Illustration of gradient clipping correcting early training instability.\u003C\u002Fp>\n\n-   Gradient clipping is most useful when large or outlier gradients occur.\n-   Clipping can fix either early training instability (large gradient norms early in training) or mid-training instabilities (sudden gradient spikes mid-training).\n-   Sometimes longer warmup periods can correct instabilities that clipping does not: see [the section above](#how-to-apply-learning-rate-warmup).\n    -   🤖 What about clipping during warmup?\n-   The ideal clip threshold is just above the \"typical\" gradient norm.\n-   Here is an example of how gradient clipping can be done (a code sketch follows this list):\n    -   If the norm of the gradient $\\left | g \\right |$ exceeds the gradient clipping threshold $\\lambda$, then do ${g}'= \\lambda \\times \\frac{g}{\\left | g \\right |}$, where ${g}'$ is the new gradient.\n-   Log the unclipped gradient norm during training. By default, generate:\n    -   A plot of gradient norm vs. training step\n    -   A histogram of gradient norms aggregated over all steps\n-   Choose a gradient clipping threshold based on the 90th percentile of gradient norms.\n    -   The threshold is workload dependent, but 90% is a good starting point. If it does not work, this threshold can be tuned.\n    -   🤖 What about some sort of adaptive strategy?\n-   If we try gradient clipping and the instability issues remain, we can try it harder (i.e. make the threshold smaller).\n-   Extremely aggressive gradient clipping is in essence a strange way of reducing the learning rate; if we find ourselves clipping that aggressively to resolve instability, we probably should just cut the learning rate instead.\n-   We would consider \"extremely aggressive\" clipping to be when over 50% of the updates are clipped in some way.
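\n\nBelow is a minimal numpy sketch of the clipping rule and the percentile-based threshold choice described above (our own illustration; the logged norms are placeholder data):\n\n```python\n# Sketch: clip-by-norm plus percentile-based threshold selection.\nimport numpy as np\n\ndef clip_gradient(g, threshold):\n    \"\"\"If |g| exceeds `threshold`, rescale g to have norm `threshold`.\"\"\"\n    norm = np.linalg.norm(g)\n    if norm > threshold:\n        return threshold * g \u002F norm\n    return g\n\n# Pick the threshold from the distribution of unclipped gradient norms\n# logged during training (placeholder values here).\nlogged_grad_norms = np.array([0.8, 1.1, 0.9, 1.0, 7.5, 1.2, 0.95])\nthreshold = np.percentile(logged_grad_norms, 90)\n\ng = np.array([3.0, 4.0])  # example gradient with norm 5.0\nprint(clip_gradient(g, threshold))  # rescaled, since 5.0 exceeds the threshold\n```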
\n\n\u003C\u002Fdetails>\n\n\n\n### Why do you call the learning rate and other optimization parameters hyperparameters? They are not parameters of any prior distribution.\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\u003Cbr>\n\n-   It is true that the term \"hyperparameter\" has a precise [meaning](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHyperparameter) in Bayesian machine learning, and referring to the learning rate and most of the other parameters we tune in deep learning as \"hyperparameters\" is an abuse of terminology.\n-   We would prefer the term \"metaparameter\" for learning rates, architectural parameters, and all the other things we tune in deep learning, since it avoids the confusion that comes from misusing the word \"hyperparameter\" (confusion that is especially likely when discussing Bayesian optimization, where the probabilistic response surface models have their own true hyperparameters).\n-   Unfortunately, although potentially confusing, the term \"hyperparameter\" has become extremely common in the deep learning community.\n-   Therefore, for a document such as this one, intended for a wide audience that includes many people unlikely to be aware of this technicality, we chose to go along with the prevailing usage rather than introduce yet another potential source of confusion.\n-   That said, we might make a different choice when publishing a research paper, and we encourage others to use \"metaparameter\" in most contexts.\n\n\u003C\u002Fdetails>\n\n### Why shouldn't the batch size be tuned to directly improve validation set performance?\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\u003Cbr>\n\n-   Changing the batch size **without changing any other details of the training pipeline** will often affect validation set performance.\n-   However, the difference in validation performance between two batch sizes typically goes away if the training pipeline is optimized independently for each batch size.\n-   The hyperparameters that interact most strongly with the batch size, and are therefore most important to tune separately for each batch size, are the optimizer hyperparameters (e.g. learning rate, momentum) and the regularization hyperparameters.\n    - Smaller batch sizes introduce more noise into the training algorithm due to sample variance, and this noise can have a regularizing effect. Larger batch sizes are therefore more prone to overfitting and may require stronger regularization and\u002For additional regularization techniques.\n-   In addition, [the number of training steps may need to be adjusted](#choosing-the-batch-size-to-minimize-training-time) when changing the batch size.\n-   Once all these effects are taken into account, there is currently no convincing evidence that the batch size affects the maximum achievable validation performance (see [Shallue et al., 2018](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.03600)).\n\n\u003C\u002Fdetails>\n\n### What are the update rules for all the popular optimization algorithms?\n\n\u003Cdetails>\u003Csummary>\u003Cem>[Click to expand]\u003C\u002Fem>\u003C\u002Fsummary>\n\n\u003Cbr>\n\n#### Stochastic gradient descent (SGD)\n\n$$\\theta_{t+1} = \\theta_{t} - \\eta_t \\nabla \\mathcal{l}(\\theta_t)$$\n\n#### Momentum\n\n$$v_0 = 0$$\n\n$$v_{t+1} = \\gamma v_{t} + \\nabla \\mathcal{l}(\\theta_t)$$\n\n$$\\theta_{t+1} = \\theta_{t} - \\eta_t v_{t+1}$$\n\n#### Nesterov momentum\n\n$$v_0 = 0$$\n\n$$v_{t+1} = \\gamma v_{t} + \\nabla \\mathcal{l}(\\theta_t)$$\n\n$$\\theta_{t+1} = \\theta_{t} - \\eta_t( \\gamma v_{t+1} + \\nabla \\mathcal{l}(\\theta_{t}))$$\n\n#### RMSProp\n\n$$v_0 = 1 \\text{,} m_0 = 0$$\n\n$$v_{t+1} = \\rho v_{t} + (1 - \\rho) \\nabla \\mathcal{l}(\\theta_t)^2$$\n\n$$m_{t+1} = \\gamma m_{t} + \\frac{\\eta_t}{\\sqrt{v_{t+1} + \\epsilon}}\\nabla \\mathcal{l}(\\theta_t)$$\n\n$$\\theta_{t+1} = \\theta_{t} - m_{t+1}$$\n\n#### ADAM\n\n$$m_0 = 0 \\text{,} v_0 = 0$$\n\n$$m_{t+1} = \\beta_1 m_{t} + (1 - \\beta_1) \\nabla \\mathcal{l} (\\theta_t)$$\n\n$$v_{t+1} = \\beta_2 v_{t} + (1 - \\beta_2) \\nabla \\mathcal{l}(\\theta_t)^2$$\n\n$$b_{t+1} = \\frac{\\sqrt{1 - \\beta_2^{t+1}}}{1 - \\beta_1^{t+1}}$$\n\n$$\\theta_{t+1} = \\theta_{t} - \\alpha_t \\frac{m_{t+1}}{\\sqrt{v_{t+1}} + \\epsilon} b_{t+1}$$\n\n#### NADAM\n\n$$m_0 = 0 \\text{,} v_0 = 0$$\n\n$$m_{t+1} = \\beta_1 m_{t} + (1 - \\beta_1) \\nabla \\mathcal{l} (\\theta_t)$$\n\n$$v_{t+1} = \\beta_2 v_{t} + (1 - \\beta_2) \\nabla \\mathcal{l} (\\theta_t)^2$$\n\n$$b_{t+1} = \\frac{\\sqrt{1 - \\beta_2^{t+1}}}{1 - \\beta_1^{t+1}}$$\n\n$$\\theta_{t+1} = \\theta_{t} - \\alpha_t \\frac{\\beta_1 m_{t+1} + (1 - \\beta_1) \\nabla \\mathcal{l} (\\theta_t)}{\\sqrt{v_{t+1}} + \\epsilon} b_{t+1}$$
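\n\nTo make the ADAM equations above concrete, the following numpy sketch applies them exactly as written to a toy quadratic loss (the loss and all constants are illustrative placeholders, not part of the playbook):\n\n```python\n# Sketch: the ADAM update rule from above, applied to a toy problem.\nimport numpy as np\n\ndef grad_loss(theta):\n    return 2.0 * (theta - 1.0)  # gradient of the placeholder loss |theta - 1|^2\n\nalpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8\ntheta = np.array([0.0, 5.0])\nm = np.zeros_like(theta)  # m_0 = 0\nv = np.zeros_like(theta)  # v_0 = 0\n\nfor t in range(1000):\n    g = grad_loss(theta)\n    m = beta1 * m + (1 - beta1) * g      # m_{t+1}\n    v = beta2 * v + (1 - beta2) * g**2   # v_{t+1}\n    b = np.sqrt(1 - beta2**(t + 1)) \u002F (1 - beta1**(t + 1))  # bias correction\n    theta = theta - alpha * m \u002F (np.sqrt(v) + eps) * b      # theta_{t+1}\n\nprint(theta)  # approaches [1.0, 1.0]\n```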
\n\n\u003C\u002Fdetails>\n\n## Acknowledgments\n\n-   We owe a debt of gratitude to Max Bileschi, Roy Frostig, Zelda Mariet, Stan Bileschi, Mohammad Norouzi, Chris DuBois and Charles Sutton for reading the manuscript and providing valuable feedback.\n-   We reused some experimental data for several plots that were originally produced by Naman Agarwal for other joint research.\n-   We would like to thank Will Chen for invaluable advice on the presentation of the document.\n-   We also thank Rohan Anil for useful discussions.\n\n## Citing\n\n```\n@misc{tuningplaybookgithub,\n  author = {Varun Godbole and George E. Dahl and Justin Gilmer and Christopher J. Shallue and Zachary Nado},\n  title = {Deep Learning Tuning Playbook},\n  url = {http:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ftuning_playbook},\n  year = {2023},\n  note = {Version 1.0}\n}\n```\n\n## Contributing\n\n-   This is not an officially supported Google product.\n\n-   We'd love to hear your feedback!\n\n    -   If you like the playbook, please [leave a star](https:\u002F\u002Fdocs.github.com\u002Fen\u002Fget-started\u002Fexploring-projects-on-github\u002Fsaving-repositories-with-stars#starring-a-repository)! Or email deep-learning-tuning-playbook \\[at\\] googlegroups.com. Testimonials help us justify creating more resources like this.\n    -   If anything seems incorrect, please file an issue to start a discussion. For questions or other messages where an issue is not appropriate, please open a new discussion topic on GitHub.\n\n-   As discussed in the preamble, this is a living document. We anticipate making periodic improvements, both small and large. If you would like to be notified of updates, please watch our repository (see the [GitHub docs](https:\u002F\u002Fdocs.github.com\u002Fen\u002Faccount-and-profile\u002Fmanaging-subscriptions-and-notifications-on-github\u002Fsetting-up-notifications\u002Fconfiguring-notifications#configuring-your-watch-settings-for-an-individual-repository)).\n\n-   Please do not file a pull request without first coordinating with the authors via the issue tracking system.\n\n### Contributor License Agreement\n\nContributions to this project must be accompanied by a Contributor License Agreement (CLA). You (or your employer) retain the copyright to your contribution; this simply gives us permission to use and redistribute your contributions as part of the project. Head over to \u003Chttps:\u002F\u002Fcla.developers.google.com\u002F> to see your current agreements on file or to sign a new one.\n\nYou generally only need to submit a CLA once, so if you have already submitted one (even if it was for a different project), you probably do not need to do it again.\n\n### Code Reviews\n\nAll submissions, including submissions by project members, require review. We use GitHub pull requests for this purpose. Consult [GitHub Help](https:\u002F\u002Fhelp.github.com\u002Farticles\u002Fabout-pull-requests\u002F) for more information on using pull requests.\n\n### Community Guidelines\n\nThis project follows [Google's Open Source Community Guidelines](https:\u002F\u002Fopensource.google\u002Fconduct\u002F).","# Deep Learning Tuning Playbook Quickstart Guide\n\nThe `Deep Learning Tuning Playbook` is not an installable package or library but a **best-practices guide** written by the Google Research team. It teaches engineers and researchers how to tune the hyperparameters of deep learning models scientifically in order to maximize model performance.\n\nThis guide shows how to obtain the document locally and outlines its core workflow so you can start applying the methodology quickly.\n\n## Prerequisites\n\nSince the project is essentially a document, there are no special system requirements. You only need:\n\n*   **Operating system**: Windows, macOS, or Linux.\n*   **Dependencies**:\n    *   **Git**: to clone the repository.\n    *   **A Markdown reader** (optional): VS Code, Typora, or the GitHub web UI all work for reading the `.md` files.\n    *   **A deep learning framework**: the document itself depends on no framework, but applying its tuning strategies assumes familiarity with a mainstream framework such as TensorFlow, PyTorch, or JAX.\n\n## Installation (fetching the document)\n\nClone the project with Git so you can read it offline and track updates.\n\n### 1. Clone the repository\nOpen a terminal or command prompt and run:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ftuning_playbook.git\n```\n\n> **Note on slow GitHub access**:\n> If GitHub is slow from your network, you can try a mirror (such as a Gitee mirror, if one exists) or a proxy. Without a proxy, you can try the following command (network permitting):\n> ```bash\n> git clone https:\u002F\u002Fgitee.com\u002Fmirrors\u002Ftuning_playbook.git\n> ```\n> *(Note: if the Gitee URL above is invalid, configure a local Git proxy or fall back to the official GitHub URL.)*\n\n### 2. Enter the directory\n```bash\ncd tuning_playbook\n```\n\n### 3. Read the document\nOpen `README.md` (the main document) from a file manager, or view it in any tool with Markdown preview support.\n\n## Basic usage (the core workflow)\n\nUsing this \"tool\" means **following the scientific tuning process the document defines**. Below is a simplified walkthrough based on the document, suitable for starting a new deep learning project:\n\n### Step 1: Choosing the initial configuration\n\nBefore training starts, fix the following basics as the document suggests:\n\n1.  **Choose the model architecture**:\n    *   **Strategy**: do not reinvent the wheel. Pick a well-established architecture already proven on the task (e.g. ResNet, Transformer).\n    *   **Action**: reproduce the model from the paper closest to your task and use it as the baseline.\n\n2.  **Choose the optimizer**:\n    *   **Strategy**: use the most popular optimizer for the type of problem at hand.\n    *   **Recommendations**:\n        *   General: `SGD with momentum` (the Nesterov variant is recommended).\n        *   More complex settings: `Adam` or `NAdam`.\n    *   **Note**: if you choose Adam, be prepared to tune all 4 of its hyperparameters (learning rate, $\\beta_1$, $\\beta_2$, $\\epsilon$).\n\n3.  **Choose the batch size**:\n    *   **Strategy**: the batch size mainly governs training speed; it should not be used to tune validation performance directly.\n    *   **Action**: set it to the **largest value** your hardware (GPU\u002FTPU) supports, to shorten the experiment iteration cycle.\n\n### Step 2: The incremental tuning strategy\n\nRather than searching blindly, follow a scientific explore-and-exploit loop:\n\n1.  **Set a goal**: identify the specific bottleneck the next round of experiments should address (e.g. lower training loss, higher validation accuracy, less overfitting).\n2.  **Design the experiments**:\n    *   Change only **one** key hyperparameter or component at a time.\n    *   Distinguish the \"nuisance\" hyperparameters (to be held fixed) from the hyperparameters being tuned.\n3.  **Run the training jobs**:\n    *   Set the number of training steps according to your compute constraints (train to the budget when compute-bound; train to convergence when not).\n4.  **Evaluate and decide**:\n    *   Compare the new configuration against the baseline.\n    *   If performance clearly improves, adopt the change as the new baseline (exploitation).\n    *   If the results are inconclusive, run more exploratory experiments (exploration).\n\n### Step 3: Rounding out the training pipeline\n\nWhile tuning, make sure these engineering details are in place:\n\n*   **Input pipeline optimization**: make sure data loading never becomes the bottleneck for GPU training.\n*   **Checkpoint management**: save checkpoints regularly and, at the end of a run, retrospectively select the checkpoint with the best validation performance rather than simply keeping the last epoch (a minimal sketch follows this list).\n*   **Experiment tracking**: set up an experiment-tracking system (e.g. TensorBoard, WandB) and record every hyperparameter change together with the resulting metrics.
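\n\nA minimal sketch of the retrospective checkpoint selection mentioned above (everything here is an illustrative placeholder; the toy accuracy curve stands in for real validation evaluations):\n\n```python\n# Sketch: retrospective (best-anywhere) checkpoint selection.\ndef evaluate_validation(step):\n    \"\"\"Placeholder for evaluating a saved checkpoint on the validation set.\"\"\"\n    return 0.9 - ((step - 6_000) \u002F 10_000) ** 2  # toy curve peaking mid-run\n\nhistory = []  # (step, validation_accuracy, checkpoint_path) records\nfor step in range(0, 10_000, 500):  # evaluate every 500 steps\n    ckpt_path = f\"ckpt_step_{step}.ckpt\"\n    history.append((step, evaluate_validation(step), ckpt_path))\n\n# At the end of the run, keep the best checkpoint seen anywhere in\n# training, not just the final one.\nbest_step, best_acc, best_ckpt = max(history, key=lambda rec: rec[1])\nprint(f\"best checkpoint: {best_ckpt} (step {best_step}, val_acc {best_acc:.4f})\")\n```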
\n\n---\n\n**Example: pseudocode for the structure of the tuning loop**\n\nAlthough this repository ships no code, your training script should embody the following logic (the three helper functions are placeholders you would supply yourself):\n\n```python\n# Sketch: an experiment loop following the playbook.\n# `run_training_and_evaluate`, `save_checkpoint` and `log_experiment` are\n# placeholders for your own training and tracking code.\n\nMAX_HARDWARE_BATCH_SIZE = 512  # largest batch size that fits on the hardware\n\n# 1. Initialize the baseline config (proven architecture, popular optimizer)\nconfig = {\n    \"model_arch\": \"ResNet50\",\n    \"optimizer\": \"SGD_with_momentum\",\n    \"batch_size\": MAX_HARDWARE_BATCH_SIZE,\n    \"learning_rate\": 0.1,  # initial value from experience\n}\nbest_recorded_accuracy = 0.0\n\n# 2. Incremental tuning loop: change only ONE variable per round\n#    (here, the learning rate).\nfor candidate_lr in (0.1, 0.03, 0.01, 0.003):\n    next_config = dict(config, learning_rate=candidate_lr)\n\n    # Run training and evaluate on the validation set\n    metrics = run_training_and_evaluate(next_config)\n\n    # 3. Decide whether to adopt the new configuration\n    if metrics[\"validation_accuracy\"] > best_recorded_accuracy:\n        config = next_config  # adopt (exploitation)\n        best_recorded_accuracy = metrics[\"validation_accuracy\"]\n        save_checkpoint(config)\n    else:\n        # Record the failure and try a different direction (exploration)\n        log_experiment(next_config, metrics, status=\"rejected\")\n```\n\nBy following this process rigorously, you can cut down on blind guessing and systematically improve the performance of your deep learning models.","An e-commerce recommendation team is working to optimize a deep click-through-rate (CTR) prediction model, aiming to raise accuracy by 0.5% within a limited compute budget in order to meet the launch bar.\n\n### Without tuning_playbook\n- Tuning relied on engineers' \"intuition\" and scattered experience, with no systematic experiment design, so large amounts of compute were wasted on unproductive hyperparameter combinations.\n- When the model hit a performance plateau, the team struggled to decide whether to train longer, adjust the batch size, or switch optimizers; decisions were mostly blind guesses.\n- Experiment records were chaotic and evaluation criteria inconsistent; previous best results often could not be reproduced, and global optima were missed because training was stopped too early.\n- The input pipeline and checkpointing strategy were unoptimized, so data loading became a bottleneck and the best model states from past runs were hard to recover.\n\n### With tuning_playbook\n- The team followed its scientific incremental tuning strategy, systematically settling the architecture, optimizer, and initial configuration in order, cutting trial-and-error cost by more than 60%.\n- Guided by its advice on exploration versus exploitation, each round of experiments had a clear goal, making it easier to judge when to try new directions and when to keep refining the current configuration.\n- A disciplined experiment-tracking practice was established, training step budgets were set according to compute constraints, and retrospective checkpoint selection surfaced high-performing model versions that had previously been overlooked.\n- Following the best practices for input pipeline optimization and multi-host training removed the data-loading bottleneck and kept training efficient and stable.\n\ntuning_playbook turns deep learning tuning from \"alchemy\" into a reproducible, measurable engineering process, significantly improving iteration efficiency and the final performance ceiling.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgoogle-research_tuning_playbook_2d62ecfe.png","google-research","Google Research","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fgoogle-research_c23b2adf.png","",null,"https:\u002F\u002Fresearch.google","https:\u002F\u002Fgithub.com\u002Fgoogle-research",29982,2422,"2026-04-04T18:41:56","NOASSERTION",1,"Not specified",{"notes":89,"python":87,"dependencies":90},"This tool (tuning_playbook) is a guide to deep learning hyperparameter tuning rather than an executable codebase, so the README lists no specific operating system, GPU, memory, Python version, or dependency requirements. Aimed at engineers and researchers, it provides methodological guidance on model architecture selection, optimizer choice, batch-size settings, and experiment design. Users should configure a runtime environment for whichever deep learning framework they actually use (e.g. TensorFlow or PyTorch).",[],[13],"2026-03-27T02:49:30.150509","2026-04-06T07:13:42.387155",[95,100,105,110,115,120],{"id":96,"question_zh":97,"answer_zh":98,"source_url":99},16038,"Does the project support or officially endorse translations into other languages?","The maintainers currently do not have the bandwidth to support official translations into multiple languages, so no non-English forks are endorsed on the main page. However, the project license allows community members to fork the repository and produce their own (unendorsed) translations and extensions. Unofficial translations should state clearly that they are not \"official\" and should link back to the original repository.","https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ftuning_playbook\u002Fissues\u002F18",{"id":101,"question_zh":102,"answer_zh":103,"source_url":104},16039,"The document says 'Norm(x + f(x))' is known to cause issues. Is there any experimental or published evidence for this?","This mainly concerns whether another normalization should follow the residual connection. See the following papers for the relevant discussion:\n1. https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.04369 (on the stability differences between pre- and post-layer normalization).\n2. https:\u002F\u002Farxiv.org\u002Fabs\u002F2206.00330 (another related example).\nIn practice this means no extra `Norm` operation is needed after the residual connection.","https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ftuning_playbook\u002Fissues\u002F31",{"id":106,"question_zh":107,"answer_zh":108,"source_url":109},16040,"Is large-batch training harder to tune? Should it be avoided?","It depends on your constraints:\n1. If the primary constraint is compute (or cost) with a fixed number of epochs: as the batch size grows, the range of hyperparameters that achieves good performance narrows, so more tuning work is indeed required and large batches may be inefficient.\n2. 
If the primary constraint is wall-clock time with a fixed step budget: on the contrary, larger batches typically have a larger volume of hyperparameter space that reaches the target performance.\nBottom line: do not avoid large batches merely because they are harder to tune. Best practice is to use a large batch size while making sure training runs long enough (i.e., for enough steps).","https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ftuning_playbook\u002Fissues\u002F11",{"id":111,"question_zh":112,"answer_zh":113,"source_url":114},16041,"Are there recommended papers or benchmarks to help choose an optimizer?","The early \"Crowded Valley\" paper (arXiv:2007.01547) is no longer the preferred reference. The current recommendation is the more recent work of the MLCommons Algorithms Working Group, in particular the paper https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.07179, which covers a diverse set of dataset and model combinations, tests popular optimizers in well-defined search spaces, and ranks them by performance-profile scores. The AlgoPerf project (https:\u002F\u002Fgithub.com\u002Fmlcommons\u002Falgorithmic-efficiency) is also worth consulting.","https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ftuning_playbook\u002Fissues\u002F10",{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},16042,"What might it mean if the best checkpoint occurs during the learning-rate warmup phase?","This is unusual and may indicate that the model is overfitting the validation set. Check what percentage of the total training steps the warmup occupies. For example, in a simple text classification task with hundreds of thousands of examples, if warmup lasts only 400-1000 steps (less than a quarter of an epoch) yet produces the best checkpoint, it usually points to a problem with the hyperparameter settings or to validation-set leakage.","https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Ftuning_playbook\u002Fissues\u002F61",{"id":121,"question_zh":122,"answer_zh":123,"source_url":119},16043,"When training on a small dataset, does increasing the batch size hurt the final result by reducing sampling diversity?","Batch size primarily affects training speed rather than final generalization ability, provided the number of training steps is sufficient. Although smaller batches see more varied combinations of samples within each epoch (due to shuffling), on a small dataset a larger batch will generally not make the final result meaningless as long as the model runs for enough steps rather than a fixed number of epochs. The key is to budget and evaluate training by steps, not epochs.",[]]