[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-gcr--torch-residual-networks":3,"tool-gcr--torch-residual-networks":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":79,"owner_twitter":79,"owner_website":82,"owner_url":83,"languages":84,"stars":93,"forks":94,"last_commit_at":95,"license":96,"difficulty_score":97,"env_os":98,"env_gpu":99,"env_ram":98,"env_deps":100,"category_tags":109,"github_topics":79,"view_count":23,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":110,"updated_at":111,"faqs":112,"releases":141},3644,"gcr\u002Ftorch-residual-networks","torch-residual-networks","This is a Torch implementation of [\"Deep Residual Learning for Image Recognition\",Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun](http:\u002F\u002Farxiv.org\u002Fabs\u002F1512.03385) the winners of the 2015 ILSVRC and COCO challenges.","torch-residual-networks 是基于 Torch 框架对经典论文《Deep Residual Learning for Image Recognition》的开源实现，旨在复现由何恺明等人提出、并荣获 2015 年 ILSVRC 和 COCO 挑战赛冠军的残差网络（ResNet）架构。该工具核心解决了深度神经网络中随着层数增加而出现的梯度消失及模型退化难题，通过引入“残差学习”机制，让深层网络更容易训练且性能更优。\n\n目前，torch-residual-networks 已在 CIFAR 数据集上验证了其有效性，能够稳定收敛并复现论文中的关键实验结果，包括不同模型尺寸、架构变体及优化策略对性能的影响分析。虽然 ImageNet 的大规模训练功能尚在完善中，但其提供的详细实验日志、误差曲线及预训练模型 artifacts，为研究者复现结果提供了极大便利。\n\n这款工具特别适合计算机视觉领域的研究人员和深度学习开发者使用。它不仅帮助使用者快速上手残差网络结构，还支持通过修改代码探索不同的超参数配置（如批归一化动量、替代求解器等）。对于希望深入理解深度残差学习原理或在自定义图","torch-residual-networks 是基于 Torch 框架对经典论文《Deep Residual Learning for Image Recognition》的开源实现，旨在复现由何恺明等人提出、并荣获 2015 年 ILSVRC 和 COCO 挑战赛冠军的残差网络（ResNet）架构。该工具核心解决了深度神经网络中随着层数增加而出现的梯度消失及模型退化难题，通过引入“残差学习”机制，让深层网络更容易训练且性能更优。\n\n目前，torch-residual-networks 已在 CIFAR 数据集上验证了其有效性，能够稳定收敛并复现论文中的关键实验结果，包括不同模型尺寸、架构变体及优化策略对性能的影响分析。虽然 ImageNet 的大规模训练功能尚在完善中，但其提供的详细实验日志、误差曲线及预训练模型 artifacts，为研究者复现结果提供了极大便利。\n\n这款工具特别适合计算机视觉领域的研究人员和深度学习开发者使用。它不仅帮助使用者快速上手残差网络结构，还支持通过修改代码探索不同的超参数配置（如批归一化动量、替代求解器等）。对于希望深入理解深度残差学习原理或在自定义图像识别任务中应用 ResNet 的技术人员来说，torch-residual-networks 是一个兼具教育意义与实用价值的参考基准。","Deep Residual Learning for Image Recognition\n============================================\n\nThis is a Torch implementation of [\"Deep Residual Learning for Image Recognition\",Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun](http:\u002F\u002Farxiv.org\u002Fabs\u002F1512.03385) the winners of the 2015 ILSVRC and COCO challenges.\n\n**What's working:** CIFAR converges, as 
per the paper.\n\n**What's not working yet:** Imagenet. I also have only implemented Option\n(A) for the residual network bottleneck strategy.\n\nTable of contents\n-----------------\n\n- [CIFAR: Effect of model size](#cifar-effect-of-model-size)\n- [CIFAR: Effect of model architecture on shallow networks](#cifar-effect-of-model-architecture)\n  - [...on deep networks](#cifar-effect-of-model-architecture-on-deep-networks)\n- [Imagenet: Others' preliminary model architecture experiments](#imagenet-effect-of-model-architecture-preliminary)\n- [CIFAR: Effect of alternate solvers (RMSprop, Adagrad, Adadelta)](#cifar-alternate-training-strategies-rmsprop-adagrad-adadelta)\n  - [...on deep networks](#cifar-alternate-training-strategies-on-deep-networks)\n- [CIFAR: Effect of batch normalization momentum](#effect-of-batch-norm-momentum)\n\nChanges\n-------\n- 2016-02-01: Added others' preliminary results on ImageNet for the architecture. (I haven't found time to train ImageNet yet)\n- 2016-01-21: Completed the 'alternate solver' experiments on deep networks. These ones take quite a long time.\n- 2016-01-19:\n  - **New results**: Re-ran the 'alternate building block' results on deeper networks. They have more of an effect.\n  - **Added a table of contents** to avoid getting lost.\n  - **Added experimental artifacts** (log of training loss and test error, the saved model, any patches used on the source code, etc.) for two of the more interesting experiments, for curious folks who want to reproduce our results. (These artifacts are hereby released under the zlib license.)\n- 2016-01-15:\n  - **New CIFAR results**: I re-ran all the CIFAR experiments and\n  updated the results. There were a few bugs: we were only testing on\n  the first 2,000 images in the training set, and they were sampled\n  with replacement. These new results are much more stable over time.\n- 2016-01-12: Release results of CIFAR experiments.\n\nHow to use\n----------\n- You need at least CUDA 7.0 and CuDNN v4.\n- Install Torch.\n- Install the Torch CUDNN V4 library: `git clone https:\u002F\u002Fgithub.com\u002Fsoumith\u002Fcudnn.torch; cd cudnn.torch; git checkout R4; luarocks make` This will give you `cudnn.SpatialBatchNormalization`, which helps save quite a lot of memory.\n- Install nninit: `luarocks install nninit`.\n- Download\n  [CIFAR 10](http:\u002F\u002Ftorch7.s3-website-us-east-1.amazonaws.com\u002Fdata\u002Fcifar-10-torch.tar.gz).\n  Use `--dataRoot \u003Ccifar>` to specify the location of the extracted CIFAR 10 folder.\n- Run `train-cifar.lua`.\n\nCIFAR: Effect of model size\n---------------------------\n\nFor this test, our goal is to reproduce Figure 6 from the original paper:\n\n![figure 6 from original paper](https:\u002F\u002Fi.imgur.com\u002Fq3lcHic.png)\n\nWe train our model for 200 epochs (this is about 7.8e4 of their\niterations on the above graph). Like their paper, we start at a\nlearning rate of 0.1 and reduce it to 0.01 at 80 epochs and then to\n0.001 at 160 epochs.\n\n### Training loss\n![Training loss curve](http:\u002F\u002Fi.imgur.com\u002FXqKnNX1.png)\n\n### Testing error\n![Test error curve](http:\u002F\u002Fi.imgur.com\u002Flt2D5cA.png)\n\n| Model                                 | My Test Error | Reference Test Error from Tab. 
6 | Artifacts |\n|----|----|----|----|\n| Nsize=3, 20 layers                    | 0.0829 | 0.0875 | [Model](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-AnY56THQt7\u002Fmodel.t7), [Loss](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-AnY56THQt7\u002FTraining%20loss.csv) and [Error](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-AnY56THQt7\u002FTesting%20Error.csv) logs, [Source commit](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-AnY56THQt7\u002FSource.git-current-commit) + [patch](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-AnY56THQt7\u002FSource.git-patch) |\n| Nsize=5, 32 layers                    | 0.0763 | 0.0751 | [Model](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-rewkex7oPJ\u002Fmodel.t7), [Loss](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-rewkex7oPJ\u002FTraining%20loss.csv) and [Error](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-rewkex7oPJ\u002FTesting%20Error.csv) logs, [Source commit](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-rewkex7oPJ\u002FSource.git-current-commit) + [patch](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-rewkex7oPJ\u002FSource.git-patch) |\n| Nsize=7, 44 layers                    | 0.0714 | 0.0717 | [Model](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-HxIw7lGPyu\u002Fmodel.t7), [Loss](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-HxIw7lGPyu\u002FTraining%20loss.csv) and [Error](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-HxIw7lGPyu\u002FTesting%20Error.csv) logs, [Source commit](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-HxIw7lGPyu\u002FSource.git-current-commit) + [patch](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-HxIw7lGPyu\u002FSource.git-patch) |\n| Nsize=9, 56 layers                    | 0.0694 | 0.0697 | [Model](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-te4ScgnYMA\u002Fmodel.t7), [Loss](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-te4ScgnYMA\u002FTraining%20loss.csv) and [Error](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-te4ScgnYMA\u002FTesting%20Error.csv) logs, [Source commit](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-te4ScgnYMA\u002FSource.git-current-commit) + [patch](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-te4ScgnYMA\u002FSource.git-patch) |\n| Nsize=18, 110 layers, fancy policy¹   | 0.0673 | 0.0661² | [Model](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601142006-5T5D1DO3VP\u002Fmodel.t7), [Loss](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601142006-5T5D1DO3VP\u002FTraining%20loss.csv) and [Error](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601142006-5T5D1DO3VP\u002FTesting%20Error.csv) logs, [Source 
commit](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601142006-5T5D1DO3VP\u002FSource.git-current-commit) + [patch](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601142006-5T5D1DO3VP\u002FSource.git-patch) |\n\nWe can reproduce the results from the paper to typically within 0.5%.\nIn all cases except for the 32-layer network, we achieve very slightly\nimproved performance, though this may just be noise.\n\n¹: For this run, we started from a learning rate of 0.001 until the\nfirst 400 iterations. We then raised the learning rate to 0.1 and\ntrained as usual. This is consistent with the actual paper's results.\n\n²: Note that the paper reports the best run from five runs, as well as\nthe mean. I consider the mean to be a valid test protocol, but I don't\nlike reporting the 'best' score because this is effectively training\non the test set. (This method of reporting effectively introduces an\nextra parameter into the model--which model to use from the\nensemble--and this parameter is fitted to the test set)\n\nCIFAR: Effect of model architecture\n-----------------------------------\n\nThis experiment explores the effect of different NN architectures that\nalter the \"Building Block\" model inside the residual network.\n\nThe original paper used a \"Building Block\" similar to the \"Reference\"\nmodel on the left part of the figure below, with the standard\nconvolution layer, batch normalization, and ReLU, followed by another\nconvolution layer and batch normalization. The only interesting piece\nof this architecture is that they move the ReLU after the addition.\n\nWe investigated two alternate strategies.\n\n![Three different alternate CIFAR architectures](https:\u002F\u002Fi.imgur.com\u002FuRMBOaS.png)\n\n- **Alternate 1: Move batch normalization after the addition.**\n  (Middle) The reasoning behind this choice is to test whether\n  normalizing the first term of the addition is desirable. It grew out\n  of the mistaken belief that batch normalization always normalizes to\n  have zero mean and unit variance. If this were true, building an\n  identity building block would be impossible because the input to the\n  addition always has unit variance. However, this is not true. BN\n  layers have additional learnable scale and bias parameters, so the\n  input to the batch normalization layer is not forced to have unit\n  variance.\n\n- **Alternate 2: Remove the second ReLU.** The idea behind this was\n  noticing that in the reference architecture, the input cannot\n  proceed to the output without being modified by a ReLU. This makes\n  identity connections *technically* impossible because negative\n  numbers would always be clipped as they passed through the skip\n  layers of the network. To avoid this, we could either move the ReLU\n  before the addition or remove it completely. However, it is not\n  correct to move the ReLU before the addition: such an architecture\n  would ensure that the output would never decrease because the first\n  addition term could never be negative. The other option is to simply\n  remove the ReLU completely, sacrificing the nonlinear property of\n  this layer. 
It is unclear which approach is better.\n\nTo test these strategies, we repeat the above protocol using the\nsmallest (20-layer) residual network model.\n\n(Note: The other experiments all use the leftmost \"Reference\" model.)\n\n![Training loss](http:\u002F\u002Fi.imgur.com\u002FqDDLZLQ.png)\n\n![Testing error](http:\u002F\u002Fi.imgur.com\u002FfTY6TL5.png)\n\n| Architecture                        | Test error |\n|-----------------------------------|----------|\n| ReLU, BN before add (ORIG PAPER reimplementation)    | 0.0829 |\n| No ReLU, BN before add              | 0.0862 |\n| ReLU, BN after add                  | 0.0834 |\n| No ReLU, BN after add               | 0.0823 |\n\nAll methods achieve accuracies within about 0.5% of each other.\nRemoving ReLU and moving the batch normalization after the addition\nseems to make a small improvement on CIFAR, but there is too much\nnoise in the test error curve to reliably tell a difference.\n\nCIFAR: Effect of model architecture on deep networks\n----------------------------------------------------\n\nThe above experiments on the 20-layer networks do not reveal any\ninteresting differences. However, these differences become more\npronounced when evaluated on very deep networks. We retry the above\nexperiments on 110-layer (Nsize=19) networks.\n\n![Training loss](http:\u002F\u002Fi.imgur.com\u002FRANDrXl.png)\n\n![Testing error](http:\u002F\u002Fi.imgur.com\u002FsldN4cK.png)\n\nResults:\n\n- For deep networks, **it's best to put the batch normalization before\n  the addition part of each building block layer**. This effectively\n  removes most of the batch normalization operations from the input\n  skip paths. If a batch normalization comes after each building\n  block, then there exists a path from the input straight to the\n  output that passes through several batch normalizations in a row.\n  This could be problematic because each BN is not idempotent (the\n  effects of several BN layers accumulate).\n\n- Removing the ReLU layer at the end of each building block appears to\n  give a small improvement (~0.6%)\n\n| Architecture                        | Test error | Artifacts |\n|-----------------------------------|----------|---|\n| ReLU, BN before add (ORIG PAPER reimplementation)    |  0.0697 | [Model](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181920-jmOtpiNPQa\u002Fmodel.t7), [Loss](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181920-jmOtpiNPQa\u002FTraining%20loss.csv) and [Error](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181920-jmOtpiNPQa\u002FTesting%20Error.csv) logs, [Source commit](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181920-jmOtpiNPQa\u002FSource.git-current-commit) + [patch](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181920-jmOtpiNPQa\u002FSource.git-patch) |\n| No ReLU, BN before add              |  0.0632 | [Model](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181924-V2wDg0NKDK\u002Fmodel.t7), [Loss](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181924-V2wDg0NKDK\u002FTraining%20loss.csv) and [Error](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181924-V2wDg0NKDK\u002FTesting%20Error.csv) logs, [Source commit](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181924-V2wDg0NKDK\u002FSource.git-current-commit) + 
[patch](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181924-V2wDg0NKDK\u002FSource.git-patch) |\n| ReLU, BN after add                  |  0.1356 | [Model](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181922-8VYWhyuTuA\u002Fmodel.t7), [Loss](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181922-8VYWhyuTuA\u002FTraining%20loss.csv) and [Error](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181922-8VYWhyuTuA\u002FTesting%20Error.csv) logs, [Source commit](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181922-8VYWhyuTuA\u002FSource.git-current-commit) + [patch](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181922-8VYWhyuTuA\u002FSource.git-patch) |\n| No ReLU, BN after add               |  0.1230 | [Model](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181923-Qfp5mTA2u9\u002Fmodel.t7), [Loss](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181923-Qfp5mTA2u9\u002FTraining%20loss.csv) and [Error](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181923-Qfp5mTA2u9\u002FTesting%20Error.csv) logs, [Source commit](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181923-Qfp5mTA2u9\u002FSource.git-current-commit) + [patch](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181923-Qfp5mTA2u9\u002FSource.git-patch) |\n\nImageNet: Effect of model architecture (preliminary)\n----------------------------------------------------\n[@ducha-aiki is performing preliminary experiments on imagenet.](https:\u002F\u002Fgithub.com\u002Fgcr\u002Ftorch-residual-networks\u002Fissues\u002F5)\nFor ordinary CaffeNet networks, @ducha-aiki found that putting batch\nnormalization after the ReLU layer may provide a small benefit\ncompared to putting it before.\n\n> Second, results on CIFAR-10 often contradicts results on ImageNet. I.e., leaky ReLU > ReLU on CIFAR, but worse on ImageNet.\n\n@ducha-aiki's more detailed results here: https:\u002F\u002Fgithub.com\u002Fducha-aiki\u002Fcaffenet-benchmark\u002Fblob\u002Fmaster\u002Fbatchnorm.md\n\n\nCIFAR: Alternate training strategies (RMSPROP, Adagrad, Adadelta)\n-----------------------------------------------------------------\n\nCan we improve on the basic SGD update rule with Nesterov momentum?\nThis experiment aims to find out. Common wisdom suggests that\nalternate update rules may converge faster, at least initially, but\nthey do not outperform well-tuned SGD in the long run.\n\n![Training loss curve](http:\u002F\u002Fi.imgur.com\u002F0ZxQZ7k.png)\n\n![Testing error curve](http:\u002F\u002Fi.imgur.com\u002FoLzwLDo.png)\n\nIn our experiments, vanilla SGD with Nesterov momentum and a learning\nrate of 0.1 eventually reaches the lowest test error. 
Interestingly,\nRMSPROP with learning rate 1e-2 achieves a lower training loss, but\noverfits.\n\n| Strategy                                      | Test error |\n|---------------------------------------------|----------|\n| Original paper: SGD + Nesterov momentum, 1e-1 | 0.0829     |\n| RMSprop, learning rate = 1e-4                 | 0.1677     |\n| RMSprop, 1e-3                                 | 0.1055     |\n| RMSprop, 1e-2                                 | 0.0945     |\n| Adadelta¹, rho = 0.3                          | 0.1093     |\n| Adagrad, 1e-3                                 | 0.3536     |\n| Adagrad, 1e-2                                 | 0.1603     |\n| Adagrad, 1e-1                                 | 0.1255     |\n\n¹: Adadelta does not use a learning rate, so we did not use the same\nlearning rate policy as in the paper. We just let it run until\nconvergence.\n\nSee\n[Andrej Karpathy's CS231N notes](https:\u002F\u002Fcs231n.github.io\u002Fneural-networks-3\u002F#update)\nfor more details on each of these learning strategies.\n\nCIFAR: Alternate training strategies on deep networks\n-----------------------------------------------------\n\nDeeper networks are more prone to overfitting. Unlike the earlier\nexperiments, all of these models (except Adagrad with a learning rate\nof 1e-3) achieve a loss under 0.1, but test error varies quite wildly.\nOnce again, using vanilla SGD with Nesterov momentum achieves the\nlowest error.\n\n![Training loss](http:\u002F\u002Fi.imgur.com\u002FZvMfLtk.png)\n\n![Testing error](http:\u002F\u002Fi.imgur.com\u002FB8PMIQw.png)\n\n| Solver                                    | Testing error |\n|-------------------------------------------|--------|\n| Nsize=18, Original paper: Nesterov, 1e-1  | 0.0697 |\n| Nsize=18, RMSprop, 1e-4                   | 0.1482 |\n| Nsize=18, RMSprop, 1e-3                   | 0.0821 |\n| Nsize=18, RMSprop, 1e-2                   | 0.0768 |\n| Nsize=18, RMSprop, 1e-1                   | 0.1098 |\n| Nsize=18, Adadelta                        | 0.0888 |\n| Nsize=18, Adagrad, 1e-3                   | 0.3022 |\n| Nsize=18, Adagrad, 1e-2                   | 0.1321 |\n| Nsize=18, Adagrad, 1e-1                   | 0.1145 |\n\nEffect of batch norm momentum\n-----------------------------\n\nFor our experiments, we use batch normalization with an exponential\nrunning mean and standard deviation, using a momentum of 0.1, meaning\nthat the running mean and std change by 10% of their value at each\nbatch. A value of 1.0 would cause the batch normalization layer to\ncalculate the mean and standard deviation across only the current\nbatch, and a value of 0 would cause the batch normalization layer to\nstop accumulating changes in the running mean and standard deviation.\n\nThe strictest interpretation of the original batch normalization paper\nis to calculate the mean and standard deviation across the entire\ntraining set at every update. This takes too long in practice, so the\nexponential average is usually used instead.\n\nWe attempt to see whether batch normalization momentum affects\nanything. 
We try different values away from the default, along with a\n\"dynamic\" update strategy that sets the momentum to 1 \u002F (1+n), where n\nis the number of batches seen so far (N resets to 0 at every epoch).\nAt the end of training for a certain epoch, this means the batch\nnormalization's running mean and standard deviation is effectively\ncalculated over the entire training set.\n\nNone of these effects appear to make a significant difference.\n\n![Test error curve](http:\u002F\u002Fi.imgur.com\u002F3M1P79N.png)\n\n| Strategy | Test Error |\n|----|----|\n| BN, momentum = 1 just for fun      |  0.0863 |\n| BN, momentum = 0.01                |  0.0835 |\n| Original paper: BN momentum = 0.1  |  0.0829 |\n| Dynamic, reset every epoch.        |  0.0822 |\n\n\n\nTODO: Imagenet\n--------------\n","深度残差学习用于图像识别\n============================================\n\n这是对2015年ILSVRC和COCO挑战赛冠军论文《深度残差学习用于图像识别》（作者：何凯明、张祥雨、任少庆、孙剑）的Torch实现。\n\n**已实现的功能：** CIFAR数据集可以收敛，与论文一致。\n\n**尚未实现的功能：** ImageNet数据集。此外，我目前只实现了残差网络瓶颈结构的选项(A)。\n\n目录\n-----------------\n\n- [CIFAR：模型大小的影响](#cifar-effect-of-model-size)\n- [CIFAR：浅层网络中模型架构的影响](#cifar-effect-of-model-architecture)\n  - [...在深层网络中的影响](#cifar-effect-of-model-architecture-on-deep-networks)\n- [ImageNet：他人关于模型架构的初步实验](#imagenet-effect-of-model-architecture-preliminary)\n- [CIFAR：替代优化器的影响（RMSprop、Adagrad、Adadelta）](#cifar-alternate-training-strategies-rmsprop-adagrad-adadelta)\n  - [...在深层网络中的影响](#cifar-alternate-training-strategies-on-deep-networks)\n- [CIFAR：批归一化动量的影响](#effect-of-batch-norm-momentum)\n\n更新记录\n-------\n- 2016-02-01：添加了他人在ImageNet数据集上关于该架构的初步结果。（我尚未有时间训练ImageNet）\n- 2016-01-21：完成了深层网络上的“替代优化器”实验。这些实验耗时较长。\n- 2016-01-19：\n  - **新结果**：重新运行了深层网络上的“替代构建块”实验。这些实验的效果更为显著。\n  - **增加了目录**，以方便查阅。\n  - **添加了实验产物**（训练损失和测试误差的日志、保存的模型、对源代码所做的补丁等），供希望复现我们结果的好奇者参考。（这些产物在此依据zlib许可证发布。）\n- 2016-01-15：\n  - **新的CIFAR结果**：我重新运行了所有CIFAR实验并更新了结果。之前存在一些错误：我们仅在训练集的前2,000张图片上进行测试，且是带放回地随机采样。这些新结果随时间推移更加稳定。\n- 2016-01-12：发布CIFAR实验结果。\n\n使用方法\n----------\n- 需要至少CUDA 7.0和CuDNN v4。\n- 安装Torch。\n- 安装Torch CuDNN V4库：`git clone https:\u002F\u002Fgithub.com\u002Fsoumith\u002Fcudnn.torch; cd cudnn; git co R4; luarocks make` 这将提供`cudnn.SpatialBatchNormalization`，有助于节省大量内存。\n- 安装nninit：`luarocks install nninit`。\n- 下载\n  [CIFAR 10](http:\u002F\u002Ftorch7.s3-website-us-east-1.amazonaws.com\u002Fdata\u002Fcifar-10-torch.tar.gz)。\n  使用`--dataRoot \u003Ccifar>`指定解压后的CIFAR 10文件夹位置。\n- 运行`train-cifar.lua`。\n\nCIFAR：模型大小的影响\n---------------------------\n\n本次测试的目标是复现原论文中的图6：\n\n![原论文中的图6](https:\u002F\u002Fi.imgur.com\u002Fq3lcHic.png)\n\n我们将模型训练200个epoch（这大约相当于上图中他们的7.8万次迭代）。与原论文相同，我们从0.1的学习率开始，在第80个epoch将其降低到0.01，并在第160个epoch再次降至0.01。\n\n### 训练损失\n![训练损失曲线](http:\u002F\u002Fi.imgur.com\u002FXqKnNX1.png)\n\n### 测试误差\n![测试误差曲线](http:\u002F\u002Fi.imgur.com\u002Flt2D5cA.png)\n\n| 模型                                 | 我的测试误差 | 表6中的参考测试误差 | 实验产物 |\n|----|----|----|----|\n| Nsize=3，20层                    | 0.0829 | 0.0875 | [模型](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-AnY56THQt7\u002Fmodel.t7)，[损失](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-AnY56THQt7\u002FTraining%20loss.csv)和[误差](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-AnY56THQt7\u002FTesting%20Error.csv)日志，[源码提交](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-AnY56THQt7\u002FSource.git-current-commit) + 
[补丁](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-AnY56THQt7\u002FSource.git-patch) |\n| Nsize=5，32层                    | 0.0763 | 0.0751 | [模型](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-rewkex7oPJ\u002Fmodel.t7)，[损失](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-rewkex7oPJ\u002FTraining%20loss.csv)和[误差](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-rewkex7oPJ\u002FTesting%20Error.csv)日志，[源码提交](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-rewkex7oPJ\u002FSource.git-current-commit) + [补丁](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141709-rewkex7oPJ\u002FSource.git-patch) |\n| Nsize=7，44层                    | 0.0714 | 0.0717 | [模型](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-HxIw7lGPyu\u002Fmodel.t7)，[损失](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-HxIw7lGPyu\u002FTraining%20loss.csv)和[误差](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-HxIw7lGPyu\u002FTesting%20Error.csv)日志，[源码提交](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-HxIw7lGPyu\u002FSource.git-current-commit) + [补丁](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-HxIw7lGPyu\u002FSource.git-patch) |\n| Nsize=9，56层                    | 0.0694 | 0.0697 | [模型](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-te4ScgnYMA\u002Fmodel.t7)，[损失](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-te4ScgnYMA\u002FTraining%20loss.csv)和[误差](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-te4ScgnYMA\u002FTesting%20Error.csv)日志，[源码提交](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-te4ScgnYMA\u002FSource.git-current-commit) + [补丁](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601141710-te4ScgnYMA\u002FSource.git-patch) |\n| Nsize=18，110层，特殊策略¹   | 0.0673 | 0.0661² | [模型](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601142006-5T5D1DO3VP\u002Fmodel.t7)，[损失](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601142006-5T5D1DO3VP\u002FTraining%20loss.csv)和[误差](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601142006-5T5D1DO3VP\u002FTesting%20Error.csv)日志，[源码提交](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601142006-5T5D1DO3VP\u002FSource.git-current-commit) + [补丁](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601142006-5T5D1DO3VP\u002FSource.git-patch) |\n\n我们通常可以在0.5%的误差范围内复现论文中的结果。除32层网络外，其他情况下我们的性能都略优于论文中的结果，但这可能只是噪声所致。\n\n¹：对于这次运行，我们在前400次迭代中使用0.001的学习率，随后将其提高到0.1，并按常规方式继续训练。这与原文的结果一致。\n\n²：请注意，论文报告了五次运行中的最佳结果以及平均值。我认为平均值是一种有效的测试协议，但我不喜欢报告“最佳”分数，因为这实际上相当于在测试集上进行训练。（这种报告方式实际上为模型引入了一个额外的参数——从集成中选择哪个模型——而这个参数是针对测试集拟合的）\n\nCIFAR：模型架构的影响\n-------------------\n\n本实验探讨了不同神经网络架构对残差网络内部“构建块”模型的影响。\n\n原始论文使用的“构建块”类似于下图左侧的“参考”模型，包含标准卷积层、批归一化和ReLU激活函数，随后是另一层卷积和批归一化。该架构中唯一有趣的地方在于他们将ReLU放在了加法之后。\n\n我们研究了两种替代策略。\n\n![三种不同的CIFAR架构](https:\u002F\u002Fi.imgur.com\u002FuRMBOaS.png)\n\n- 
**替代方案1：将批归一化移至加法之后。**（中间）这一选择的理由是测试是否对加法的第一项进行归一化是有益的。它源于一种误解，即批归一化总是会将数据归一化为均值为零、方差为1的状态。如果真是如此，构建一个恒等构建块将是不可能的，因为加法的输入始终具有单位方差。然而，事实并非如此。批归一化层具有可学习的缩放和偏置参数，因此输入到批归一化层的数据并不一定会被强制归一化为单位方差。\n\n- **替代方案2：去掉第二个ReLU。** 这一想法的灵感来自于观察到，在参考架构中，输入必须经过ReLU的修改才能传递到输出。这使得恒等连接在技术上几乎不可能实现，因为负数在通过网络的跳跃层时会被始终截断。为了避免这种情况，我们可以将ReLU移到加法之前，或者完全移除它。然而，将ReLU移到加法之前并不正确：这样的架构会确保输出永远不会减少，因为加法的第一个项不可能为负数。另一种选择则是直接移除ReLU，从而牺牲该层的非线性特性。目前尚不清楚哪种方法更好。\n\n为了测试这些策略，我们使用最小的（20层）残差网络模型重复上述实验。\n\n（注：其他实验均采用最左侧的“参考”模型。）\n\n![训练损失](http:\u002F\u002Fi.imgur.com\u002FqDDLZLQ.png)\n\n![测试误差](http:\u002F\u002Fi.imgur.com\u002FfTY6TL5.png)\n\n| 架构                        | 测试误差 |\n|-------------------------------|----------|\n| ReLU，BN在加法前（原论文复现）    | 0.0829 |\n| 无ReLU，BN在加法前              | 0.0862 |\n| ReLU，BN在加法后                  | 0.0834 |\n| 无ReLU，BN在加法后               | 0.0823 |\n\n所有方法的准确率相差都在0.5%以内。在CIFAR数据集上，移除ReLU并将批归一化移至加法之后似乎带来了一点点改进，但测试误差曲线中的噪声太大，难以可靠地判断出差异。\n\nCIFAR：模型架构对深度网络的影响\n------------------------------------\n\n上述针对20层网络的实验并未显示出任何显著差异。然而，当评估非常深的网络时，这些差异变得更加明显。我们再次在110层（Nsize=19）网络上重复了上述实验。\n\n![训练损失](http:\u002F\u002Fi.imgur.com\u002FRANDrXl.png)\n\n![测试误差](http:\u002F\u002Fi.imgur.com\u002FsldN4cK.png)\n\n结果：\n\n- 对于深度网络而言，**最好将批归一化置于每个构建块层的加法部分之前**。这样可以有效地减少输入跳跃路径上的批归一化操作次数。如果每个构建块之后都紧跟着批归一化，那么就存在一条从输入直接通向输出的路径，这条路径会连续经过多个批归一化层。这可能会带来问题，因为每个批归一化层都不是幂等的（多次批归一化的效果会累积）。\n\n- 移除每个构建块末端的ReLU层似乎能带来小幅提升（约0.6%）。\n\n| 架构                        | 测试误差 | 附带资料 |\n|-------------------------------|----------|----------|\n| ReLU，BN在加法前（原论文复现）    | 0.0697 | [模型](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181920-jmOtpiNPQa\u002Fmodel.t7)、[损失](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181920-jmOtpiNPQa\u002FTraining%20loss.csv)和[误差](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181920-jmOtpiNPQa\u002FTesting%20Error.csv)日志，[源代码提交](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181920-jmOtpiNPQa\u002FSource.git-current-commit) + [补丁](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181920-jmOtpiNPQa\u002FSource.git-patch) |\n| 无ReLU，BN在加法前              | 0.0632 | [模型](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181924-V2wDg0NKDK\u002Fmodel.t7)、[损失](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181924-V2wDg0NKDK\u002FTraining%20loss.csv)和[误差](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181924-V2wDg0NKDK\u002FTesting%20Error.csv)日志，[源代码提交](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181924-V2wDg0NKDK\u002FSource.git-current-commit) + [补丁](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181924-V2wDg0NKDK\u002FSource.git-patch) |\n| ReLU，BN在加法后                  | 0.1356 | [模型](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181922-8VYWhyuTuA\u002Fmodel.t7)、[损失](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181922-8VYWhyuTuA\u002FTraining%20loss.csv)和[误差](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181922-8VYWhyuTuA\u002FTesting%20Error.csv)日志，[源代码提交](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181922-8VYWhyuTuA\u002FSource.git-current-commit) + 
[补丁](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181922-8VYWhyuTuA\u002FSource.git-patch) |\n| 无ReLU，BN在加法后               | 0.1230 | [模型](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181923-Qfp5mTA2u9\u002Fmodel.t7)、[损失](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181923-Qfp5mTA2u9\u002FTraining%20loss.csv)和[误差](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181923-Qfp5mTA2u9\u002FTesting%20Error.csv)日志，[源代码提交](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181923-Qfp5mTA2u9\u002FSource.git-current-commit) + [补丁](https:\u002F\u002Fmjw-xi8mledcnyry.s3.amazonaws.com\u002Fexperiments\u002F201601181923-Qfp5mTA2u9\u002FSource.git-patch) |\n\nImageNet：模型架构的影响（初步）\n----------------------------------------------------\n[@ducha-aiki 正在进行 ImageNet 的初步实验。](https:\u002F\u002Fgithub.com\u002Fgcr\u002Ftorch-residual-networks\u002Fissues\u002F5)\n对于普通的 CaffeNet 网络，@ducha-aiki 发现将批归一化放在 ReLU 层之后，相比放在之前，可能会带来一点小的提升。\n\n> 其次，CIFAR-10 上的结果往往与 ImageNet 上的结果相矛盾。例如，在 CIFAR 上，带泄露的 ReLU 优于普通 ReLU，但在 ImageNet 上却表现更差。\n\n@ducha-aiki 的更详细结果请见：https:\u002F\u002Fgithub.com\u002Fducha-aiki\u002Fcaffenet-benchmark\u002Fblob\u002Fmaster\u002Fbatchnorm.md\n\n\nCIFAR：其他训练策略（RMSPROP、Adagrad、Adadelta）\n-----------------------------------------------------------------\n\n我们能否在带有 Nesterov 动量的基本 SGD 更新规则基础上进一步改进？本次实验旨在探究这一点。普遍的观点认为，其他更新规则至少在初期可能收敛得更快，但从长期来看，它们通常无法超越经过精心调优的 SGD。\n\n![训练损失曲线](http:\u002F\u002Fi.imgur.com\u002F0ZxQZ7k.png)\n\n![测试误差曲线](http:\u002F\u002Fi.imgur.com\u002FoLzwLDo.png)\n\n在我们的实验中，采用 Nesterov 动量且学习率为 0.1 的普通 SGD 最终达到了最低的测试误差。有趣的是，学习率为 1e-2 的 RMSPROP 虽然训练损失更低，但却出现了过拟合现象。\n\n| 策略                                      | 测试误差 |\n|---------------------------------------------|----------|\n| 原论文：SGD + Nesterov 动量，1e-1           | 0.0829     |\n| RMSprop，学习率 = 1e-4                      | 0.1677     |\n| RMSprop，1e-3                                | 0.1055     |\n| RMSprop，1e-2                                | 0.0945     |\n| Adadelta¹，rho = 0.3                         | 0.1093     |\n| Adagrad，1e-3                                | 0.3536     |\n| Adagrad，1e-2                                | 0.1603     |\n| Adagrad，1e-1                                | 0.1255     |\n\n¹：Adadelta 不使用学习率，因此我们没有沿用论文中的学习率策略，而是让其一直运行直到收敛。\n\n有关这些学习策略的更多细节，请参阅\n[Andrej Karpathy 的 CS231N 笔记](https:\u002F\u002Fcs231n.github.io\u002Fneural-networks-3\u002F#update)。\n\nCIFAR：深度网络上的其他训练策略\n-----------------------------------------------------\n\n深度网络更容易出现过拟合问题。与之前的实验不同，除了学习率为 1e-3 的 Adagrad 外，所有模型的损失都低于 0.1，但测试误差却差异很大。再次证明，使用带有 Nesterov 动量的普通 SGD 能够达到最低的误差。\n\n![训练损失](http:\u002F\u002Fi.imgur.com\u002FZvMfLtk.png)\n\n![测试误差](http:\u002F\u002Fi.imgur.com\u002FB8PMIQw.png)\n\n| 求解器                                    | 测试误差 |\n|-------------------------------------------|--------|\n| 网络规模=18，原论文：Nesterov，1e-1       | 0.0697 |\n| 网络规模=18，RMSprop，1e-4                 | 0.1482 |\n| 网络规模=18，RMSprop，1e-3                 | 0.0821 |\n| 网络规模=18，RMSprop，1e-2                 | 0.0768 |\n| 网络规模=18，RMSprop，1e-1                 | 0.1098 |\n| 网络规模=18，Adadelta                      | 0.0888 |\n| 网络规模=18，Adagrad，1e-3                 | 0.3022 |\n| 网络规模=18，Adagrad，1e-2                 | 0.1321 |\n| 网络规模=18，Adagrad，1e-1                 | 0.1145 |\n\n批归一化动量的影响\n-----------------------------\n\n在我们的实验中，我们使用了基于指数移动平均的批归一化方法，动量值为 0.1，这意味着每次批次处理时，运行均值和标准差会根据当前批次的数据调整其值的 10%。如果动量值为 
1.0，则批归一化层只会基于当前批次计算均值和标准差；而动量值为 0 时，批归一化层将不再累积运行均值和标准差的变化。\n\n严格来说，原始批归一化论文建议在每次更新时都基于整个训练集计算均值和标准差。然而，这在实际操作中耗时过长，因此通常采用指数移动平均的方法。\n\n我们尝试观察批归一化动量是否会对结果产生影响。为此，我们测试了几个不同于默认值的动量参数，并引入了一种“动态”更新策略，即动量设置为 1 \u002F (1+n)，其中 n 是迄今为止已见过的批次数量（每完成一个 epoch，n 会重置为 0）。这样一来，在某个 epoch 结束时，批归一化层的运行均值和标准差实际上就相当于在整个训练集上计算得出的。\n\n然而，实验结果表明，这些不同的动量设置并未带来显著差异。\n\n![测试误差曲线](http:\u002F\u002Fi.imgur.com\u002F3M1P79N.png)\n\n| 策略 | 测试误差 |\n|----|----|\n| BN，动量 = 1 仅作趣味尝试      |  0.0863 |\n| BN，动量 = 0.01                |  0.0835 |\n| 原论文：BN 动量 = 0.1          |  0.0829 |\n| 动态策略，每 epoch 重置        |  0.0822 |\n\n\n\n待办事项：ImageNet\n--------------","# torch-residual-networks 快速上手指南\n\n本指南基于 Kaiming He 等人提出的深度残差网络（ResNet）论文，提供在 Torch 框架下的 CIFAR-10 图像识别实现。\n\n## 环境准备\n\n在开始之前，请确保您的系统满足以下要求：\n\n*   **GPU 驱动**：需安装 CUDA 7.0 或更高版本。\n*   **加速库**：需安装 CuDNN v4。\n*   **框架**：已安装 Torch (Lua)。\n\n> **注意**：本项目目前主要支持 CIFAR-10 数据集的复现，ImageNet 部分尚未完全实现。\n\n## 安装步骤\n\n请按顺序执行以下命令以配置依赖环境和数据集：\n\n1.  **安装 Torch CuDNN v4 库**\n    该库提供了 `cudnn.SpatialBatchNormalization`，能显著节省显存。\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002Fsoumith\u002Fcudnn.torch\n    cd cudnn.torch\n    git co R4\n    luarocks make\n    cd ..\n    ```\n\n2.  **初始化工具包**\n    安装权重初始化库 `nninit`：\n    ```bash\n    luarocks install nninit\n    ```\n\n3.  **下载数据集**\n    下载 CIFAR-10 数据集（Torch 格式）：\n    ```bash\n    wget http:\u002F\u002Ftorch7.s3-website-us-east-1.amazonaws.com\u002Fdata\u002Fcifar-10-torch.tar.gz\n    tar -xzf cifar-10-torch.tar.gz\n    ```\n    *注：国内用户若下载缓慢，可尝试使用迅雷等工具加速，或自行寻找国内镜像源替换上述链接。*\n\n## 基本使用\n\n完成安装后，即可运行训练脚本。假设您将 CIFAR-10 数据解压到了 `\u003Ccifar>` 目录：\n\n```bash\nth train-cifar.lua --dataRoot \u003Ccifar>\n```\n\n**参数说明：**\n*   `--dataRoot`：指定解压后的 CIFAR-10 文件夹路径。\n\n**默认训练策略：**\n*   **迭代次数**：200 epochs。\n*   **学习率调整**：初始为 0.1，在第 80 个 epoch 降至 0.01，在第 160 个 epoch 降至 0.001。\n*   **模型验证**：默认复现原论文 Figure 6 的结果，测试误差通常在 0.5% 范围内与论文一致。","某计算机视觉初创团队正致力于开发一套高精度的工业缺陷检测系统，需要在有限的标注数据上训练极深的神经网络以识别微小瑕疵。\n\n### 没有 torch-residual-networks 时\n- **深层网络难以收敛**：尝试构建超过 20 层的自定义卷积网络时，梯度消失问题严重，模型在 CIFAR-10 数据集上的测试错误率居高不下，无法复现论文中的高性能。\n- **实验复现成本高昂**：缺乏经过验证的残差块（Residual Block）标准实现，工程师需花费数周时间手动调试架构细节，且难以确定是代码错误还是算法局限。\n- **训练过程不稳定**：在使用不同优化器（如 RMSprop 或 Adagrad）时，损失曲线波动剧烈，缺乏像 Batch Normalization 与残差结构结合后的稳定训练表现。\n- **资源浪费严重**：由于架构设计不当，显存占用过高且训练效率低下，导致在 CUDA 环境下的迭代周期被大幅拉长。\n\n### 使用 torch-residual-networks 后\n- **轻松突破深度瓶颈**：直接调用该工具中已验证的 20 层及以上残差网络架构，在 CIFAR-10 上快速复现了低至 8.29% 的测试错误率，成功解决了梯度消失难题。\n- **开箱即用的可靠基线**：利用其提供的完整训练脚本和预调参策略（如学习率衰减计划），团队将模型搭建与验证时间从数周缩短至几天，立即获得可信赖的性能基准。\n- **训练稳定且高效**：借助工具集成的 `cudnn.SpatialBatchNormalization` 和优化后的残差连接策略，模型在深层训练时损失曲线平滑收敛，显著提升了调试效率。\n- **透明可复现的实验资产**：直接参考其开源的训练日志、误差曲线及保存模型，团队能快速定位自身数据差异，避免了重复造轮子带来的算力浪费。\n\ntorch-residual-networks 通过将顶会获奖算法转化为可执行的 Torch 代码，让开发者能站在巨人的肩膀上，以最低成本实现工业级的高精度图像识别。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgcr_torch-residual-networks_99b417e1.png","gcr","Kimmy","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fgcr_00647108.jpg",null,"Krnel AI","NYC","https:\u002F\u002Fkjwilber.org","https:\u002F\u002Fgithub.com\u002Fgcr",[85,89],{"name":86,"color":87,"percentage":88},"Jupyter Notebook","#DA5B0B",95.4,{"name":90,"color":91,"percentage":92},"Lua","#000080",4.6,581,146,"2026-03-09T12:56:01","Zlib",4,"未说明","必需 NVIDIA GPU，需支持 CUDA 7.0 及以上",{"notes":101,"python":102,"dependencies":103},"该项目基于旧的 Torch (Lua) 框架而非 PyTorch 或 Python。安装时需手动克隆并编译 cudnn.torch 库（切换至 R4 分支），并使用 luarocks 安装 nninit。目前仅支持 CIFAR-10 数据集的复现，ImageNet 部分尚未完成。","不适用 (基于 
Lua\u002FTorch)",[104,105,106,107,108],"Torch","cudnn.torch (v4\u002FR4)","nninit","CUDA 7.0+","CuDNN v4+",[14,13],"2026-03-27T02:49:30.150509","2026-04-06T05:44:09.447157",[113,118,123,128,133,137],{"id":114,"question_zh":115,"answer_zh":116,"source_url":117},16692,"如何在 train-cifar.lua 中配置和使用 lab-workbook 进行快照保存？","lab-workbook 需要配置文件 `~\u002F.lab-workbook-config`。如果你使用 AWS，可以先运行 `aws configure` 命令来设置凭证，这通常是配置该库的好方法。如果不喜欢这个库，也可以直接替换代码中的相关调用，使用你自己选择的日志记录或保存方式，因为每个调用的功能通常都很直观。详细文档可参考：https:\u002F\u002Fgithub.com\u002Fgcr\u002Flab-workbook#readme","https:\u002F\u002Fgithub.com\u002Fgcr\u002Ftorch-residual-networks\u002Fissues\u002F4",{"id":119,"question_zh":120,"answer_zh":121,"source_url":122},16693,"为什么 CIFAR-10 网络架构在最后添加了一个额外的残差层而不是直接接平均池化？","这是一个架构调整。原作者最初的设计导致结果不稳定，后来发现是因为测试集大小设置错误（ typo）。修正测试集大小并调整架构（包括保留该层或按论文标准修改）后，结果变得更加稳定。建议检查你的测试集加载逻辑，确保使用了完整的 10k 测试图像进行评估。","https:\u002F\u002Fgithub.com\u002Fgcr\u002Ftorch-residual-networks\u002Fissues\u002F2",{"id":124,"question_zh":125,"answer_zh":126,"source_url":127},16694,"模型训练不收敛或收敛缓慢怎么办？","如果遇到不收敛的问题，请尝试移除 `nn.LogSoftMax()` 层之前的 `nn.ReLU` 层。此外，权重初始化对收敛速度至关重要，Torch 默认的权重初始化可能导致收敛变慢。如果网络几乎没有进展，建议检查权重初始化策略。","https:\u002F\u002Fgithub.com\u002Fgcr\u002Ftorch-residual-networks\u002Fissues\u002F1",{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},16695,"为什么我的残差网络测试误差只能降到 9% 左右且下降缓慢？","这可能是由参数初始化不佳引起的。有用户反馈，当输入特征图数量与残差块输出不同时，使用卷积核大小为 1 的跳跃连接实现是可行的。如果效果不好，重点检查参数初始化方法，并确保在足够的 epoch（例如超过 80 个 epoch）后适当降低学习率（如除以 10）。良好的初始化能让网络几乎立即开始取得进展。","https:\u002F\u002Fgithub.com\u002Fgcr\u002Ftorch-residual-networks\u002Fissues\u002F7",{"id":134,"question_zh":135,"answer_zh":136,"source_url":122},16696,"如何验证 CIFAR-10 的测试结果是否正确？","确保你在评估时使用了完整的 10,000 张 CIFAR 测试图像。之前有案例因为测试集大小设置错误（只用了很少的图片），导致评估运行时间极短且结果不稳定。修正测试集大小后，重新运行实验以获得准确的误差率。",{"id":138,"question_zh":139,"answer_zh":140,"source_url":122},16697,"有没有其他复现 ResNet 的结果或替代实现可以参考？","除了本仓库，社区还有其他实现可供参考对比。例如，有用户使用 CNTK 复现了 CIFAR-10 20 层网络，达到了约 8.19% 的错误率；也有用户提供了基于 Torch 的改进版本（https:\u002F\u002Fgithub.com\u002Feriche2016\u002Fdeep_residual_networks_cifar）。对比不同框架的实现细节有助于排查问题。",[]]