[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-cresset-template--cresset":3,"tool-cresset-template--cresset":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":81,"owner_twitter":80,"owner_website":80,"owner_url":82,"languages":83,"stars":96,"forks":97,"last_commit_at":98,"license":99,"difficulty_score":10,"env_os":100,"env_gpu":101,"env_ram":102,"env_deps":103,"category_tags":113,"github_topics":114,"view_count":23,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":132,"updated_at":133,"faqs":134,"releases":164},4106,"cresset-template\u002Fcresset","cresset","Template repository to build PyTorch projects from source on any version of PyTorch\u002FCUDA\u002FcuDNN.","Cresset 是一个专为深度学习打造的开源项目模板，旨在帮助开发者在任何版本的 PyTorch、CUDA 和 cuDNN 环境下，轻松构建可复现的源代码训练项目。它本质上是一套基于 Docker Compose 的现代化 MLOps 系统，核心目标是解决深度学习领域中普遍存在的环境配置难题。\n\n在传统的开发流程中，研究人员往往耗费大量时间在适配显卡驱动、安装特定版本的 CUDA 工具包以及处理依赖冲突上，导致实验难以复现且协作效率低下。Cresset 通过容器化技术将这些复杂的环境依赖封装起来，让用户只需简单几步即可启动交互式开发环境，彻底消除了“在我机器上能跑”的兼容性困扰。无论是需要快速验证想法的学术研究者，还是追求工程稳定性的企业开发人员，都能从中受益。\n\n其独特的技术亮点在于极高的灵活性与规范性：它不仅支持从旧版到最新版的各类 PyTorch 组合，还内置了 pre-commit 等代码质量检查工具，倡导业界最佳实践。用户无需具备深厚的运维背景，只需在终端运行几条命令，即可在 Linux、Windows (WSL2) 或 Mac 上获得一致且高效的训练体验，让精力真正回归到模型创新本身。","Cresset 是一个专为深度学习打造的开源项目模板，旨在帮助开发者在任何版本的 PyTorch、CUDA 和 cuDNN 环境下，轻松构建可复现的源代码训练项目。它本质上是一套基于 Docker Compose 的现代化 MLOps 系统，核心目标是解决深度学习领域中普遍存在的环境配置难题。\n\n在传统的开发流程中，研究人员往往耗费大量时间在适配显卡驱动、安装特定版本的 CUDA 工具包以及处理依赖冲突上，导致实验难以复现且协作效率低下。Cresset 通过容器化技术将这些复杂的环境依赖封装起来，让用户只需简单几步即可启动交互式开发环境，彻底消除了“在我机器上能跑”的兼容性困扰。无论是需要快速验证想法的学术研究者，还是追求工程稳定性的企业开发人员，都能从中受益。\n\n其独特的技术亮点在于极高的灵活性与规范性：它不仅支持从旧版到最新版的各类 PyTorch 组合，还内置了 pre-commit 等代码质量检查工具，倡导业界最佳实践。用户无需具备深厚的运维背景，只需在终端运行几条命令，即可在 Linux、Windows (WSL2) 或 Mac 上获得一致且高效的训练体验，让精力真正回归到模型创新本身。","# Cresset: The One Template to Train Them All\n\n[![GitHub stars](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcresset-template\u002Fcresset?style=flat)](https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fstargazers)\n[![GitHub issues](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002Fcresset-template\u002Fcresset?style=flat)](https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fissues)\n[![GitHub 
forks](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002Fcresset-template\u002Fcresset?style=flat)](https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fnetwork)\n[![pre-commit](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpre--commit-enabled-brightgreen?logo=pre-commit)](https:\u002F\u002Fgithub.com\u002Fpre-commit\u002Fpre-commit)\n[![GitHub license](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fcresset-template\u002Fcresset?style=flat)](https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fblob\u002Fmain\u002FLICENSE)\n[![DOI](https:\u002F\u002Fzenodo.org\u002Fbadge\u002FDOI\u002F10.5281\u002Fzenodo.7939089.svg)](https:\u002F\u002Fdoi.org\u002F10.5281\u002Fzenodo.7939089)\n[![Twitter](https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Furl?url=https%3A%2F%2Fgithub.com%2Fcresset-template%2Fcresset)](https:\u002F\u002Ftwitter.com\u002Fintent\u002Ftweet?text=Awesome_Project!!!:&url=https%3A%2F%2Fgithub.com%2Fcresset-template%2Fcresset)\n\n![Cresset Logo](https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fblob\u002Fmain\u002Fassets\u002Flogo.png \"Logo\")\n\n---\n\n## TL;DR\n\n**_A new MLOps system for deep learning development using Docker Compose\nwith the aim of providing reproducible and easy-to-use interactive\ndevelopment environments for deep learning practitioners.\nHopefully, the methods presented here will become\nbest practice in both academia and industry._**\n\n## Introductory Video (In English)\n\n## [![Weights and Biases Presentation](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fcresset-template_cresset_readme_0d132817234e.jpg)](https:\u002F\u002Fyoutu.be\u002FsW3VxlJl46o?t=6865 \"Weights and Biases Presentation\")\n\n## Installation on a New Host\n\nIf this is your first time using this project, follow these steps:\n\n1. Install the NVIDIA CUDA [Driver](https:\u002F\u002Fwww.nvidia.com\u002Fdownload\u002Findex.aspx)\n   appropriate for the target host and NVIDIA GPU.\n   If the driver has already been installed,\n   check that the installed version is compatible with the target CUDA version.\n   CUDA driver version mismatch is the single most common issue for new users.\n   See the\n   [compatibility matrix](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-toolkit-release-notes\u002Findex.html#cuda-major-component-versions__table-cuda-toolkit-driver-versions)\n   for compatible versions of the CUDA driver and CUDA Toolkit.\n\n2. Install [Docker](https:\u002F\u002Fdocs.docker.com\u002Fget-docker) (v23.0+ is recommended)\n   or update to a recent version compatible with Docker Compose V2.\n   Docker incompatibility with Docker Compose V2 is another common issue for new users.\n   Note that Windows users may use WSL (Windows Subsystem for Linux).\n   Cresset has been tested on Windows 11 WSL2 with the Windows CUDA driver\n   using Docker Desktop for Windows. There is no need to install a separate\n   WSL CUDA driver or Docker for Linux inside WSL.\n   Note that only Docker Desktop is under a commercial EULA and Docker Engine\n   (for Linux) and Lima Docker (for Mac) are still both open-source.\n   _N.B._ Windows Security real-time protection causes significant slowdown if enabled.\n   Disable any active antivirus programs on Windows for best performance.\n   _N.B._ Linux hosts may also install via this\n   [repo](https:\u002F\u002Fgithub.com\u002Fdocker\u002Fdocker-install).\n\n3. 
Install the NVIDIA Container Toolkit as specified in this\n   [link](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html).\n\n4. Run `make install-compose` to install Docker Compose V2 for Linux hosts.\n   Installation does _**not**_ require `root` permissions. Visit the\n   [documentation](https:\u002F\u002Fdocs.docker.com\u002Fcompose\u002Fcli-command\u002F#install-on-linux)\n   for the latest installation information. Note that Docker Compose V2\n   is available for WSL users with Docker Desktop by default.\n\n5. Run `make env SERVICE=(train|devel|ngc|simple)` on the terminal\n   at project root to create a basic `.env` file.\n   The `.env` file provides environment variables for `docker-compose.yaml`,\n   allowing different users and machines to set their own variables as required.\n   The Makefile has also been configured to read values from the `.env` file\n   if it exists, allowing non-default values to be specified only once.\n   Each host should have a separate `.env` file for host-specific configurations.\n\n6. Run `make over` to create a `docker-compose.override.yaml` file.\n   Add configurations that should not be shared via source control there.\n   For example, volume-mount pairs specific to each host machine.\n\n7. If Cresset is being placed within a pre-existing project's subdirectory,\n   change the `volume` pairing from `.:${PROJECT_ROOT}` to `..:${PROJECT_ROOT}`.\n   All commands in Cresset assume that they are being run at project root\n   but this can be changed easily.\n\n### Explanation of services\n\nDifferent Docker Compose services are organized to serve different needs.\n\n- `train`, the default service, should be used when compiled dependencies are\n  necessary or when PyTorch needs to be compiled from source due to\n  Compute Capability issues, etc.\n- `devel` is designed for PyTorch CUDA\u002FC++ developers who need to recompile\n  frequently and have many complex dependencies.\n- `ngc` is derived from the official NVIDIA PyTorch NGC images with the option\n  to install additional packages. It is recommended for users who wish to base\n  their projects on the NGC images provided by NVIDIA. Note that the NGC images\n  change between different releases and that configurations for one\n  release may not work for another one.\n- `simple` is derived from the Official Ubuntu Linux image by default as some\n  corporations restrict the use of Docker images not officially verified by\n  Docker. It installs all packages via `conda` by default and can optionally\n  install highly reproducible environments via `conda-lock`. Note that\n  `pip` packages can also be installed via `conda`. Also, the base image can\n  be configured to use images other than the Official Linux Docker images\n  by specifying the `BASE_IMAGE` argument directly in the `.env` file.\n  PyTorch runtime performance may be superior in official NVIDIA CUDA images\n  under certain circumstances. Use the tests to benchmark runtime speeds.\n  **The `simple` service is recommended for users without compiled dependencies.**\n\nThe `Makefile` has been configured to take values specified in the `.env` file\nif the `.env` file exists. 
Therefore, all `make` commands will automatically\nuse the `${SERVICE}` specified by `make env SERVICE=${SERVICE}` after the\n`.env` file is created.\n\n### Notes for Rootless Users\n\nMany institutions forbid the use of Docker because it requires `root` permissions, compromising security.\nFor users without Docker `root` access, using rootless Docker\n[link](https:\u002F\u002Fdocs.docker.com\u002Fengine\u002Fsecurity\u002Frootless) is recommended.\n\nWhile installing rootless Docker requires root permissions on the host,\nroot permissions are not necessary after the initial installation.\n\nWhen using rootless Docker, it is most convenient to set `ADD_USER=exclude` in the `.env` file\nas the `root` user will be the host user in rootless Docker.\n\n## Project Configuration\n\n1. To build PyTorch from source, set `BUILD_MODE=include` and the\n   CUDA Compute Capability (CCC) of the target NVIDIA GPU in the `.env` file.\n   Visit the NVIDIA [website](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-gpus#compute)\n   to find compute capabilities of NVIDIA GPUs. Visit the\n   [documentation](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-c-programming-guide\u002Findex.html#compute-capabilities)\n   for an explanation of compute capability and its relevance.\n   Note that the Docker cache will save previously built binaries\n   if the given configurations are identical.\n\n2. Read the `docker-compose.yaml` file to fill in extra variables in `.env`.\n   Also, feel free to edit `docker-compose.yaml` as necessary by changing\n   session names, hostnames, etc. for different projects and configurations.\n   The `docker-compose.yaml` file provides reasonable default values but these\n   can be overridden by values specified in the `.env` file.\n   An important configuration is `ipc: host`, which allows the container to\n   access the shared memory of the host. This is required for multiprocessing,\n   e.g., to use `num_workers` in the PyTorch `DataLoader` class.\n   Disable this configuration on WSL and specify `shm_size:` instead as WSL\n   cannot use host IPC as of the time of writing.\n\n3. Edit requirements in `reqs\u002Fapt-train.requirements.txt`\n   and `reqs\u002Ftrain-environment.yaml`.\n   These contain project package dependencies.\n   The `apt` requirements are designed to resemble an\n   ordinary Python `requirements.txt` file.\n\n4. Edit the `volumes` section of a service\n   to include external directories in the container environment.\n   Run `make over` to create a `docker-compose.override.yaml` file\n   to add custom volumes and configurations.\n   The `docker-compose.override.yaml` file is excluded from version control\n   to allow per-user and per-server settings.\n\n5. 
(Advanced) If an external file must be included in the Docker image build process,\n   edit the `.dockerignore` file to allow the Docker context to find the external file.\n   By default, all files except requirements\n   files are excluded from the Docker build context.\n\nExample `.env` file for user with username `USERNAME`,\ngroup name `GROUPNAME`, user id `1000`, group id `1000` on service `train`.\nUse the `simple` service if no dependencies need to be compiled and requirements\ncan either be downloaded or installed via `apt`, `conda`, or `pip`.\n\n```text\n# Generated automatically by `make env`.\n# When using the `root` user with `UID=0`\u002F`USR=root`, set `ADD_USER=exclude`.\nGID=1000\nUID=1000\nGRP=GROUPNAME\nUSR=USERNAME\nHOST_ROOT=.\nSERVICE=train\n# Do not use the same `PROJECT` name for different projects on the same host!\nPROJECT=train-username             # `PROJECT` must be in lowercase.\nPROJECT_ROOT=\u002Fopt\u002Fproject\nIMAGE_NAME=cresset:train-username  # `IMAGE_NAME` is also converted to lowercase.\nCOMMAND=\u002Fusr\u002Fbin\u002Fzsh --login       # Command to execute on starting the container.\nTZ=Asia\u002FSeoul                      # Set the container timezone.\n\n# [[Optional]]: Fill in these configurations manually if the defaults do not suffice.\n\n# NVIDIA GPU Compute Capability (CCC) values may be found at https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-gpus\nCCC=8.6              # Compute capability. CCC=8.6 for RTX3090.\n# CCC='8.6+PTX'      # The '+PTX' enables forward compatibility. Multiple CCCs can also be specified.\n# CCC='7.5 8.6+PTX'  # Visit https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Fcpp_extension.html for details.\n\n# Used only if building PyTorch from source (`BUILD_MODE=include`).\n# The `*_TAG` variables are used only if `BUILD_MODE=include`. No effect otherwise.\nBUILD_MODE=exclude               # Whether to build PyTorch from source.\nPYTORCH_VERSION_TAG=v2.0.0       # Any `git` tag can be used (but not just any commit hash).\nTORCHVISION_VERSION_TAG=v0.15.1\n\n# General environment configurations.\nLINUX_DISTRO=ubuntu   # Visit the NVIDIA Docker Hub repo for available base images.\nDISTRO_VERSION=22.04  # https:\u002F\u002Fhub.docker.com\u002Fr\u002Fnvidia\u002Fcuda\u002Ftags\nCUDA_VERSION=11.8.0   # Must be compatible with hardware and CUDA driver.\nCUDNN_VERSION=8       # Only major version specifications are available.\nPYTHON_VERSION=3.10   # Specify the Python version.\nMKL_MODE=include      # Enable MKL for Intel CPUs.\n\n# Advanced Usage.\nTARGET_STAGE=train    # Target Dockerfile stage. The `*.whl` files are available in `train-builds`.\nADD_USER=include      # Whether to create a new user (include) or use `root` user (exclude).\n```\n\n## General Usage After Initial Installation and Configuration\n\n1. Run `make build` to build the image from the Dockerfile and start the service.\n   The `make` commands are defined in the\n   `Makefile` and target the `train` service by default.\n   Run `make up` if the image has already been built and\n   rebuilding the image from the Dockerfile is not necessary.\n2. Run `make exec` to enter the interactive container environment.\n   Using `tmux` inside the container is recommended.\n3. There is no step 3. Just start coding.\n   Check out the documentation or create an issue if anything goes wrong.\n\n## Makefile Instructions\n\nThe Makefile contains shortcuts for common docker compose commands.\nPlease read the Makefile to see the exact commands.\n\n1. 
`make build` builds the Docker image from the Dockerfile\n   regardless of whether the image already exists.\n   This will reinstall packages according to the updated requirements files,\n   and then recreate the container.\n2. `make up` creates a fresh container from the image,\n   undoing any changes to the container made by the user.\n   Allows changing container settings such as network ports,\n   mounted volumes, shared memory configurations, etc.\n   Recommended method for using this project.\n3. `make exec` enters the interactive terminal of the container\n   created by `make build` or `make up`.\n4. `make down` stops Compose containers and deletes networks.\n   Necessary for service teardown.\n5. `make start` restarts a stopped container without recreating it.\n   Similar to `make up` but does not delete the current container.\n   Not recommended unless data saved in the container are absolutely necessary.\n6. `make ls` shows all Docker Compose services, both active and inactive.\n7. `make run` is used for debugging. Containers are removed on exit.\n   If a service fails to start, use this to find the error.\n8. `make build-only` builds the Docker image from the Dockerfile\n   without starting the service.\n   It exists to help publish images to container registries.\n\n### Tips\n\n- The `PROJECT`, `SERVICE`, and `COMMAND` variables in the Makefile\n  use variables specified in the `.env` file if available.\n- If something does not work, first try `make down` to remove the current container and\n  then `make up` to create a new container from the image.\n  Explicitly tearing the container down is often necessary when something happens to the host.\n- If the service startup stalls during `make up`,\n  check `docker system df` to see if there is space left on the host machine.\n- `make up` is akin to rebooting a computer.\n  The current container is removed and a new container is created from the current image.\n- `make build` is akin to resetting\u002Fformatting a computer.\n  The current image, if present, is removed and a new image is built from the Dockerfile,\n  after which a container is created from the resulting image.\n  In contrast, `make up`\n  only creates an image from source if the specified image is not present.\n- `make exec` is akin to logging into a computer.\n  It is the most important command\n  and allows the user to access the container's terminal interactively.\n- Configurations such as connected volumes and network ports cannot\n  be changed in a running container, requiring a new container to be created.\n- Docker automatically caches all builds up to `defaultKeepStorage`.\n  Builds use caches from previous builds by default,\n  greatly speeding up later builds by only building modified layers.\n- If the build fails during `git clone`,\n  try `make build` again with a stable internet connection.\n- If the build fails during `pip install`,\n  check the PyPI mirror URLs and package requirements.\n- If any networking issues arise, run `docker network ls` and check for conflicts.\n  Most networking and SSH problems can be solved by running `docker network prune`.\n\n## Project Overview\n\nThe main components of the project are as follows. The other files are utilities.\n\n1. Dockerfile\n2. docker-compose.yaml\n3. docker-compose.override.yaml\n4. reqs\u002F(`*requirements.txt`|`*environment.yaml`)\n5. 
.env\n\nWhen the user inputs `make up` or another `make` command,\ncommands specified in the `Makefile` are executed.\nThe `Makefile` is used to specify shorthand commands and variables.\n\nWhen a command related to Docker Compose (e.g., `make build`) is executed,\nthe `docker-compose.yaml` file and the `.env` file are read by Docker Compose.\nThe `docker-compose.yaml` file specifies reasonable default values\nbut users may wish to change them as per their needs.\nThe values specified in the `.env` file take precedence over\nthe defaults specified in the `docker-compose.yaml` file.\nEnvironment variables specified in the shell\ntake precedence over those in the `.env` file.\nThe `.env` file is deliberately excluded from source control\nto allow different users and machines to use different configurations.\n\nThe `docker-compose.yaml` file manages configurations,\nbuilds, runs, etc. using the `Dockerfile`.\nVisit the Docker Compose [Specification](https:\u002F\u002Fgithub.com\u002Fcompose-spec\u002Fcompose-spec\u002Fblob\u002Fmaster\u002Fspec.md)\nand [Reference](https:\u002F\u002Fdocs.docker.com\u002Fcompose\u002Fcompose-file\u002Fcompose-file-v3\u002F) for details.\n\nThe `docker-compose.override.yaml` is read by the `docker-compose.yaml` file\nduring the setup phase. Add configurations specific to each host that should not be\nshared via source control, such as volume mounts for host-specific paths.\n\nThe `Dockerfile` is configured to read only requirements files in the `reqs` directory.\nEdit `reqs\u002Fpip-train.requirements.txt` to specify Python package requirements.\nEdit `reqs\u002Fapt-train.requirements.txt` to specify Ubuntu package requirements.\nUsers must edit the `.dockerignore` file to `COPY` other files into the Docker build,\nfor example, when building from private code.\n\nThe `Dockerfile` uses Docker BuildKit and a multi-stage build where\ncontrol flow is specified via stage names and build-time environment variables\ngiven via `docker-compose.yaml`. 
See the Docker BuildKit\n[Syntax](https:\u002F\u002Fgithub.com\u002Fmoby\u002Fbuildkit\u002Fblob\u002Fmaster\u002Ffrontend\u002Fdockerfile\u002Fdocs\u002Fsyntax.md)\nfor more information.\nThe `train` service specified in the `docker-compose.yaml` file uses\nthe `train` stage specified in the `Dockerfile`, which assumes an Ubuntu image.\n\n## _Raison d'Être_\n\nThe purpose of this section is to introduce a new paradigm for deep learning development.\nThe hope is that Cresset, or at least the ideas behind it, will eventually become\nbest practice for small to medium-scale deep learning research and development.\n\nDeveloping in local environments with `conda` or `pip`\nis commonplace in the deep learning community.\nHowever, this risks rendering the development environment,\nand the code meant to run on it, unreproducible.\nThis state of affairs is a serious detriment to scientific progress\nthat many readers of this article will have experienced first-hand.\n\nDocker containers are the standard method for providing reproducible programs\nacross different computing environments.\nThey create isolated environments where programs\ncan run without interference from the host or from one another.\nFor details, see the\n[documentation](https:\u002F\u002Fwww.docker.com\u002Fresources\u002Fwhat-container).\n\nBut in practice, Docker containers are often misused.\nContainers are meant to be transient and best practice dictates\nthat a new container be created for each run.\nHowever, this is very inconvenient for development,\nespecially for deep learning applications,\nwhere new libraries must constantly be installed and\nbugs are often only evident at runtime.\nThis leads many researchers to develop inside interactive containers.\nDocker users often have `run.sh` files with commands such as\n`docker run -v my_data:\u002Fmnt\u002Fdata -p 8080:22 -t my_container my_image:latest \u002Fbin\u002Fbash`\n(look familiar, anyone?) 
and use SSH to connect to running containers.\nVSCode even provides a remote development mode to code inside containers.\n\nThe problem with this approach is that these interactive containers\nbecome just as unreproducible as local development environments.\nA running container cannot connect to a new port or attach a new\n[volume](https:\u002F\u002Fdocs.docker.com\u002Fstorage\u002Fvolumes).\nBut if the computing environment within the container was created over\nseveral months of installs and builds, the only way to keep it is to\nsave the container as an image and create a new container from the saved image.\nAfter a few iterations of this process, the resulting images become bloated and\nno less scrambled than the local environments that they were meant to replace.\n\nProblems become even more evident when preparing for deployment.\nMLOps, defined as a set of practices that aims to deploy and maintain\nmachine learning models reliably and efficiently, has gained enormous popularity\nof late as many practitioners have come to realize the importance of\ncontinuously maintaining ML systems long after the initial development phase ends.\n\nHowever, bad practices such as those mentioned above mean that much coffee has\nbeen spilled turning research code into anything resembling a production-ready product.\nOften, even the original developers cannot recreate the same model after a few months.\nMany firms thus have entire teams dedicated to model translation, a huge expenditure.\n\nTo alleviate these problems, Docker Compose is proposed as a simple MLOps solution.\nUsing Docker and Docker Compose, the entire training environment can be reproduced.\nCompose has not yet caught on in the deep learning community,\npossibly because it is usually advertised as a multi-container solution.\nThis is a misunderstanding\nas it can be used for single-container development just as well.\n\nA `docker-compose.yaml` file is provided for easy management of containers.\n**Using the provided `docker-compose.yaml` file will create an interactive environment,\nproviding a programming experience very similar to using a terminal on a remote server.\nIntegrations with popular IDEs (PyCharm, VSCode) are also available.**\n\nMoreover, it allows the user to specify settings for both build and run,\nremoving the need to manage the environment with custom shell scripts.\nConnecting a new volume or port is as simple as removing the current container,\nadding a line in the `docker-compose.yaml` file, then running `make up`\nto create a new container from the same image.\n\nBuild caches allow new images to be built very quickly,\nremoving another barrier to Docker adoption, the long initial build time.\nFor more information on Compose, visit the\n[documentation](https:\u002F\u002Fdocs.docker.com\u002Fcompose).\n\nDocker [Compose](https:\u002F\u002Fwww.compose-spec.io) can also be used for deployment,\nwhich is useful for small to medium-sized deployments.\nIf and when large-scale deployments using container orchestration such as\nKubernetes become necessary, using reproducible Docker environments from\nthe very beginning will accelerate the development process\nand smooth the path to MLOps adoption.\nAccelerating time-to-market by streamlining the development process\nis a competitive edge for any firm, whether lean startup or tech titan.\n\nWith luck, the techniques proposed here will enable\nthe deep learning community to \"_write once, train anywhere_\".\nBut even if most users are not persuaded of the merits of 
this method,\nmany a hapless grad student may be spared the\nSisyphean labor of setting up their `conda` environment,\nonly to have it crash and burn right before their paper submission is due.\n\n## Compose as Best Practice\n\nDocker Compose is superior to using custom shell scripts for each environment.\nNot only does it gather all variables and commands\nfor both build and run into a single file,\nbut its native integration with Docker makes complicated\nDocker build\u002Frun setups simple to implement and use.\n\nUsing Docker Compose this way is a general-purpose technique\nthat does not depend on anything specific to this project.\nThe other services available in the project emphasize this point.\n\n### Using Compose with PyCharm and VSCode\n\nThe Docker Compose container environment can be used with popular Python IDEs,\nnot just in the terminal.\nPyCharm and Visual Studio Code, both very popular in the deep learning community,\nare compatible with Docker Compose.\n\n#### PyCharm (Professional only)\n\nBoth Docker and Docker Compose are natively available as Python interpreters.\nSee tutorials for [Docker](https:\u002F\u002Fwww.jetbrains.com\u002Fhelp\u002Fpycharm\u002Fdocker.html) and\n[Compose](https:\u002F\u002Fwww.jetbrains.com\u002Fhelp\u002Fpycharm\u002Fusing-docker-compose-as-a-remote-interpreter.html#summary)\nfor details. JetBrains [Gateway](https:\u002F\u002Fwww.jetbrains.com\u002Fremote-development\u002Fgateway)\ncan also be used to connect to running containers.\n\nWhen using the `ngc` service, add `\u002Fusr\u002Flocal\u002Flib\u002Fpython3\u002Fdist-packages` and\n`\u002Fopt\u002Fconda\u002Flib\u002Fpython3\u002Fsite-packages` to the interpreter search paths via\nthe GUI to enable code assistance on the packages installed with `conda`.\n\n_N.B._ PyCharm Professional and other JetBrains IDEs are available\nfree of charge to anyone with a valid university e-mail address.\n\n#### VSCode\n\nInstall the Remote Development extension pack. 
See\n[tutorial](https:\u002F\u002Fcode.visualstudio.com\u002Fdocs\u002Fremote\u002Fcontainers-tutorial)\nfor details.\n\n##### VSCode Tips\n\nVSCode may fail to start up when accessing remote containers created by\nCresset because of the `${HOME}\u002F.vscode-server` volume mounted in the\n`docker-compose.yaml` file, which is used to preserve the `.vscode-server`\ndirectory between separate containers.\n\nThe reason for VSCode connection failure is that if any host directory\nspecified as a volume does not exist, Docker will automatically create\nthe specified host directory with the directory owner set to `root`.\nDirectories that already exist retain their directory ownership.\nWhen the `.vscode-server` directory is created by Docker this way,\nVSCode is unable to install any files in the `.vscode-server` directory.\n\nThis has been fixed in the Makefile but problems related to\nthe `.vscode-server` directory occur frequently.\nTo solve this problem, simply change the directory ownership to the\nuser with `sudo chown -R $(id -u):$(id -g) ${HOME}\u002F.vscode-server`.\nThis command can be run either on the host or inside the container,\nwhich is useful if `sudo` permissions are unavailable on the host.\n\nAlso, when one user switches between multiple Cresset-based containers\non a single machine, VSCode may not be able to find the container workspace.\nThis is because the `docker-compose.yaml` file mounts the host's\n`~\u002F.vscode-server` directory to the `\u002Fhome\u002F${USR}\u002F.vscode-server` directory\nof all containers to preserve VSCode extensions between containers.\nTo fix this issue, create a new directory on the host\nto mount the containers' `.vscode-server` directories.\nFor example, one can set volume pairs as\n`${HOME}\u002F.vscode-project1:\u002Fhome\u002F${USR}\u002F.vscode-server` for project1 and\n`${HOME}\u002F.vscode-project2:\u002Fhome\u002F${USR}\u002F.vscode-server` for project2.\nDo not forget to create `${HOME}\u002F.vscode-project1` and\n`${HOME}\u002F.vscode-project2` on the host first.\nOtherwise, the directory will be owned by `root`,\nwhich will cause VSCode to stall indefinitely due to permission issues.\n\nFor other VSCode problems, try deleting `~\u002F.vscode-server` on the host.\n\n# Known Issues\n\n1. Connecting to a running container by `ssh` will remove all variables\n   set by `ENV`. This is because `sshd` starts a new environment,\n   deleting all previous variables. Using `docker`\u002F`docker compose`\n   to enter containers is strongly recommended.\n\n2. `pip install package[option]` will fail on the terminal because of\n   Z-shell globbing. Characters such as `[`,`]`,`*`, etc. will be\n   interpreted by Z-shell as special commands. Use string literals,\n   e.g., `pip install 'package[option]'`, for cross-shell consistency.\n\n3. If the build fails during `git clone`, simply try `make build` again.\n   Most of the build will be cached. Failure is probably due to\n   networking issues during installation. Updating git submodules is\n   [not fail-safe](https:\u002F\u002Fstackoverflow.com\u002Fa\u002F8573310\u002F9289275).\n\n4. `torch.cuda.is_available()` will return a\n   `... 
UserWarning: CUDA initialization:...`\n   error or the image will simply not start if the host CUDA driver is\n   incompatible with the CUDA version on the Docker image.\n   Either upgrade the host CUDA driver or downgrade the CUDA version of the image.\n   Check the\n   [compatibility matrix](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-toolkit-release-notes\u002Findex.html#cuda-major-component-versions__table-cuda-toolkit-driver-versions)\n   to see if the host CUDA driver is compatible with the desired version of CUDA.\n   Also, check if the CUDA driver has been configured correctly on the host.\n   The CUDA driver version can be found using the `nvidia-smi` command.\n\n5. Docker Compose V2 will silently fail if the installed Docker engine\n   version is too low on Linux hosts. Update Docker to the latest\n   version (23.0+) to use Docker Compose V2.\n\n6. If the user is set to `root` in the `.env` file, i.e., `UID=0, USR=root`,\n   then set `ADD_USER=exclude` to prevent the creation of a new user, which is\n   expected to be non-root.\n\n# Desiderata\n\n1. **MORE STARS**. _**No Contribution Without Appreciation!**_\n\n2. Bug reports are welcome. Only the latest versions have been tested rigorously.\n   Please raise an issue if there are any versions that do not build properly.\n   However, please check that your host Docker, Docker Compose,\n   and especially NVIDIA Driver are up-to-date before doing so.\n\n3. Translations into other languages and updates to existing translations are welcome.\n   Please create a separate `LANG.README.md` file and make a pull request.\n","# Cresset：一个模板，搞定所有训练\n\n[![GitHub 星标](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcresset-template\u002Fcresset?style=flat)](https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fstargazers)\n[![GitHub 问题](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues\u002Fcresset-template\u002Fcresset?style=flat)](https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fissues)\n[![GitHub 分支](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002Fcresset-template\u002Fcresset?style=flat)](https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fnetwork)\n[![pre-commit](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpre--commit-enabled-brightgreen?logo=pre-commit)](https:\u002F\u002Fgithub.com\u002Fpre-commit\u002Fpre-commit)\n[![GitHub 许可证](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fcresset-template\u002Fcresset?style=flat)](https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fblob\u002Fmain\u002FLICENSE)\n[![DOI](https:\u002F\u002Fzenodo.org\u002Fbadge\u002FDOI\u002F10.5281\u002Fzenodo.7939089.svg)](https:\u002F\u002Fdoi.org\u002F10.5281\u002Fzenodo.7939089)\n[![Twitter](https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Furl?url=https%3A%2F%2Fgithub.com%2Fcresset-template%2Fcresset)](https:\u002F\u002Ftwitter.com\u002Fintent\u002Ftweet?text=Awesome_Project!!!:&url=https%3A%2F%2Fgithub.com%2Fcresset-template%2Fcresset)\n\n![Cresset Logo](https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fblob\u002Fmain\u002Fassets\u002Flogo.png \"Logo\")\n\n---\n\n## 简要说明\n\n**_一种基于 Docker Compose 的新型 MLOps 系统，用于深度学习开发，旨在为深度学习从业者提供可复现且易于使用的交互式开发环境。\n希望此处介绍的方法能够成为学术界和工业界的最佳实践。_**\n\n## 入门视频（英文）\n\n## [![Weights and Biases 演示](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fcresset-template_cresset_readme_0d132817234e.jpg)](https:\u002F\u002Fyoutu.be\u002FsW3VxlJl46o?t=6865 \"Weights and Biases 演示\")\n\n## 
在新主机上安装\n\n如果您是首次使用该项目，请按照以下步骤操作：\n\n1. 安装适用于目标主机和 NVIDIA GPU 的 NVIDIA CUDA [驱动程序](https:\u002F\u002Fwww.nvidia.com\u002Fdownload\u002Findex.aspx)。\n   如果已安装驱动程序，请确认其版本与目标 CUDA 版本兼容。\n   CUDA 驱动版本不匹配是新用户最常见的问题。\n   请参阅\n   [兼容性矩阵](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-toolkit-release-notes\u002Findex.html#cuda-major-component-versions__table-cuda-toolkit-driver-versions)\n   以获取兼容的 CUDA 驱动和 CUDA 工具包版本。\n\n2. 安装 [Docker](https:\u002F\u002Fdocs.docker.com\u002Fget-docker)（建议使用 v23.0 及以上版本）\n   或更新至与 Docker Compose V2 兼容的最新版本。\n   Docker 与 Docker Compose V2 不兼容也是新用户的常见问题。\n   注意，Windows 用户可以使用 WSL（Windows Subsystem for Linux）。\n   Cresset 已在 Windows 11 WSL2 上使用 Windows CUDA 驱动程序和 Docker Desktop for Windows 进行测试。\n   无需在 WSL 内单独安装 WSL CUDA 驱动程序或 Linux 版 Docker。\n   请注意，只有 Docker Desktop 受商业 EULA 约束，而 Docker Engine（Linux）和 Lima Docker（Mac）仍然是开源的。\n   _注意_：Windows 安全实时防护功能启用时会导致性能显著下降。\n   为获得最佳性能，请在 Windows 上禁用任何正在运行的杀毒软件。\n   _注意_：Linux 主机也可以通过此\n   [仓库](https:\u002F\u002Fgithub.com\u002Fdocker\u002Fdocker-install) 进行安装。\n\n3. 按照此\n   [链接](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html)\n   安装 NVIDIA Container Toolkit。\n\n4. 在 Linux 主机上运行 `make install-compose` 以安装 Docker Compose V2。\n   安装过程 _**不需要**_ root 权限。请访问\n   [文档](https:\u002F\u002Fdocs.docker.com\u002Fcompose\u002Fcli-command\u002F#install-on-linux)\n   获取最新的安装信息。请注意，对于使用 Docker Desktop 的 WSL 用户，Docker Compose V2 默认可用。\n\n5. 在项目根目录的终端中运行 `make env SERVICE=(train|devel|ngc|simple)`，\n   以创建一个基本的 `.env` 文件。\n   `.env` 文件为 `docker-compose.yaml` 提供环境变量，\n   允许不同用户和机器根据需要设置各自的变量。\n   Makefile 也已配置为读取 `.env` 文件中的值（如果存在），\n   从而只需指定一次非默认值。\n   每台主机都应拥有独立的 `.env` 文件，用于特定于该主机的配置。\n\n6. 运行 `make over` 以创建 `docker-compose.override.yaml` 文件。\n   将不应通过源代码管理共享的配置添加到该文件中。\n   例如，特定于每台主机的卷挂载对。\n\n7. 如果 Cresset 被放置在现有项目的子目录中，\n   请将 `volume` 对从 `.:${PROJECT_ROOT}` 更改为 `..:${PROJECT_ROOT}`。\n   Cresset 中的所有命令都假定是在项目根目录下执行的，\n   但这一设置可以轻松更改。\n\n### 服务说明\n\n不同的 Docker Compose 服务被组织起来以满足不同的需求。\n\n- `train` 是默认服务，应在需要编译依赖项时使用，\n  或当由于计算能力等问题需要从源代码编译 PyTorch 时使用。\n- `devel` 专为需要频繁重新编译且具有复杂依赖关系的 PyTorch CUDA\u002FC++ 开发人员设计。\n- `ngc` 源自官方 NVIDIA PyTorch NGC 镜像，并可选择安装额外的软件包。\n   建议希望基于 NVIDIA 提供的 NGC 镜像开展工作的用户使用。\n   请注意，NGC 镜像会随不同版本变化，某个版本的配置可能无法在另一个版本中正常工作。\n- `simple` 默认基于官方 Ubuntu Linux 镜像，因为某些公司限制使用未经 Docker 官方验证的镜像。\n   它默认通过 `conda` 安装所有软件包，并可选择通过 `conda-lock` 安装高度可重复的环境。\n   请注意，也可以通过 `conda` 安装 `pip` 包。\n   此外，可以通过直接在 `.env` 文件中指定 `BASE_IMAGE` 参数来配置基础镜像，\n   使用除官方 Linux Docker 镜像之外的其他镜像。\n   在某些情况下，官方 NVIDIA CUDA 镜像中的 PyTorch 运行时性能可能会更优。\n   请使用测试来基准测试运行速度。\n   **建议没有编译依赖项的用户使用 `simple` 服务。**\n\nMakefile 已被配置为在 `.env` 文件存在时读取其中指定的值。\n因此，在创建 `.env` 文件后，所有 `make` 命令将自动使用由 `make env SERVICE=${SERVICE}` 指定的 `${SERVICE}`。\n\n### 无 root 权限用户注意事项\n\n许多机构禁止使用 Docker，因为它需要 `root` 权限，从而影响安全性。\n对于没有 Docker `root` 访问权限的用户，建议使用无 root 模式的 Docker\n[链接](https:\u002F\u002Fdocs.docker.com\u002Fengine\u002Fsecurity\u002Frootless)。\n\n虽然安装无 root 模式的 Docker 需要主机上的 root 权限，\n但在首次安装完成后，后续操作并不需要 root 权限。\n\n在使用无 root 模式的 Docker 时，最方便的做法是在 `.env` 文件中设置 `ADD_USER=exclude`，\n因为在这种模式下，容器内的 `root` 用户将映射为宿主机的当前用户。\n\n## 项目配置\n\n1. 
若要从源码构建 PyTorch，请在 `.env` 文件中设置 `BUILD_MODE=include`，并指定目标 NVIDIA GPU 的 CUDA 计算能力（CCC）。\n   可访问 NVIDIA [官网](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-gpus#compute)\n   查找 NVIDIA GPU 的计算能力。有关计算能力及其重要性的详细说明，请参阅\n   [文档](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-c-programming-guide\u002Findex.html#compute-capabilities)。\n   请注意，如果给定的配置相同，Docker 缓存会保存之前构建的二进制文件。\n\n2. 阅读 `docker-compose.yaml` 文件，以补充 `.env` 文件中的额外变量。\n   此外，您也可以根据需要编辑 `docker-compose.yaml` 文件，例如更改会话名称、主机名等，\n   以适应不同的项目和配置。`docker-compose.yaml` 文件提供了合理的默认值，\n   但这些默认值可以被 `.env` 文件中指定的值覆盖。\n   其中一个重要配置是 `ipc: host`，它允许容器访问宿主机的共享内存。\n   这对于多进程操作至关重要，例如在 PyTorch 的 `DataLoader` 类中使用 `num_workers`。\n   在 WSL 上请禁用此配置，并改用 `shm_size:`，因为在撰写本文时 WSL 尚不支持宿主机 IPC。\n\n3. 编辑 `reqs\u002Fapt-train.requirements.txt` 和 `reqs\u002Ftrain-environment.yaml` 中的依赖项。\n   这些文件包含了项目的软件包依赖关系。\n   `apt` 依赖项的设计类似于普通的 Python `requirements.txt` 文件。\n\n4. 编辑服务的 `volumes` 部分，以将外部目录挂载到容器环境中。\n   运行 `make over` 命令生成 `docker-compose.override.yaml` 文件，\n   以便添加自定义卷和配置。`docker-compose.override.yaml` 文件不会被纳入版本控制，\n   以允许用户和服务器级别的个性化设置。\n\n5. （高级）如果必须在 Docker 镜像构建过程中包含外部文件，\n   请编辑 `.dockerignore` 文件，以允许 Docker 构建上下文找到该外部文件。\n   默认情况下，除依赖文件外，所有其他文件都会被排除在 Docker 构建上下文之外。\n\n以下是一个适用于用户名为 `USERNAME`、组名为 `GROUPNAME`、用户 ID 为 `1000`、组 ID 为 `1000`\n且服务为 `train` 的用户的 `.env` 文件示例。\n如果无需编译任何依赖项，且所需的软件包可以通过 `apt`、`conda` 或 `pip` 直接下载或安装，\n则可使用 `simple` 服务。\n\n```text\n# 由 `make env` 自动生成。\n# 当使用 `UID=0`\u002F`USR=root` 的 `root` 用户时，请设置 `ADD_USER=exclude`。\nGID=1000\nUID=1000\nGRP=GROUPNAME\nUSR=USERNAME\nHOST_ROOT=.\nSERVICE=train\n# 不要在同一台主机上为不同项目使用相同的 `PROJECT` 名称！\nPROJECT=train-username             # `PROJECT` 必须为小写。\nPROJECT_ROOT=\u002Fopt\u002Fproject\nIMAGE_NAME=cresset:train-username  # `IMAGE_NAME` 也会转换为小写。\nCOMMAND=\u002Fusr\u002Fbin\u002Fzsh --login       # 容器启动时执行的命令。\nTZ=Asia\u002FSeoul                      # 设置容器的时区。\n\n# [[可选]]：如果默认配置不满足需求，可手动填写以下配置。\n\n# NVIDIA GPU 计算能力（CCC）可在 https:\u002F\u002Fdeveloper.nvidia.com\u002Fcuda-gpus 查看\nCCC=8.6              # 计算能力。RTX3090 的 CCC 为 8.6。\n# CCC='8.6+PTX'      # '+PTX' 可实现向前兼容性。也可指定多个 CCC。\n# CCC='7.5 8.6+PTX'  # 更多详情请参阅 https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Fcpp_extension.html。\n\n# 仅在从源码构建 PyTorch 时使用（`BUILD_MODE=include`）。\n# `*_TAG` 变量仅在 `BUILD_MODE=include` 时生效，否则无效。\nBUILD_MODE=exclude               # 是否从源码构建 PyTorch。\nPYTORCH_VERSION_TAG=v2.0.0       # 可使用任意 `git` 标签（但不能仅使用提交哈希）。\nTORCHVISION_VERSION_TAG=v0.15.1\n\n# 通用环境配置。\nLINUX_DISTRO=ubuntu   # 可访问 NVIDIA Docker Hub 仓库获取可用的基础镜像。\nDISTRO_VERSION=22.04  # https:\u002F\u002Fhub.docker.com\u002Fr\u002Fnvidia\u002Fcuda\u002Ftags\nCUDA_VERSION=11.8.0   # 必须与硬件及 CUDA 驱动程序兼容。\nCUDNN_VERSION=8       # 仅提供主要版本信息。\nPYTHON_VERSION=3.10   # 指定 Python 版本。\nMKL_MODE=include      # 为 Intel CPU 启用 MKL。\n\n# 高级用法。\nTARGET_STAGE=train    # 目标 Dockerfile 阶段。`*.whl` 文件可在 `train-builds` 中找到。\nADD_USER=include      # 是否创建新用户（include）或使用 `root` 用户（exclude）。\n```\n\n## 初次安装与配置后的常规使用\n\n1. 运行 `make build` 命令以从 Dockerfile 构建镜像并启动服务。\n   `make` 命令在 `Makefile` 中定义，默认针对 `train` 服务。\n   如果镜像已构建完毕，且无需重新构建，则可运行 `make up`。\n2. 运行 `make exec` 命令进入交互式容器环境。\n   建议在容器内使用 `tmux`。\n3. 没有第 3 步。直接开始编码即可。\n   如有任何问题，请查阅文档或提交问题。\n\n## Makefile 使用说明\n\nMakefile 包含常用 Docker Compose 命令的快捷方式。请阅读 Makefile 以了解具体命令。\n\n1. `make build` 会从 Dockerfile 构建 Docker 镜像，无论该镜像是否已存在。此操作会重新安装依赖包到更新后的 requirements 文件中，然后重新创建容器。\n2. `make up` 会基于镜像创建一个新的容器，撤销用户对容器所做的任何更改。允许修改容器设置，如网络端口、挂载卷、共享内存配置等。这是推荐的使用本项目的方式。\n3. `make exec` 会进入由 `make build` 或 `make up` 创建的容器的交互式终端。\n4. 
`make down` 会停止 Compose 容器并删除网络。这是服务拆解所必需的操作。\n5. `make start` 会在不重新创建容器的情况下重启已停止的容器。与 `make up` 类似，但不会删除当前容器。除非容器中保存的数据绝对必要，否则不建议使用此命令。\n6. `make ls` 会显示所有 Docker Compose 服务，包括正在运行和未运行的服务。\n7. `make run` 用于调试。容器在退出时会被移除。如果某个服务无法启动，请使用此命令查找错误。\n8. `make build-only` 会仅从 Dockerfile 构建 Docker 镜像，而不启动服务。此命令用于帮助将镜像发布到容器注册表。\n\n### 小贴士\n\n- Makefile 中的 `PROJECT`、`SERVICE` 和 `COMMAND` 变量会优先使用 `.env` 文件中指定的变量（如果存在）。\n- 如果遇到问题，首先尝试运行 `make down` 删除当前容器，然后运行 `make up` 从镜像创建新容器。当宿主机出现问题时，显式地销毁容器通常是必要的。\n- 如果在执行 `make up` 时服务启动卡住，请检查 `docker system df` 以确认宿主机是否有足够的磁盘空间。\n- `make up` 类似于重启计算机：当前容器会被移除，并基于当前镜像创建一个新容器。\n- `make build` 类似于重置或格式化计算机：如果当前镜像存在，则会被移除；然后根据 Dockerfile 重新构建镜像，并基于新镜像创建容器。相比之下，`make up` 只有在指定的镜像不存在时才会从源代码构建镜像。\n- `make exec` 类似于登录到计算机：它是最重要的命令，允许用户以交互方式访问容器的终端。\n- 已挂载的卷和网络端口等配置无法在运行中的容器中更改，必须创建新的容器才能生效。\n- Docker 默认会缓存所有构建内容，直到达到 `defaultKeepStorage` 的限制。默认情况下，构建会复用之前构建的缓存层，从而仅构建修改过的层，大大加快后续构建速度。\n- 如果在 `git clone` 时构建失败，请在网络连接稳定的情况下再次尝试 `make build`。\n- 如果在 `pip install` 时构建失败，请检查 PyPI 镜像地址和包的依赖要求。\n- 如果出现网络问题，请运行 `docker network ls` 检查是否存在冲突。大多数网络和 SSH 问题可以通过运行 `docker network prune` 来解决。\n\n## 项目概述\n\n项目的主要组成部分如下，其他文件均为辅助工具：\n\n1. Dockerfile\n2. docker-compose.yaml\n3. docker-compose.override.yaml\n4. reqs\u002F(`*requirements.txt`|`*environment.yaml`)\n5. .env\n\n当用户输入 `make up` 或其他 `make` 命令时，Makefile 中指定的命令会被执行。Makefile 用于定义快捷命令和变量。\n\n当执行与 Docker Compose 相关的命令（例如 `make build`）时，Docker Compose 会读取 `docker-compose.yaml` 和 `.env` 文件。`docker-compose.yaml` 文件指定了合理的默认值，但用户可以根据需要进行修改。`.env` 文件中指定的值会优先于 `docker-compose.yaml` 文件中的默认值。而 Shell 中设置的环境变量又会优先于 `.env` 文件中的设置。为了使不同用户和不同机器能够使用不同的配置，`.env` 文件被特意排除在版本控制之外。\n\n`docker-compose.yaml` 文件负责管理配置、构建、运行等操作，这些操作都基于 `Dockerfile` 进行。有关详细信息，请参阅 Docker Compose [规范](https:\u002F\u002Fgithub.com\u002Fcompose-spec\u002Fcompose-spec\u002Fblob\u002Fmaster\u002Fspec.md) 和 [参考文档](https:\u002F\u002Fdocs.docker.com\u002Fcompose\u002Fcompose-file\u002Fcompose-file-v3\u002F)。\n\n`docker-compose.override.yaml` 文件会在设置阶段被 `docker-compose.yaml` 文件读取。它可以添加特定于每个宿主机的配置，这些配置不应通过版本控制共享，例如针对宿主机特定路径的卷挂载。\n\n`Dockerfile` 被配置为仅读取 `reqs` 目录下的依赖文件。编辑 `reqs\u002Fpip-train.requirements.txt` 可以指定 Python 包的依赖要求；编辑 `reqs\u002Fapt-train.requirements.txt` 则可以指定 Ubuntu 包的依赖要求。如果需要在构建过程中从私有代码构建镜像，用户必须编辑 `.dockerignore` 文件，以便将其他文件复制到构建环境中。\n\n`Dockerfile` 使用 Docker BuildKit 和多阶段构建，通过阶段名称以及通过 `docker-compose.yaml` 传递的构建时环境变量来控制构建流程。有关 Docker BuildKit 的更多信息，请参阅其 [语法文档](https:\u002F\u002Fgithub.com\u002Fmoby\u002Fbuildkit\u002Fblob\u002Fmaster\u002Ffrontend\u002Fdockerfile\u002Fdocs\u002Fsyntax.md)。`docker-compose.yaml` 文件中指定的 `train` 服务使用了 `Dockerfile` 中定义的 `train` 阶段，该阶段假定使用 Ubuntu 镜像。\n\n## 存在的理由\n\n本节旨在介绍一种深度学习开发的新范式。\n我们希望 Cresset，或者至少其背后的理念，最终能成为中小型深度学习研究与开发的最佳实践。\n\n在深度学习社区中，使用 `conda` 或 `pip` 在本地环境中进行开发非常普遍。\n然而，这种方式容易导致开发环境及其运行代码无法复现。\n这种状况严重阻碍了科学进步，许多读者对此都深有体会。\n\nDocker 容器是跨不同计算环境提供可复现程序的标准方法。\n它们创建隔离的运行环境，使程序能够在不受宿主机或其他容器干扰的情况下运行。\n有关详细信息，请参阅\n[官方文档](https:\u002F\u002Fwww.docker.com\u002Fresources\u002Fwhat-container)。\n\n但在实际应用中，Docker 容器常常被误用。\n容器的设计初衷是短暂存在的，最佳实践建议每次运行时都创建一个新的容器。\n然而，这给开发带来了极大的不便，尤其是对于深度学习应用而言——需要不断安装新库，且许多 bug 只有在运行时才会显现。\n因此，许多研究人员选择在交互式容器中进行开发。\nDocker 用户通常会编写类似如下的 `run.sh` 脚本：\n`docker run -v my_data:\u002Fmnt\u002Fdata -p 8080:22 -t my_container my_image:latest \u002Fbin\u002Fbash`\n（是不是很熟悉？）然后通过 SSH 连接到正在运行的容器。\n甚至 VSCode 
还提供了远程开发模式，可以直接在容器内编写代码。\n\n这种方法的问题在于，这些交互式容器同样难以复现，与本地开发环境无异。\n运行中的容器无法连接新的端口或挂载新的\n[卷](https:\u002F\u002Fdocs.docker.com\u002Fstorage\u002Fvolumes)。\n如果容器内的计算环境是在数月的安装和构建过程中逐步搭建起来的，那么唯一保持它的方法就是将容器保存为镜像，并基于该镜像重新创建容器。\n经过几次这样的迭代后，生成的镜像会变得臃肿不堪，混乱程度丝毫不亚于原本要取代的本地环境。\n\n当准备部署时，这些问题会更加突出。\nMLOps 是一组旨在可靠高效地部署和维护机器学习模型的最佳实践，近年来因其重要性而广受欢迎——许多从业者意识到，在初始开发阶段结束后，持续维护机器学习系统至关重要。\n\n然而，上述不良实践导致大量精力被浪费在将研究代码转化为生产级产品上。\n往往连最初的开发者几个月后都无法重现相同的模型。\n为此，许多公司不得不组建专门的团队来负责模型转换，这无疑是一笔巨大的开销。\n\n为了解决这些问题，我们提出使用 Docker Compose 作为简单的 MLOps 解决方案。\n借助 Docker 和 Docker Compose，整个训练环境都可以被复现。\n目前，Docker Compose 在深度学习社区尚未普及，\n可能是因为它通常被宣传为多容器解决方案。\n但这是一种误解——它同样适用于单容器开发。\n\n我们提供了一个 `docker-compose.yaml` 文件，用于轻松管理容器。\n**使用提供的 `docker-compose.yaml` 文件将创建一个交互式环境，\n提供与在远程服务器上使用终端非常相似的编程体验。**\n同时，还支持与主流 IDE（PyCharm、VSCode）的集成。\n\n此外，它还允许用户分别指定构建和运行时的配置，\n从而无需再使用自定义的 Shell 脚本来管理环境。\n若需挂载新的卷或映射新的端口，只需删除当前容器，\n在 `docker-compose.yaml` 中添加一行配置，然后运行 `make up`\n即可基于同一镜像创建一个新的容器。\n\n构建缓存功能使得新镜像的构建速度极快，\n从而消除了 Docker 普及的另一大障碍——漫长的首次构建时间。\n更多关于 Compose 的信息，请访问\n[官方文档](https:\u002F\u002Fdocs.docker.com\u002Fcompose)。\n\nDocker [Compose](https:\u002F\u002Fwww.compose-spec.io) 也可以用于部署，\n这对于中小型部署场景尤为有用。\n一旦需要采用 Kubernetes 等容器编排工具进行大规模部署时，\n从一开始就使用可复现的 Docker 环境，将加速开发流程，\n并为全面采用 MLOps 打下坚实基础。\n通过简化开发流程来缩短上市时间，无论是一家初创公司还是科技巨头，\n都是极具竞争力的优势。\n\n希望本文提出的技术能够帮助深度学习社区实现“一次编写，随处训练”。\n即便大多数用户并未被这一方法所打动，\n至少也能让许多不幸的研究生免于陷入反复搭建 `conda` 环境的西西弗斯式劳动——好不容易配置好环境，却往往在论文提交前夕彻底崩溃。\n\n## Compose 作为最佳实践\n\n与为每个环境编写自定义 Shell 脚本相比，Docker Compose 具有明显优势。\n它不仅将构建和运行所需的所有变量和命令集中到一个文件中，\n而且与 Docker 原生集成，使得复杂的 Docker 构建和运行配置变得简单易行。\n\n以这种方式使用 Docker Compose 是一种通用技术，\n并不依赖于本项目的任何特定内容。\n项目中的其他服务也进一步强调了这一点。\n\n### 在 PyCharm 和 VSCode 中使用 Compose\n\nDocker Compose 容器环境不仅可以在终端中使用，还可以与流行的 Python IDE 配合使用。PyCharm 和 Visual Studio Code 这两款在深度学习社区中非常受欢迎的 IDE，都与 Docker Compose 兼容。\n\n#### PyCharm（仅限 Professional 版）\n\nDocker 和 Docker Compose 均可作为 Python 解释器原生使用。有关详细信息，请参阅 [Docker](https:\u002F\u002Fwww.jetbrains.com\u002Fhelp\u002Fpycharm\u002Fdocker.html) 和 [Compose](https:\u002F\u002Fwww.jetbrains.com\u002Fhelp\u002Fpycharm\u002Fusing-docker-compose-as-a-remote-interpreter.html#summary) 的教程。此外，JetBrains 的 [Gateway](https:\u002F\u002Fwww.jetbrains.com\u002Fremote-development\u002Fgateway) 也可以用于连接到正在运行的容器。\n\n当使用 `ngc` 服务时，可通过 GUI 将 `\u002Fusr\u002Flocal\u002Flib\u002Fpython3\u002Fdist-packages` 和 `\u002Fopt\u002Fconda\u002Flib\u002Fpython3\u002Fsite-packages` 添加到解释器搜索路径中，以启用对通过 `conda` 安装的包的代码补全支持。\n\n_注意_：拥有有效大学邮箱地址的用户可以免费使用 PyCharm Professional 及其他 JetBrains IDE。\n\n#### VSCode\n\n安装 Remote Development 扩展包。有关详细信息，请参阅 [教程](https:\u002F\u002Fcode.visualstudio.com\u002Fdocs\u002Fremote\u002Fcontainers-tutorial)。\n\n##### VSCode 使用技巧\n\n由于 `docker-compose.yaml` 文件中挂载了 `${HOME}\u002F.vscode-server` 卷，该卷用于在不同容器之间保留 `.vscode-server` 目录，因此在访问由 Cresset 创建的远程容器时，VSCode 可能无法启动。\n\nVSCode 连接失败的原因是：如果指定为卷的任何主机目录不存在，Docker 会自动创建该目录，并将目录的所有者设置为 `root`。而已经存在的目录则会保留其原有的所有权。当 `.vscode-server` 目录被 Docker 以这种方式创建后，VSCode 就无法在该目录中安装任何文件。\n\n这一问题已在 Makefile 中修复，但与 `.vscode-server` 目录相关的问题仍经常出现。要解决此问题，只需将目录的所有权更改为当前用户，执行命令 `sudo chown -R $(id -u):$(id -g) ${HOME}\u002F.vscode-server`。该命令既可以在宿主机上执行，也可以在容器内执行，这在宿主机没有 `sudo` 权限时非常有用。\n\n此外，当一个用户在同一台机器上切换多个基于 Cresset 的容器时，VSCode 可能无法找到容器的工作区。这是因为 `docker-compose.yaml` 文件会将宿主机的 `~\u002F.vscode-server` 目录挂载到所有容器的 `\u002Fhome\u002F${USR}\u002F.vscode-server` 目录下，以在容器之间保留 VSCode 扩展。为了解决这个问题，可以在宿主机上创建一个新的目录来挂载各个容器的 `.vscode-server` 目录。例如，可以为项目1设置卷映射 `${HOME}\u002F.vscode-project1:\u002Fhome\u002F${USR}\u002F.vscode-server`，为项目2设置 
`${HOME}\u002F.vscode-project2:\u002Fhome\u002F${USR}\u002F.vscode-server`。请务必先在宿主机上创建 `${HOME}\u002F.vscode-project1` 和 `${HOME}\u002F.vscode-project2`，否则这些目录的所有者将是 `root`，从而导致 VSCode 因权限问题而无限期卡住。\n\n对于其他 VSCode 问题，可以尝试删除宿主机上的 `~\u002F.vscode-server`。\n\n# 已知问题\n\n1. 通过 `ssh` 连接到正在运行的容器会清除所有由 `ENV` 设置的环境变量。这是因为 `sshd` 会启动一个新的环境，从而删除所有之前的变量。强烈建议使用 `docker` 或 `docker compose` 命令进入容器。\n\n2. 在终端中运行 `pip install package[option]` 会因 Z shell 的 globbing 功能而失败。诸如 `[`、`]`、`*` 等字符会被 Z shell 解释为特殊命令。为了跨 Shell 的一致性，请使用字符串字面量，例如 `pip install 'package[option]'`。\n\n3. 如果在 `git clone` 时构建失败，只需再次尝试运行 `make build`。大多数构建步骤会被缓存。失败很可能是由于安装过程中出现的网络问题所致。更新 Git 子模块并不是一个完全可靠的操作，具体可参考 [此解答](https:\u002F\u002Fstackoverflow.com\u002Fa\u002F8573310\u002F9289275)。\n\n4. 如果宿主机的 CUDA 驱动程序与 Docker 镜像中的 CUDA 版本不兼容，`torch.cuda.is_available()` 将返回 `... UserWarning: CUDA initialization:...` 错误，或者镜像根本无法启动。此时应升级宿主机的 CUDA 驱动程序，或降低镜像中的 CUDA 版本。请查阅 [兼容性矩阵](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-toolkit-release-notes\u002Findex.html#cuda-major-component-versions__table-cuda-toolkit-driver-versions)，确认宿主机的 CUDA 驱动程序是否与所需的 CUDA 版本兼容。同时，还需检查宿主机上的 CUDA 驱动程序是否已正确配置。可以通过 `nvidia-smi` 命令查看 CUDA 驱动程序的版本。\n\n5. 在 Linux 宿主机上，如果已安装的 Docker 引擎版本过低，Docker Compose V2 将静默失败。请将 Docker 更新至最新版本（23.0 及以上），以便使用 Docker Compose V2。\n\n6. 如果 `.env` 文件中将用户设置为 `root`，即 `UID=0, USR=root`，则应将 `ADD_USER=exclude` 设置为排除新建用户，因为预期的新用户不应为 root 用户。\n\n# 期望事项\n\n1. **更多星标**。_**没有感谢就没有贡献！**_\n\n2. 欢迎提交 bug 报告。目前仅对最新版本进行了严格测试。如果您发现有版本无法正常构建，请提出问题。但在提交之前，请确保您的宿主机上的 Docker、Docker Compose，尤其是 NVIDIA 驱动程序，均为最新版本。\n\n3. 欢迎提供其他语言的翻译以及现有翻译的更新。请创建单独的 `LANG.README.md` 文件并提交拉取请求。","# Cresset 快速上手指南\n\nCresset 是一个基于 Docker Compose 的 MLOps 系统，旨在为深度学习开发者提供可复现、易用的交互式开发环境。它支持从源码编译 PyTorch、管理复杂依赖，并适用于学术与工业界场景。\n\n## 环境准备\n\n在开始之前，请确保您的主机满足以下要求：\n\n*   **操作系统**：Linux (推荐 Ubuntu 22.04+)、Windows 11 (需安装 WSL2) 或 macOS。\n*   **GPU 驱动**：安装与目标 CUDA 版本兼容的 NVIDIA 显卡驱动。\n    *   *注意*：驱动版本不匹配是新用户最常见的问题，请参考 [NVIDIA 兼容性矩阵](https:\u002F\u002Fdocs.nvidia.com\u002Fcuda\u002Fcuda-toolkit-release-notes\u002Findex.html#cuda-major-component-versions__table-cuda-toolkit-driver-versions)。\n*   **Docker**：安装 Docker Engine (v23.0+) 或 Docker Desktop。\n    *   Windows 用户请使用 WSL2 后端，无需在 WSL 内单独安装 Linux 版 Docker 或 CUDA 驱动。\n    *   *性能提示*：Windows 用户若遇到显著卡顿，建议暂时禁用实时防病毒保护。\n*   **NVIDIA Container Toolkit**：必须安装以支持容器调用 GPU。\n    *   安装指南：[NVIDIA Container Toolkit Install Guide](https:\u002F\u002Fdocs.nvidia.com\u002Fdatacenter\u002Fcloud-native\u002Fcontainer-toolkit\u002Flatest\u002Finstall-guide.html)\n*   **权限说明**：如果您的机构禁止使用 root 权限运行 Docker，建议配置 [Rootless Docker](https:\u002F\u002Fdocs.docker.com\u002Fengine\u002Fsecurity\u002Frootless)。\n\n## 安装步骤\n\n请在项目根目录下执行以下命令完成初始化：\n\n1.  **安装 Docker Compose V2** (Linux 主机需要，WSL\u002FDocker Desktop 通常已内置)：\n    ```bash\n    make install-compose\n    ```\n    *注：此步骤不需要 root 权限。*\n\n2.  **生成环境变量文件**：\n    根据您的需求选择服务类型（`train`, `devel`, `ngc`, 或 `simple`）。\n    *   `train`: 默认选项，适合需要编译依赖或从源码构建 PyTorch 的场景。\n    *   `simple`: 适合无编译依赖的用户，通过 conda\u002Fpip 安装包，兼容性更好。\n    \n    执行以下命令生成 `.env` 文件（以 `train` 为例）：\n    ```bash\n    make env SERVICE=train\n    ```\n\n3.  **创建本地覆盖配置**：\n    生成 `docker-compose.override.yaml` 文件，用于配置本机特有的挂载卷（Volumes）等不被版本控制的设置：\n    ```bash\n    make over\n    ```\n\n4.  
**配置 `.env` 文件** (可选但推荐)：\n    编辑生成的 `.env` 文件，修改以下关键参数以适配您的环境：\n    *   `UID` \u002F `GID`: 设置为当前用户的 ID (Linux\u002FMac 下可通过 `id` 命令查看)，避免容器内文件权限问题。\n    *   `CCC`: 设置 GPU 的 Compute Capability (例如 RTX 3090 为 `8.6`)。若需从源码编译 PyTorch，此项必填。\n    *   `PROJECT`: 设置唯一的项目名称（必须小写）。\n    *   `SERVICE`: 确认与您第一步选择的服务一致。\n\n## 基本使用\n\n完成配置后，即可开始构建环境并进入开发：\n\n1.  **构建镜像并启动容器**：\n    默认针对 `.env` 中指定的 `SERVICE` 进行构建和启动：\n    ```bash\n    make build\n    ```\n    *首次构建可能需要较长时间，特别是当 `BUILD_MODE=include` 需要从源码编译 PyTorch 时。*\n\n2.  **进入交互式开发环境**：\n    构建完成后，运行以下命令进入容器终端：\n    ```bash\n    make exec\n    ```\n    默认将启动 `\u002Fusr\u002Fbin\u002Fzsh` (可在 `.env` 的 `COMMAND` 字段修改)。若容器尚未运行，可先执行 `make up` 基于已有镜像创建并启动容器。\n\n3.  **其他常用命令**：\n    *   停止并移除容器：`make down`\n    *   基于现有镜像重建全新容器：`make up`\n    *   查看所有 Compose 服务：`make ls`\n    *   调试启动失败的服务（容器退出即删除）：`make run`\n\n**提示**：所有 `make` 命令会自动读取 `.env` 文件中的配置。若需切换服务（例如从 `train` 切换到 `simple`），请修改 `.env` 中的 `SERVICE` 变量后重新执行 `make build`。","某高校计算机视觉实验室的研究团队需要在多台配置各异的服务器上，复现并改进一篇基于最新 PyTorch 版本的 SOTA 论文模型。\n\n### 没有 cresset 时\n- **环境依赖地狱**：不同服务器的 CUDA、cuDNN 和 PyTorch 版本冲突频发，研究员花费数天时间手动排查驱动兼容性，而非投入算法研究。\n- **复现成本高昂**：新加入的博士生需在一台新机器上重复繁琐的配置步骤，常因漏装 NVIDIA Container Toolkit 或 Docker 版本不匹配导致训练无法启动。\n- **开发体验割裂**：本地调试环境与服务器生产环境不一致，代码在本地运行正常，上传后却因缺少特定系统库而报错，排查过程极其低效。\n- **协作标准缺失**：团队成员各自维护一套安装脚本，缺乏统一的 MLOps 规范，导致项目交接时经常出现“在我机器上是好的”这类推诿现象。\n\n### 使用 cresset 后\n- **一键统一环境**：通过 `make env` 与 `make up` 等命令即可在任何主机上构建适配当前 GPU 驱动的 Docker 容器环境，彻底屏蔽底层 CUDA 版本差异，实现“一次定义，处处运行”。\n- **极速上手开发**：新成员只需克隆仓库并执行简单指令，5 分钟内即可获得包含完整依赖的交互式开发环境，将配置时间从几天压缩至几分钟。\n- **无缝交互调试**：基于 Docker Compose 构建的统一环境确保了本地与服务器的一致性，研究员可直接在容器中利用 VS Code 远程调试，消除环境差异带来的 Bug。\n- **标准化最佳实践**：cresset 内置了 pre-commit 钩子和标准化的目录结构，强制团队遵循统一的开发规范，显著提升了代码质量和团队协作效率。\n\ncresset 通过将复杂的深度学习环境配置封装为标准化的模板，让研究人员从繁琐的运维工作中解放出来，真正专注于模型创新本身。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fcresset-template_cresset_35e5b2b8.png","cresset-template","Cresset","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fcresset-template_953dd898.png","The One Template to Train Them All",null,"veritas9872@gmail.com","https:\u002F\u002Fgithub.com\u002Fcresset-template",[84,88,92],{"name":85,"color":86,"percentage":87},"Dockerfile","#384d54",78.6,{"name":89,"color":90,"percentage":91},"Python","#3572A5",11.2,{"name":93,"color":94,"percentage":95},"Makefile","#427819",10.2,721,42,"2026-03-25T12:31:40","MIT","Linux, Windows (WSL2), macOS","需要 NVIDIA GPU（用于 train\u002Fdevel\u002Fngc 服务），需安装匹配的 NVIDIA CUDA Driver 和 NVIDIA Container Toolkit。具体显存大小未说明，但需根据目标 GPU 的 Compute Capability (CCC) 进行配置（例如 RTX3090 对应 CCC=8.6）。CUDA 版本示例为 11.8.0，需与驱动兼容。simple 服务可选不依赖编译的 GPU 环境。","未说明",{"notes":104,"python":105,"dependencies":106},"1. 核心基于 Docker Compose 构建 MLOps 环境，推荐使用 Docker Desktop (Windows\u002FMac) 或 Docker Engine (Linux)。\n2. Windows 用户建议使用 WSL2，无需在 WSL 内单独安装 CUDA 驱动，但需禁用 Windows 安全实时防护以获得最佳性能。\n3. 无 Docker root 权限的用户可使用 Rootless Docker 模式，并在 .env 中设置 ADD_USER=exclude。\n4. 多进程数据处理需在 docker-compose.yaml 中启用 'ipc: host'（WSL 除外，WSL 需指定 shm_size）。\n5. 
","3.10 (可通过 .env 文件配置)",[107,108,109,110,111,112],"PyTorch (可源码编译或使用 NGC 镜像)","torchvision","Docker Compose V2","NVIDIA Container Toolkit","conda (simple 服务默认)","apt 包 (基础系统依赖)",[13],[115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131],"pytorch","docker","python","deep-learning","wheel","source","source-python","deep-learning-tutorial","build","cuda","docker-compose","makefile","template","template-repository","mlops","machine-learning","mlops-template","2026-03-27T02:49:30.150509","2026-04-06T09:25:37.970650",[135,140,145,150,154,159],{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},18718,"构建过程中出现链接器错误或编译失败怎么办？","尝试在构建开始前添加以下命令来修复链接器路径问题：`RUN ln -sf \u002Fopt\u002Fconda\u002Fcompiler_compat\u002Fld \u002Fusr\u002Fbin\u002Fld`。此外，确保将 conda 目录添加到构建镜像的 PATH 变量末尾，这通常能解决大部分构建卡顿或失败的问题。","https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fissues\u002F13",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},18719,"为什么在容器内运行 `torch.cuda.is_available()` 返回 False 或报错 'forward compatibility was attempted on non supported HW'？","这通常是因为构建时使用了 `+PTX` 标志（例如 `8.6+PTX`）导致的兼容性问题，或者是宿主机上安装了多个版本的 CUDA 驱动造成冲突。建议检查并清理宿主机的 CUDA 驱动，确保只保留一个稳定版本。如果问题依旧，尝试移除构建命令中的 `+PTX` 后缀重新构建镜像。","https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fissues\u002F17",{"id":146,"question_zh":147,"answer_zh":148,"source_url":149},18720,"执行 `git clone` 时提示 `unknown option '--jobs'` 错误如何解决？","该错误通常表明当前环境中的 Git 版本过旧，不支持 `--jobs` 参数，或者基础镜像环境（如 Ubuntu 16.04）已过时导致依赖缺失。解决方案是将 PyTorch 和相关库的版本降低（例如降至 torch 1.8.0 和 torchvision 0.9.0），或者升级基础操作系统镜像以支持更新的 Git 版本和依赖库（如 NCCL）。","https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fissues\u002F6",{"id":151,"question_zh":152,"answer_zh":153,"source_url":149},18721,"导入 torch 时遇到 `ImportError: libnccl.so.2: cannot open shared object file` 错误？","此错误表明系统中缺少 NCCL 库。这通常发生在将 Docker 镜像中的 wheel 文件提取到本地 Conda 环境使用时。解决方法是确保在目标环境中正确安装了与 PyTorch 版本匹配的 NCCL 库，或者直接在配置正确的 Docker 容器内运行代码，而不是提取到本地环境。如果是旧版 Ubuntu（如 16.04），建议降低 PyTorch 版本或升级系统。",{"id":155,"question_zh":156,"answer_zh":157,"source_url":158},18722,"如何将此 Docker 模板最小化地集成到现有的项目中？","如果将项目文件放在独立目录中，建议修改 `docker-compose.yaml` 中的挂载点。将 `\u002Fopt\u002Fproject` 的挂载源从当前工作目录 `.` 改为相对于 compose 文件的项目根目录（例如 `..\u002F..`），以确保容器内能正确访问项目文件。此外，如果原项目使用 conda 环境文件但缺乏系统依赖细节，建议将依赖转换为 PyPI 包并通过 pip 安装，以简化集成过程。","https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fissues\u002F36",{"id":160,"question_zh":161,"answer_zh":162,"source_url":163},18723,"构建出的 Docker 镜像体积过大（例如超过 700GB）是否正常？","不正常，这通常是由于构建缓存未清理或包含了不必要的调试符号和中间文件。虽然提供的 Issue 详情被截断，但此类问题通常可以通过在 Dockerfile 中使用多阶段构建（multi-stage build），仅复制最终需要的二进制文件和库到最终镜像来解决。另外，确保在构建后运行 `docker system prune` 清理悬空镜像和构建缓存。
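\n\n可先用以下通用 Docker 命令定位体积来源（示意；`IMAGE` 为占位符，请替换为实际镜像名）：\n```bash\n# 列出本地镜像及其体积\ndocker image ls\n# 逐层查看镜像构建历史，定位体积膨胀的步骤\ndocker history --no-trunc IMAGE\n# 交互式清理悬空镜像与构建缓存\ndocker system prune\n```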
","https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fissues\u002F7",[165,170,175,180,185,190,195,200,205,210,215,220,225,230,235,240,245,250,255,260],{"id":166,"version":167,"summary_zh":168,"released_at":169},109196,"v0.10.1","因安装失败移除了 `brew`。\n将 `pure` 当前目录的颜色从默认的深蓝色改为青色，因为深蓝色难以辨认。","2024-02-03T00:37:33",{"id":171,"version":172,"summary_zh":173,"released_at":174},109197,"v0.10.0","在镜像中添加了 Brew。\n移除了 `hub` 服务，因其功能与 `simple` 类似。\n将 `CCA` 重命名为 `CCC`。","2023-12-11T03:25:14",{"id":176,"version":177,"summary_zh":178,"released_at":179},109198,"v0.9.1","更新 `ruff` 版本。此举仅是为了创建一个启用 Zenodo 的新发布版本。","2023-05-15T22:45:21",{"id":181,"version":182,"summary_zh":183,"released_at":184},109199,"v0.9.0","用于 Zenodo 发布的版本。\n更新 PyTorch 的编译配置，以使用现代 CMake 和 `nvcc` 的修复补丁。\n更新文档，修正拼写错误并使其描述更加详尽。","2023-05-15T22:41:51",{"id":186,"version":187,"summary_zh":188,"released_at":189},109200,"v0.8.4","非交互式镜像现已提供对共享 `.zshrc` 配置的基本支持。  \r\n不过，它们不再强制依赖 `zsh`，用户可以自由切换 shell 等。  \r\n这一点在非交互式镜像下载后以交互方式使用时会非常有用，例如在交互式的 Kubernetes 环境中，或作为可复现的镜像上传到 Docker Hub 等场景。","2023-04-27T10:09:46",{"id":191,"version":192,"summary_zh":193,"released_at":194},109201,"v0.8.3","优化 `docker-compose.yaml` 文件以提高可读性，并移除 `HOST_NAME` 变量。  \n此外，还进行了大量错误修复和用户体验优化。","2023-04-20T03:39:40",{"id":196,"version":197,"summary_zh":198,"released_at":199},109202,"v0.8.2","除了众多更新和错误修复外，最大的变化是 `Makefile` 现在会读取 `.env` 文件，不再从宿主机的 shell 中获取任何变量，除非在调用 `make` 命令时显式指定。此外，我还添加了许多我认为合理的 `ruff` 规则。","2023-04-11T10:42:32",{"id":201,"version":202,"summary_zh":203,"released_at":204},109203,"v0.8.1","`simple` 服务现在可以从 `conda-lock` 文件中安装环境。","2023-04-07T10:36:04",{"id":206,"version":207,"summary_zh":208,"released_at":209},109204,"v0.8.0","这是一个规模非常庞大的版本更新，新增了众多服务。\n此前，Cresset 主要聚焦于可定制性，为此牺牲了配置的简洁性，导致配置过程极为复杂。此外，它也一直不支持 Kubernetes。\n如今，Cresset 的重心已转向提供简化的用户界面，而非一味追求模板的可定制性。同时，为了减少主 Dockerfile 中的冗余内容，原有的 `deploy` 阶段已被移除。尽管为了实现最大程度的定制化仍推荐使用原始的 Dockerfile，但对于那些无需自定义编译、仅需一个易于搭建的开发与训练环境的用户来说，可以选用其他新提供的服务。\n此外，新增的 `build-only` Makefile 命令以及 `INTERACTIVE_MODE=exclude` 选项，使用户能够更便捷地构建用于分发的镜像。最后，通过设置 `simple` 服务，现已支持使用 `conda-lock`——这是 Conda 中复现性要求最为严格的方式——以满足对复现性有最高要求的用户需求。而且，由于 `simple` 服务仅由官方或经过验证的镜像组成，也有助于缓解安全性方面的顾虑。","2023-04-07T03:35:02",{"id":211,"version":212,"summary_zh":213,"released_at":214},109205,"v0.7.0","为支持 PyTorch 2.0，引入了破坏向后兼容性的更改。\n目前已知的一个问题是在 Ubuntu 22.04 的 `deploy` 阶段，使用 `pip` 安装无法正常工作。\n该问题将尽快修复。","2023-03-20T02:13:58",{"id":216,"version":217,"summary_zh":218,"released_at":219},109206,"v0.6.3","This is a patch release to save all the work before updating to PyTorch 2.x, which has several breaking changes, most notably in the build-time dependencies.","2023-03-16T08:36:37",{"id":221,"version":222,"summary_zh":223,"released_at":224},109207,"v0.6.2","Replacing `flake8` and `isort` configurations and pre-commits with `ruff`, which is much faster and more modern.","2023-02-27T05:06:55",{"id":226,"version":227,"summary_zh":228,"released_at":229},109208,"v0.6.1","## What's Changed\r\n* Dev\u002Fadd utils by @veritas9872 in https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fpull\u002F65\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fcompare\u002Fv0.6.0...v0.6.1","2023-02-26T08:04:38",{"id":231,"version":232,"summary_zh":233,"released_at":234},109209,"v0.6.0","## What's Changed\r\n* Update conda installation method and conda manager. by @veritas9872 in https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fpull\u002F51\r\n* Update\u002Fbase by @veritas9872 in https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fpull\u002F52\r\n* Dev\u002Fmake by @veritas9872 in https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fpull\u002F56\r\n* Fix default names. by @veritas9872 in https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fpull\u002F60\r\n* Automation updates. 
by @veritas9872 in https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fpull\u002F64\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fcresset-template\u002Fcresset\u002Fcompare\u002Fv0.5.0...v0.6.0","2023-02-26T04:11:42",{"id":236,"version":237,"summary_zh":238,"released_at":239},109210,"v0.5.0","Changed the training environment so that `conda` is now the preferred package manager instead of just the virtual environment.\r\nFixed many bugs and issues to make the project easier to use.","2023-01-04T16:02:16",{"id":241,"version":242,"summary_zh":243,"released_at":244},109211,"v0.4.1","Separate out installing PyTorch and related libraries.\r\nRename download-only stages as `fetch` for better contrast with `build` stages.\r\nUpdates in the documentation.","2022-09-12T13:40:47",{"id":246,"version":247,"summary_zh":248,"released_at":249},109212,"v0.4.0","Remove legacy features and clean up the documentation.\r\nTest out Ubuntu 22.04 LTS and Python 3.10.","2022-08-30T15:18:15",{"id":251,"version":252,"summary_zh":253,"released_at":254},109213,"v0.3.0","Support for external files for installation. The Nsight Systems Debian package cannot be installed via the command line. An external .deb file is therefore used.\r\nAlso, Git Large File Storage (LFS) is used to prevent the .git directory from bloating.\r\nThe build guidelines have also changed in order to make the meaning clearer (I think). CCA is no longer mandatory and the .env file is guarded by a separate recipe.\r\nPython package installation is now fully parallel with apt package installation.\r\nFinally figured out how to get separate volume paths for different hosts. Using docker-compose.override.yaml was decided to be the best approach.\r\nDocumentation still needs work.","2022-06-01T02:27:21",{"id":256,"version":257,"summary_zh":258,"released_at":259},109214,"v0.2.2","Add shell script for Docker Compose installation.\r\nFix the $HOME\u002F~ bug in compose directory.\r\nGeneral code cleanup.","2022-04-30T05:26:02",{"id":261,"version":262,"summary_zh":263,"released_at":264},109215,"v0.2.1","This is a hack release forced by the sudden failure of URLs in TorchAudio and the Kakao mirror for PyPI.\r\nAlthough both have simple fixes, these issues mean that users must become more involved with the details.\r\nTo ensure that the build does not fail for new users, TorchAudio source builds and using PyPI mirrors have been temporarily disabled.\r\nThese functionalities will be restored ASAP.","2022-04-03T03:17:31"]