[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-karpathy--nanochat":3,"tool-karpathy--nanochat":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 
- **ragflow** (infiniflow/ragflow, ★77,062): A leading open-source retrieval-augmented generation (RAG) engine that builds a more accurate, reliable context layer for large language models, combining state-of-the-art RAG with agent capabilities. By deeply parsing complex document structure (tables, charts, mixed layouts), it raises retrieval accuracy and curbs hallucination, while its built-in agent mechanism lets the system plan multi-step solutions rather than merely answer questions. A visual workflow editor and flexible APIs serve both non-specialists and developers who need deep customization; the project is licensed under Apache 2.0.

---

# nanochat

**Repository:** karpathy/nanochat · *"The best ChatGPT that $100 can buy."*

**Overview.** nanochat is a minimal experimental framework that aims to make LLM training accessible end to end, covering tokenization, pretraining, finetuning, evaluation, inference, and a chat UI. It suits developers, researchers, and anyone who wants to understand how LLMs are built: the code is compact and easy to modify, a single GPU node is all it needs, and dependencies are managed with uv. Its signature feature is one-dial hyperparameter selection: set only the number of layers (`--depth`) and everything else (model width, learning rate, and so on) is computed automatically. A public training-speed leaderboard encourages the community to keep pushing the efficiency frontier.

---

![nanochat logo](https://oss.gittoolsai.com/images/karpathy_nanochat_readme_8d3febb3e9ab.png)
![scaling laws](https://oss.gittoolsai.com/images/karpathy_nanochat_readme_c48097ed0b3d.png)

nanochat is the simplest experimental harness for training LLMs. It is designed to run on a single GPU node, the code is minimal and hackable, and it covers all major LLM stages including tokenization, pretraining, finetuning, evaluation, inference, and a chat UI. For example, you can train your own GPT-2 capability LLM (which cost ~$43,000 to train in 2019) for only $48 (~2 hours on an 8XH100 GPU node) and then talk to it in a familiar ChatGPT-like web UI. On a spot instance, the total cost can be closer to ~$15. More generally, nanochat is configured out of the box to train an entire miniseries of compute-optimal models by setting one single complexity dial: `--depth`, the number of layers in the GPT transformer model (GPT-2 capability happens to be approximately depth 26). All other hyperparameters (the width of the transformer, number of heads, learning rate adjustments, training horizons, weight decays, ...) are calculated automatically in an optimal way.
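To make the "one dial" idea concrete, here is an editorial sketch. The specific scaling rules below are illustrative assumptions, not nanochat's actual formulas (those live in the repo):

```python
# Hypothetical illustration of deriving a config from the single --depth dial.
# These particular rules are assumptions for exposition, NOT nanochat's own.
def derive_config(depth: int) -> dict:
    model_dim = depth * 64               # width tied to depth by a fixed aspect ratio
    n_heads = max(1, model_dim // 128)   # keep the per-head dimension roughly constant
    lr = 0.02 * (768 / model_dim)        # shrink the learning rate as the model widens
    return {"depth": depth, "model_dim": model_dim, "n_heads": n_heads, "lr": round(lr, 5)}

print(derive_config(12))  # quick-experiment scale (GPT-1 sized)
print(derive_config(26))  # roughly GPT-2 capability, per the README
```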
For questions about the repo, I recommend using [DeepWiki](https://deepwiki.com/karpathy/nanochat) from Devin/Cognition to ask questions about the repo, the [Discussions tab](https://github.com/karpathy/nanochat/discussions), or the [#nanochat](https://discord.com/channels/1020383067459821711/1427295580895314031) channel on Discord.

## Time-to-GPT-2 Leaderboard

Presently, the main focus of development is on tuning the pretraining stage, which takes the most compute. Inspired by the modded-nanogpt repo, and to incentivise progress and community collaboration, nanochat maintains a leaderboard for a "GPT-2 speedrun": the wall-clock time required to train a nanochat model to GPT-2 grade capability, as measured by the DCLM CORE score. The [runs/speedrun.sh](runs/speedrun.sh) script always reflects the reference way to train a GPT-2 grade model and talk to it. The current leaderboard looks as follows:

| # | time (hours) | val_bpb | CORE | Description | Date | Commit | Contributors |
|---|--------------|---------|------|-------------|------|--------|--------------|
| 0 | 168 | - | 0.2565 | Original OpenAI GPT-2 checkpoint | 2019 | - | OpenAI |
| 1 | 3.04 | 0.74833 | 0.2585 | d24 baseline, slightly overtrained | Jan 29 2026 | 348fbb3 | @karpathy |
| 2 | 2.91 | 0.74504 | 0.2578 | d26 slightly undertrained **+fp8** | Feb 2 2026 | a67eba3 | @karpathy |
| 3 | 2.76 | 0.74645 | 0.2602 | bump total batch size to 1M tokens | Feb 5 2026 | 2c062aa | @karpathy |
| 4 | 2.02 | 0.71854 | 0.2571 | change dataset to NVIDIA ClimbMix | Mar 4 2026 | 324e69c | @ddudek @karpathy |
| 5 | 1.80 | 0.71808 | 0.2690 | autoresearch [round 1](https://x.com/karpathy/status/2031135152349524125) | Mar 9 2026 | 6ed7d1d | @karpathy |
| 6 | 1.65 | 0.71800 | 0.2626 | autoresearch round 2 | Mar 14 2026 | a825e63 | @karpathy |

The primary metric we care about is "time to GPT-2": the wall-clock time needed to outperform the GPT-2 (1.6B) CORE metric on an 8XH100 GPU node. The GPT-2 CORE score is 0.256525. In 2019, training GPT-2 cost approximately $43,000, so it is remarkable that, thanks to many advances across the stack over the past seven years, we can now do the same much faster and for well below $100 (e.g. at the current ~$3/GPU/hr, an 8XH100 node is ~$24/hr, so 2 hours is ~$48).

See [dev/LEADERBOARD.md](dev/LEADERBOARD.md) for more docs on how to interpret and contribute to the leaderboard.
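The cost claim is easy to verify from the numbers above; a quick worked check (the $3/GPU/hr rate is the README's own ballpark, not a live price):

```python
# Cost of a speedrun at the README's quoted rate of ~$3/GPU/hr on 8 GPUs.
gpu_per_hr = 3.00
node_per_hr = 8 * gpu_per_hr              # 8xH100 node: ~$24/hr
for hours in (3.04, 2.02, 1.65):          # wall-clock times from the leaderboard
    print(f"{hours:5.2f} h  ->  ${hours * node_per_hr:6.2f}")
# ~2 h lands near the README's ~$48; spot pricing can bring this closer to ~$15.
```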
## Getting started

### Setup

nanochat uses [uv](https://docs.astral.sh/uv/) for dependency management. To install:

```bash
uv sync --extra gpu    # Use for CUDA (A100/H100/etc.)
uv sync --extra cpu    # (or) Use for CPU-only / MPS
source .venv/bin/activate
```

For development (adds pytest, matplotlib, ipykernel, transformers, etc.):

```bash
uv sync --extra gpu --group dev
```

### Reproduce and talk to GPT-2

The most fun you can have is to train your own GPT-2 and talk to it. The entire pipeline is contained in the single file [runs/speedrun.sh](runs/speedrun.sh), which is designed to be run on an 8XH100 GPU node. Boot up a new 8XH100 GPU box from your favorite provider (e.g. I use and like [Lambda](https://lambda.ai/service/gpu-cloud)), and kick off the training script:

```bash
bash runs/speedrun.sh
```

You may wish to do so in a screen session, as this will take ~3 hours to run. Once it's done, you can talk to the model via the ChatGPT-like web UI. Make sure your local uv virtual environment is still active (run `source .venv/bin/activate`), and serve it:

```bash
python -m scripts.chat_web
```

Then visit the URL shown. Make sure to access it correctly: on Lambda, for example, use the public IP of the node you're on, followed by the port, e.g. [http://209.20.xxx.xxx:8000/](http://209.20.xxx.xxx:8000/). Then talk to your LLM as you'd normally talk to ChatGPT! Get it to write stories or poems. Ask it to tell you who you are to see a hallucination. Ask it why the sky is blue. Or why it's green. The speedrun produces a 4e19 FLOPs capability model, so it's a bit like talking to a kindergartener :).

---

<img width="2672" height="1520" alt="image" src="https://oss.gittoolsai.com/images/karpathy_nanochat_readme_f30f2428ae3c.png" />

---

A few more notes:

- The code will run just fine on an Ampere 8XA100 GPU node as well, just a bit slower.
- All code will run just fine on even a single GPU by omitting `torchrun`, and will produce ~identical results (the code automatically switches to gradient accumulation), but you'll have to wait 8 times longer (see the sketch after this list).
- If your GPU(s) have less than 80GB, you'll have to tune some of the hyperparameters or you will OOM / run out of VRAM. Look for `--device-batch-size` in the scripts and reduce it until things fit, e.g. from 32 (default) to 16, 8, 4, 2, or even 1. Below that, you'll need to know a bit more about what you're doing and get more creative.
- Most of the code is fairly vanilla PyTorch, so it should run on anything that supports that (xpu, mps, etc.), but I haven't personally exercised all of these code paths, so there might be sharp edges.
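Putting the single-GPU and VRAM notes above together, a minimal sketch of such an invocation (the exact flag spelling and a workable batch size depend on your GPU; `--depth=12` is just a small example):

```bash
# Single GPU: drop torchrun (gradient accumulation kicks in automatically)
# and shrink the per-device batch size until the model fits in VRAM.
python -m scripts.base_train --depth=12 --device-batch-size=8
```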
## Research

If you are a researcher and wish to help improve nanochat, two scripts of interest are [runs/scaling_laws.sh](runs/scaling_laws.sh) and [runs/miniseries.sh](runs/miniseries.sh). See [Jan 7 miniseries v1](https://github.com/karpathy/nanochat/discussions/420) for related documentation. For quick experimentation (~5 min pretraining runs), my favorite scale is a 12-layer model (GPT-1 sized), e.g. like this:

```bash
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=12 \
    --run="d12" \
    --model-tag="d12" \
    --core-metric-every=999999 \
    --sample-every=-1 \
    --save-every=-1
```

This uses wandb (run name "d12"), only runs the CORE metric on the last step, and doesn't sample or save intermediate checkpoints. I like to change something in the code, re-run a d12 (or a d16, etc.) and see if it helped, in an iteration loop. To see if a run helps, I like to monitor the wandb plots for:

1. `val_bpb` (validation loss in vocab-size-invariant units of bits per byte) as a function of `step`, `total_training_time`, and `total_training_flops`.
2. `core_metric` (the DCLM CORE score)
3. VRAM utilization, `train/mfu` (model FLOPS utilization), and `train/tok_per_sec` (training throughput)

See an example [here](https://github.com/karpathy/nanochat/pull/498#issuecomment-3850720044).

The important thing to note is that nanochat is written and configured around one single dial of complexity: the depth of the transformer. This single integer automatically determines all other hyperparameters (the width of the transformer, number of heads, learning rate adjustments, training horizons, weight decays, ...) so that the trained model comes out compute-optimal. The idea is that the user doesn't have to think about or set any of this; they simply ask for a smaller or bigger model using `--depth`, and everything "just works". By sweeping the depth, you obtain the nanochat miniseries of compute-optimal models at various sizes. The GPT-2 capability model (of most interest at the moment) happens to land somewhere in the d24-d26 range with the current code. But any candidate change to the repo has to be principled enough that it works for all settings of depth.

## Running on CPU / MPS

The script [runs/runcpu.sh](runs/runcpu.sh) shows a very simple example of running on CPU or Apple Silicon. It dramatically shrinks the LLM being trained so that training fits into a reasonable interval of a few tens of minutes. You will not get strong results this way.

## Precision / dtype

nanochat does not use `torch.amp.autocast`. Instead, precision is managed explicitly through a single global `COMPUTE_DTYPE` (defined in `nanochat/common.py`). By default this is auto-detected based on your hardware:

| Hardware | Default dtype | Why |
|----------|--------------|-----|
| CUDA SM 80+ (A100, H100, ...) | `bfloat16` | Native bf16 tensor cores |
| CUDA SM < 80 (V100, T4, ...) | `float32` | No bf16; fp16 available via `NANOCHAT_DTYPE=float16` (uses GradScaler) |
| CPU / MPS | `float32` | No reduced-precision tensor cores |

You can override the default with the `NANOCHAT_DTYPE` environment variable:

```bash
NANOCHAT_DTYPE=float32 python -m scripts.chat_cli -p "hello"   # force fp32
NANOCHAT_DTYPE=bfloat16 torchrun --nproc_per_node=8 -m scripts.base_train  # force bf16
```

How it works: model weights are stored in fp32 (for optimizer precision), but our custom `Linear` layer casts them to `COMPUTE_DTYPE` during the forward pass. Embeddings are stored directly in `COMPUTE_DTYPE` to save memory. This gives us the same mixed-precision benefit as autocast but with full explicit control over what runs in which precision.

Note: `float16` training automatically enables a `GradScaler` in `base_train.py` to prevent gradient underflow. SFT supports this too, but RL currently does not. Inference in fp16 works fine everywhere.
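A minimal self-contained sketch of the cast-at-use pattern described above (illustrative only, not nanochat's actual `Linear`):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Pick a compute dtype along the lines of the table above (simplified).
COMPUTE_DTYPE = torch.bfloat16 if torch.cuda.is_available() else torch.float32

class CastLinear(nn.Module):
    """fp32 master weights, cast to the compute dtype on every forward pass."""
    def __init__(self, n_in: int, n_out: int):
        super().__init__()
        # Parameter stays fp32 so the optimizer accumulates in full precision.
        self.weight = nn.Parameter(torch.randn(n_out, n_in) * n_in**-0.5)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight.to(COMPUTE_DTYPE)          # explicit cast, instead of autocast
        return F.linear(x.to(COMPUTE_DTYPE), w)

layer = CastLinear(768, 768)
print(layer(torch.randn(2, 768)).dtype)  # -> COMPUTE_DTYPE
```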
## Guides

I've published a number of guides that might contain helpful information, most recent first:

- [Feb 1 2026: Beating GPT-2 for <<$100: the nanochat journey](https://github.com/karpathy/nanochat/discussions/481)
- [Jan 7 miniseries v1](https://github.com/karpathy/nanochat/discussions/420) documents the first nanochat miniseries of models.
- To add new abilities to nanochat, see [Guide: counting r in strawberry (and how to add abilities generally)](https://github.com/karpathy/nanochat/discussions/164).
- To customize your nanochat, see [Guide: infusing identity to your nanochat](https://github.com/karpathy/nanochat/discussions/139) in Discussions, which describes how you can tune your nanochat's personality through synthetic data generation and mixing that data into the SFT stage.
- [Oct 13 2025: original nanochat post](https://github.com/karpathy/nanochat/discussions/1) introducing nanochat, though it now contains some deprecated information and the model is a lot older (with worse results) than current master.

## File structure

```
.
├── LICENSE
├── README.md
├── dev
│   ├── gen_synthetic_data.py       # Example synthetic data for identity
│   ├── generate_logo.html
│   ├── nanochat.png
│   └── repackage_data_reference.py # Pretraining data shard generation
├── nanochat
│   ├── __init__.py                 # empty
│   ├── checkpoint_manager.py       # Save/Load model checkpoints
│   ├── common.py                   # Misc small utilities, quality of life
│   ├── core_eval.py                # Evaluates base model CORE score (DCLM paper)
│   ├── dataloader.py               # Tokenizing Distributed Data Loader
│   ├── dataset.py                  # Download/read utils for pretraining data
│   ├── engine.py                   # Efficient model inference with KV Cache
│   ├── execution.py                # Allows the LLM to execute Python code as tool
│   ├── gpt.py                      # The GPT nn.Module Transformer
│   ├── logo.svg
│   ├── loss_eval.py                # Evaluate bits per byte (instead of loss)
│   ├── optim.py                    # AdamW + Muon optimizer, 1GPU and distributed
│   ├── report.py                   # Utilities for writing the nanochat Report
│   ├── tokenizer.py                # BPE Tokenizer wrapper in style of GPT-4
│   └── ui.html                     # HTML/CSS/JS for nanochat frontend
├── pyproject.toml
├── runs
│   ├── miniseries.sh               # Miniseries training script
│   ├── runcpu.sh                   # Small example of how to run on CPU/MPS
│   ├── scaling_laws.sh             # Scaling laws experiments
│   └── speedrun.sh                 # Train the ~$100 nanochat d20
├── scripts
│   ├── base_eval.py                # Base model: CORE score, bits per byte, samples
│   ├── base_train.py               # Base model: train
│   ├── chat_cli.py                 # Chat model: talk to over CLI
│   ├── chat_eval.py                # Chat model: eval tasks
│   ├── chat_rl.py                  # Chat model: reinforcement learning
│   ├── chat_sft.py                 # Chat model: train SFT
│   ├── chat_web.py                 # Chat model: talk to over WebUI
│   ├── tok_eval.py                 # Tokenizer: evaluate compression rate
│   └── tok_train.py                # Tokenizer: train it
├── tasks
│   ├── arc.py                      # Multiple choice science questions
│   ├── common.py                   # TaskMixture | TaskSequence
│   ├── customjson.py               # Make Task from arbitrary jsonl convos
│   ├── gsm8k.py                    # 8K Grade School Math questions
│   ├── humaneval.py                # Misnomer; Simple Python coding task
│   ├── mmlu.py                     # Multiple choice questions, broad topics
│   ├── smoltalk.py                 # Conglomerate dataset of SmolTalk from HF
│   └── spellingbee.py              # Task teaching model to spell/count letters
├── tests
│   └── test_engine.py
└── uv.lock
```
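The tree lists two chat front-ends; both commands already appear earlier in this README (the web UI under "Reproduce and talk to GPT-2", the CLI in the dtype examples), shown side by side here for reference:

```bash
python -m scripts.chat_web             # ChatGPT-like web UI; open the printed URL
python -m scripts.chat_cli -p "hello"  # same model over the command line
```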
## Contributing

The goal of nanochat is to improve the state of the art in micro models that are accessible to work with end to end on budgets under $1,000. Accessibility is about overall cost but also about cognitive complexity: nanochat is not an exhaustively configurable LLM "framework"; there are no giant configuration objects, model factories, or if-then-else monsters in the code base. It is a single, cohesive, minimal, readable, hackable, maximally-forkable "strong baseline" codebase designed to run start to end and produce a ChatGPT model you can talk to. The part I personally find most interesting right now is shrinking the time to GPT-2 (i.e. getting a CORE score above 0.256525). Currently this takes ~3 hours, but by improving the pretraining stage we can push it further down.

Current AI policy: disclosure. When submitting a PR, please declare any parts that had substantial LLM contribution and that you have not written or that you do not fully understand.

## Acknowledgements

- The name (nanochat) derives from my earlier project [nanoGPT](https://github.com/karpathy/nanoGPT), which only covered pretraining.
- nanochat is also inspired by [modded-nanoGPT](https://github.com/KellerJordan/modded-nanogpt), which gamified the nanoGPT repo with clear metrics and a leaderboard, and borrows a lot of its ideas and some implementation for pretraining.
- Thank you to [HuggingFace](https://huggingface.co/) for fineweb and smoltalk.
- Thank you to [Lambda](https://lambda.ai/service/gpu-cloud) for the compute used in developing this project.
- Thank you to chief LLM whisperer 🧙‍♂️ Alec Radford for advice/guidance.
- Thank you to the repo czar Sofie [@svlandeg](https://github.com/svlandeg) for help with managing issues, pull requests, and discussions of nanochat.

## Cite

If you find nanochat helpful in your research, cite it simply as:

```bibtex
@misc{nanochat,
  author = {Andrej Karpathy},
  title = {nanochat: The best ChatGPT that \$100 can buy},
  year = {2025},
  publisher = {GitHub},
  url = {https://github.com/karpathy/nanochat}
}
```

## License

MIT
---

# nanochat Quickstart Guide

nanochat is a minimal LLM-training harness designed for a single GPU node. The code is compact and hackable, covering tokenization, pretraining, finetuning, evaluation, inference, and a chat UI. Turning the single `--depth` dial trains compute-optimal models at different scales, reproducing GPT-2 grade capability at very low cost.

## Environment

*   **OS**: Linux / Unix (recommended)
*   **Hardware**:
    *   **Recommended**: an 8xH100 GPU node (full training run: ~2-3 hours)
    *   **Also supported**: single GPU (A100/H100 etc., slower), CPU, Apple Silicon (MPS)
    *   **VRAM**: 80 GB+ recommended; with less, lower `--device-batch-size`
*   **Prerequisites**:
    *   A Python environment
    *   The [uv](https://docs.astral.sh/uv/) package manager

## Installation

1.  **Install uv** (if not already installed):
    ```bash
    curl -LsSf https://astral.sh/uv/install.sh | sh
    ```

2.  **Sync dependencies** (pick the command matching your hardware):
    ```bash
    uv sync --extra gpu    # for CUDA (A100/H100 etc.)
    # or
    uv sync --extra cpu    # for CPU-only / MPS
    ```

3.  **Activate the virtual environment**:
    ```bash
    source .venv/bin/activate
    ```

    *(Optional) development install (adds pytest, matplotlib, etc.)*:
    ```bash
    uv sync --extra gpu --group dev
    ```
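Before kicking off a long run, a generic PyTorch sanity check may help (this is not a nanochat command, just a quick way to confirm the environment sees your accelerator):

```bash
# Prints the torch version and whether CUDA is visible from the venv.
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```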
## Basic usage

### 1. Train and chat (recommended: 8xH100)

This is the standard path to reproducing a GPT-2 capability model.

*   **Start training**: on an 8xH100 GPU node, run the following script (takes about 3 hours):
    ```bash
    bash runs/speedrun.sh
    ```

*   **Launch the chat UI**: once training finishes, with the virtual environment still active, start the web interface:
    ```bash
    python -m scripts.chat_web
    ```
    Open the URL shown in the terminal (e.g. `http://<IP>:8000/`) and talk to the model as you would with ChatGPT.

### 2. Lightweight runs (CPU / MPS / single GPU)

Without a high-performance multi-GPU setup, use the CPU script for quick experiments (the model is shrunk automatically):

```bash
bash runs/runcpu.sh
```

### 3. Quick experiments (researcher mode)

For fast code-iteration loops, train a 12-layer model (GPT-1 sized) to save time:

```bash
OMP_NUM_THREADS=1 torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- \
    --depth=12 \
    --run="d12" \
    --model-tag="d12" \
    --core-metric-every=999999 \
    --sample-every=-1 \
    --save-every=-1
```

## Notes

*   **VRAM**: if your GPU has less than 80 GB, lower `--device-batch-size` in the scripts (default 32; try 16, 8, 4, ...) to avoid OOM.
*   **Precision**: the compute dtype is auto-detected from the hardware (`bfloat16` on A100/H100). Force a dtype via an environment variable:
    ```bash
    NANOCHAT_DTYPE=float32 python -m scripts.chat_cli -p "hello"
    ```
*   **Single GPU**: omit `torchrun` to run on a single card; the code switches to gradient accumulation automatically, but training takes roughly 8x longer.
*   **Monitoring**: during training, watch `val_bpb` (validation loss), `core_metric` (DCLM CORE score), and VRAM utilization in wandb.

---

# Use case: a university AI lab

A university AI lab wants students to experience the full LLM training pipeline, from data preprocessing to deploying a chat interface, but faces a limited budget and no experience operating large clusters.

### Without nanochat

- Prohibitive cost: reproducing a GPT-2 grade model cost roughly $43,000 in 2019, far beyond a typical course budget.
- Complex configuration: layer count, learning rate, weight decay, and many other hyperparameters must be tuned by hand, a steep barrier.
- Fragmented workflow: training, evaluation, and inference usually mean stitching together different codebases; there is no one-stop pipeline.
- Hardware dependence: full training runs typically assume multi-node clusters and are hard to complete on a single card.

### With nanochat

- Minimal cost: about $48 (2 hours on 8xH100) trains a model of comparable capability, roughly three orders of magnitude cheaper.
- Automatic tuning: set only `--depth`; nanochat computes the optimal hyperparameter combination with no manual intervention.
- End-to-end: nanochat covers every stage from tokenization to the chat UI, in compact code that is easy to hack on.
- Single node: nanochat is built for a single GPU node, so training starts without complex cluster configuration.

nanochat turns LLM training from a money-burning experiment into a development experience anyone can afford, sharply lowering the bar for research and learning.

---

# Project metadata

| Field | Value |
|---|---|
| Owner | Andrej Karpathy ([karpathy](https://github.com/karpathy)), Stanford. "I like to train Deep Neural Nets on large datasets." |
| Stars / forks | 51,057 / 6,733 |
| Last commit | 2026-04-05 |
| License | MIT |
| Tags | language models, developer frameworks |
| Languages | Python 76.4%, Jupyter Notebook 16.7%, HTML 3.8%, Shell 3.1% |
| OS | Linux, macOS |
| GPU | Optional but recommended. NVIDIA A100/H100 (80 GB VRAM) suggested for training; single- or multi-GPU (8x) supported. Below 80 GB, lower `--device-batch-size`. CPU/MPS work with a smaller model. |
| RAM | Not specified |
| Key dependencies | torch, transformers, wandb, uv, pytest, matplotlib, ipykernel |

Environment notes: dependencies are managed with uv. The default compute precision is auto-detected from hardware (bfloat16 on A100/H100, float32 on older cards). Training scripts are tuned for an 8xH100 node; single-GPU runs take about 8x longer. CPU/MPS runs require a much smaller model and yield weaker results.

---

# FAQ

**Which GPU rental services are recommended (especially in the EU)?**
Community recommendations include Nebius (8xH100 at about $23.74/hr, with API and Terraform support), OVHcloud (free credits available, but the platform is slower), and Runpod (good prices). EU users should watch for payment-method and availability restrictions and compare prices across providers. ([source](https://github.com/karpathy/nanochat/issues/10))
**Can I train a model on a single RTX 4090?**
Yes. Users have reported successfully running pretraining on a single 4090. Recommended command: `torchrun --standalone --nproc_per_node=1 -m scripts.base_train -- --depth=20`, with the batch size set to 4. With this configuration, expect training to take roughly 3 days. ([source](https://github.com/karpathy/nanochat/issues/157))

**How do I run the project on a local GPU with 24 GB of VRAM?**
Parameters need adjusting to fit the memory limit. On a single GPU, skipping `torchrun` saves memory (roughly 21 GB down to 17 GB). Example command: `python -m scripts.base_train --depth=20 --device_batch_size=4 --sample_every=100`. Reducing the global batch size lowers memory use further. ([source](https://github.com/karpathy/nanochat/issues/45))

**speedrun.sh hits OOM on 8xH100 — how do I fix it?**
The default settings could cause OOM on some versions. Fixes: either add `--device-batch-size=16` to the command, or set `vocab_size` to 32768. The maintainer has since fixed the default config on master (by shrinking the vocab size), so updating to the latest code is recommended. ([source](https://github.com/karpathy/nanochat/issues/443))

**What about the error where the base-model and SFT checkpoint directories don't match?**
This was a bug in an earlier version. Pull the latest master branch; the path issue has been fixed. If the problem persists, check that your checkout actually includes the fix. ([source](https://github.com/karpathy/nanochat/issues/494))

**Where should I post questions — Issues or Discussions?**
Issues are mainly for tracking bugs and to-dos. General questions, configuration help, and open-ended discussion (hardware compatibility, rental advice) belong in the Discussions forum, where they are easier to archive and search; maintainers will move such Issues to Discussions accordingly. ([source](https://github.com/karpathy/nanochat/issues/157))