[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tool-karpathy--llm.c":3,"similar-karpathy--llm.c":108},{"id":4,"github_repo":5,"name":6,"description_en":7,"description_zh":8,"ai_summary_zh":8,"readme_en":9,"readme_zh":10,"quickstart_zh":11,"use_case_zh":12,"hero_image_url":13,"owner_login":14,"owner_name":15,"owner_avatar_url":16,"owner_bio":17,"owner_company":18,"owner_location":19,"owner_email":20,"owner_twitter":18,"owner_website":21,"owner_url":22,"languages":23,"stars":52,"forks":53,"last_commit_at":54,"license":55,"difficulty_score":56,"env_os":57,"env_gpu":58,"env_ram":59,"env_deps":60,"category_tags":70,"github_topics":18,"view_count":73,"oss_zip_url":18,"oss_zip_packed_at":18,"status":74,"created_at":75,"updated_at":76,"faqs":77,"releases":107},4098,"karpathy\u002Fllm.c","llm.c","LLM training in simple, raw C\u002FCUDA","llm.c 是一个旨在用极简、纯粹的 C 和 CUDA 代码实现大语言模型（LLM）训练项目的开源工具。它摒弃了庞大的依赖库，无需安装数百兆的 PyTorch 或 Python 环境，仅凭一千行左右的清晰代码即可复现 GPT-2 和 GPT-3 系列的预训练过程。\n\n该项目主要解决了传统深度学习框架过于厚重、底层细节被高度封装的问题，让开发者能够直接透视模型训练的每一个数学与内存操作细节。在性能上，llm.c 甚至比最新的 PyTorch 夜间版快约 7%。其独特的技术亮点在于提供了从单文件 CPU 参考实现到高性能 GPU CUDA 内核的完整路径，既保留了代码的可读性，又兼顾了执行效率。\n\nllm.c 特别适合希望深入理解大模型底层原理的开发者、研究人员以及教育者。如果你渴望摆脱黑盒框架，亲手掌控从数据加载、前向传播到反向梯度更新的每一行代码，或者想在资源受限的环境中探索模型训练，llm.c 将是一个极佳的学习与实践平台。它不仅是一个训练工具，更是一本可执行的教科书，帮助用户从零开始构建对大语言模型的深刻认知。","# llm.c\n\nLLMs in simple, pure C\u002FCUDA with no need for 245MB of PyTorch or 107MB of cPython. Current focus is on pretraining, in particular reproducing the [GPT-2](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgpt-2) and [GPT-3](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.14165) miniseries, along with a parallel PyTorch reference implementation in [train_gpt2.py](train_gpt2.py). You'll recognize this file as a slightly tweaked [nanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT), an earlier project of mine. Currently, llm.c is a bit faster than PyTorch Nightly (by about 7%). 
In addition to the bleeding edge mainline code in [train_gpt2.cu](train_gpt2.cu), we have a simple reference CPU fp32 implementation in ~1,000 lines of clean code in one file [train_gpt2.c](train_gpt2.c). I'd like this repo to only maintain C and CUDA code. Ports to other languages or repos are very welcome, but should be done in separate repos, and I am happy to link to them below in the \"notable forks\" section. Developer coordination happens in the [Discussions](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fdiscussions) and on Discord, either the `#llmc` channel on the [Zero to Hero](https:\u002F\u002Fdiscord.gg\u002F3zy8kqD9Cp) channel, or on `#llmdotc` on [GPU MODE](https:\u002F\u002Fdiscord.gg\u002Fgpumode) Discord.\n\n## quick start\n\nThe best introduction to the llm.c repo today is reproducing the GPT-2 (124M) model. [Discussion #481](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fdiscussions\u002F481) steps through this in detail. We can reproduce other models from the GPT-2 and GPT-3 series in both llm.c and in the parallel implementation of PyTorch. Have a look at the [scripts README](scripts\u002FREADME.md).\n\ndebugging tip: when you run the `make` command to build the binary, modify it by replacing `-O3` with `-g` so you can step through the code in your favorite IDE (e.g. vscode).\n\n## quick start (1 GPU, fp32 only)\n\nIf you won't be training on multiple nodes, aren't interested in mixed precision, and are interested in learning CUDA, the fp32 (legacy) files might be of interest to you. These are files that were \"checkpointed\" early in the history of llm.c and frozen in time. They are simpler, more portable, and possibly easier to understand. 
Run the 1 GPU, fp32 code like this:\n\n```bash\nchmod u+x .\u002Fdev\u002Fdownload_starter_pack.sh\n.\u002Fdev\u002Fdownload_starter_pack.sh\nmake train_gpt2fp32cu\n.\u002Ftrain_gpt2fp32cu\n```\n\nThe download_starter_pack.sh script is a quick & easy way to get started and it downloads a bunch of .bin files that help get you off the ground. These contain: 1) the GPT-2 124M model saved in fp32 and in bfloat16, 2) a \"debug state\" used in unit testing (a small batch of data, and target activations and gradients), 3) the GPT-2 tokenizer, and 4) the tokenized [tinyshakespeare](https:\u002F\u002Fraw.githubusercontent.com\u002Fkarpathy\u002Fchar-rnn\u002Fmaster\u002Fdata\u002Ftinyshakespeare\u002Finput.txt) dataset. Alternatively, instead of running the .sh script, you can re-create these artifacts manually as follows:\n\n```bash\npip install -r requirements.txt\npython dev\u002Fdata\u002Ftinyshakespeare.py\npython train_gpt2.py\n```\n\n## quick start (CPU)\n\nThe \"I am so GPU poor that I don't even have one GPU\" section. You can still enjoy seeing llm.c train! But you won't go too far. Just like the fp32 version above, the CPU version is an even earlier checkpoint in the history of llm.c, back when it was just a simple reference implementation in C. 
For example, instead of training from scratch, you can finetune GPT-2 small (124M) to output Shakespeare-like text:\n\n```bash\nchmod u+x .\u002Fdev\u002Fdownload_starter_pack.sh\n.\u002Fdev\u002Fdownload_starter_pack.sh\nmake train_gpt2\nOMP_NUM_THREADS=8 .\u002Ftrain_gpt2\n```\n\nIf you'd prefer to avoid running the starter pack script, then as mentioned in the previous section you can reproduce the exact same .bin files and artifacts by running `python dev\u002Fdata\u002Ftinyshakespeare.py` and then `python train_gpt2.py`.\n\nThe above lines (1) download an already tokenized [tinyshakespeare](https:\u002F\u002Fraw.githubusercontent.com\u002Fkarpathy\u002Fchar-rnn\u002Fmaster\u002Fdata\u002Ftinyshakespeare\u002Finput.txt) dataset, (2) download the GPT-2 (124M) weights, and (3) init from them in C and train for 40 steps on tinyshakespeare with AdamW (using batch size 4, context length only 64), evaluate validation loss, and sample some text. Honestly, unless you have a beefy CPU (and can crank up the number of OMP threads in the launch command), you're not going to get that far on CPU training LLMs, but it might be a good demo\u002Freference. The output looks like this on my MacBook Pro (Apple Silicon M3 Max):\n\n```\n[GPT-2]\nmax_seq_len: 1024\nvocab_size: 50257\nnum_layers: 12\nnum_heads: 12\nchannels: 768\nnum_parameters: 124439808\ntrain dataset num_batches: 1192\nval dataset num_batches: 128\nnum_activations: 73323776\nval loss 5.252026\nstep 0: train loss 5.356189 (took 1452.121000 ms)\nstep 1: train loss 4.301069 (took 1288.673000 ms)\nstep 2: train loss 4.623322 (took 1369.394000 ms)\nstep 3: train loss 4.600470 (took 1290.761000 ms)\n... 
(truncated) ...\nstep 39: train loss 3.970751 (took 1323.779000 ms)\nval loss 4.107781\ngenerating:\n---\nCome Running Away,\nGreater conquer\nWith the Imperial blood\nthe heaviest host of the gods\ninto this wondrous world beyond.\nI will not back thee, for how sweet after birth\nNetflix against repounder,\nwill not\nflourish against the earlocks of\nAllay\n---\n```\n\n## datasets\n\nThe data files inside `\u002Fdev\u002Fdata\u002F(dataset).py` are responsible for downloading, tokenizing and saving the tokens to .bin files, readable easily from C. So for example when you run:\n\n```bash\npython dev\u002Fdata\u002Ftinyshakespeare.py\n```\n\nWe download and tokenize the [tinyshakespeare](https:\u002F\u002Fraw.githubusercontent.com\u002Fkarpathy\u002Fchar-rnn\u002Fmaster\u002Fdata\u002Ftinyshakespeare\u002Finput.txt) dataset. The output looks like this:\n\n```\nwriting 32,768 tokens to .\u002Fdev\u002Fdata\u002Ftinyshakespeare\u002Ftiny_shakespeare_val.bin\nwriting 305,260 tokens to .\u002Fdev\u002Fdata\u002Ftinyshakespeare\u002Ftiny_shakespeare_train.bin\n```\n\nThe .bin files contain a short header (1024 bytes) and then a stream of tokens in uint16, indicating the token ids with the GPT-2 tokenizer. More datasets are available in `\u002Fdev\u002Fdata`.\n\n## test\n\nI am also attaching a simple unit test for making sure our C code agrees with the PyTorch code. On the CPU as an example, compile and run with:\n\n```bash\nmake test_gpt2\n.\u002Ftest_gpt2\n```\n\nThis now loads the `gpt2_124M_debug_state.bin` file that gets written by train_gpt2.py, runs a forward pass, compares the logits and loss with the PyTorch reference implementation, then it does 10 iterations of training with Adam and makes sure the losses match PyTorch. 
To test the GPU version we run:\n\n```bash\n# fp32 test (cudnn not supported)\nmake test_gpt2cu PRECISION=FP32 && .\u002Ftest_gpt2cu\n# mixed precision cudnn test\nmake test_gpt2cu USE_CUDNN=1 && .\u002Ftest_gpt2cu\n```\n\nThis tests both the fp32 path and the mixed precision path. The test should pass and print `overall okay: 1`.\n\n## tutorial\n\nI attached a very small tutorial here, in [doc\u002Flayernorm\u002Flayernorm.md](doc\u002Flayernorm\u002Flayernorm.md). It's a simple, step-by-step guide to implementing a single layer of the GPT-2 model, the layernorm layer. This is a good starting point to understand how the layers are implemented in C.\n\n**flash attention**. As of May 1, 2024 we use the Flash Attention from cuDNN. Because cuDNN bloats the compile time from a few seconds to ~minute and this code path is right now very new, this is disabled by default. You can enable it by compiling like this:\n\n```bash\nmake train_gpt2cu USE_CUDNN=1\n```\n\nThis will try to compile with cudnn and run it. You have to have cuDNN installed on your system. The [cuDNN installation instructions](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcudnn) with apt-get will grab the default set of cuDNN packages. For a minimal setup, the cuDNN dev package is sufficient, e.g. on Ubuntu 22.04 for CUDA 12.x:\n\n```bash\nwget https:\u002F\u002Fdeveloper.download.nvidia.com\u002Fcompute\u002Fcuda\u002Frepos\u002Fubuntu2204\u002Fx86_64\u002Fcuda-keyring_1.1-1_all.deb\nsudo dpkg -i cuda-keyring_1.1-1_all.deb\nsudo apt-get update\nsudo apt-get -y install libcudnn9-dev-cuda-12\n```\n\nOn top of this you need the [cuDNN frontend](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcudnn-frontend\u002Ftree\u002Fmain), but this is just header files. Simply clone the repo to your disk. The Makefile currently looks for it in either your home directory or the current directory. 
If you have put it elsewhere, add `CUDNN_FRONTEND_PATH=\u002Fpath\u002Fto\u002Fyour\u002Fcudnn-frontend\u002Finclude` to the `make` command-line.\n\n## multi-GPU training\n\nMake sure you install MPI and NCCL, e.g. on Linux:\n\n```bash\nsudo apt install openmpi-bin openmpi-doc libopenmpi-dev\n```\n\nFor NCCL follow the instructions from the [official website](https:\u002F\u002Fdeveloper.nvidia.com\u002Fnccl\u002Fnccl-download) (e.g. network installer)\n\nand then:\n\n```bash\nmake train_gpt2cu\nmpirun -np \u003Cnumber of GPUs> .\u002Ftrain_gpt2cu\n```\n\nor simply run one of our scripts under `.\u002Fscripts\u002F`.\n\n## multi-node training\n\nMake sure you've installed `NCCL` following instructions from [multi-GPU](#multi-gpu-training) section.\n\nThere are 3 ways we currently support that allow you to run multi-node training:\n1) Use OpenMPI to exchange nccl id and initialize NCCL. See e.g. `.\u002Fscripts\u002Fmulti_node\u002Frun_gpt2_124M_mpi.sh` script for details.\n2) Use shared file system to init NCCL. See `.\u002Fscripts\u002Fmulti_node\u002Frun_gpt2_124M_fs.sbatch` script for details.\n3) Use TCP sockets to init NCCL. See `.\u002Fscripts\u002Fmulti_node\u002Frun_gpt2_124M_tcp.sbatch` script for details.\n\nNote:\n* If you're running in a slurm environment and your slurm doesn't support PMIx (which we assume will be a common situation given that `slurm-wlm` dropped PMIx support) you will have to use FS (2) or TCP (3) approach. To test whether your slurm supports PMIx run: `srun --mpi=list` and see whether you get `pmix` in the output.\n* If you don't have slurm set up, you can kick off a multi-node run using `mpirun` - MPI (1).\n\nNone of these 3 methods is superior, we just offer you options so that you can run in your specific environment.\n\n## experiments \u002F sweeps\n\nJust as an example process to sweep learning rates on a machine with 4 GPUs on TinyStories. 
Run a shell script `sweep.sh` (after you of course `chmod u+x sweep.sh`):\n\n```bash\n#!\u002Fbin\u002Fbash\n\nlearning_rates=(3e-5 1e-4 3e-4 1e-3)\n\nfor i in {0..3}; do\n    export CUDA_VISIBLE_DEVICES=$i\n    screen -dmS \"tr$i\" bash -c \".\u002Ftrain_gpt2cu -i data\u002FTinyStories -v 250 -s 250 -g 144 -l ${learning_rates[$i]} -o stories$i.log\"\ndone\n\n# you can bring these down with\n# screen -ls | grep -E \"tr[0-3]\" | cut -d. -f1 | xargs -I {} screen -X -S {} quit\n```\n\nThis example opens up 4 screen sessions and runs the four commands with different LRs. This writes the log files `stories$i.log` with all the losses, which you can plot as you wish in Python. A quick example of how to parse and plot these logfiles is in [dev\u002Fvislog.ipynb](dev\u002Fvislog.ipynb).\n\n## repo\n\nA few more words on what I want this repo to be:\n\nFirst, I want `llm.c` to be a place for education. E.g. our `dev\u002Fcuda` folder is a place for a library of kernels for all the layers that are manually hand-written and very well documented, starting from very simple kernels all the way to more complex \u002F faster kernels. If you have a new kernel with various different tradeoffs, please feel free to contribute it here.\n\nThat said, I also want `llm.c` to be very fast too, even practically useful to train networks. E.g. to start, we should be able to reproduce the big GPT-2 (1.6B) training run. This requires that we incorporate whatever fastest kernels there are, including the use of libraries such as cuBLAS, cuBLASLt, CUTLASS, cuDNN, etc. I also think doing so serves an educational purpose to establish an expert upper bound, and a unit of measurement, e.g. you could say that your manually written kernels are 80% of cuBLAS speed, etc. 
Then you can choose to do a super fast run, or you can choose to \"drag and drop\" whatever manual kernels you wish to use, and run with those.\n\nHowever, as a constraint, I want to keep the mainline `llm.c` in the root folder simple and readable. If there is a PR that e.g. improves performance by 2% but it \"costs\" 500 lines of complex C code, and maybe an exotic 3rd party dependency, I may reject the PR because the complexity is not worth it. As a concrete example - making cuBLAS for matmuls the default in the root training loop is a no-brainer: it makes the mainline code much faster, it is a single line of interpretable code, and it is a very common dependency. On the side of this, we can have manual implementations that can compete with cuBLAS in `dev\u002Fcuda`.\n\nLastly, I will be a lot more sensitive to complexity in the root folder of the project, which contains the main \u002F default files of the project. In comparison, the `dev\u002F` folder is a bit more of a scratch space for us to develop a library of kernels or classes and share useful or related or educational code, and some of this code could be ok to be (locally) complex.\n\n## notable forks\n\n- AMD support\n  - [llm.c](https:\u002F\u002Fgithub.com\u002Fanthonix\u002Fllm.c) by @[anthonix](https:\u002F\u002Fgithub.com\u002Fanthonix): support for AMD devices, such as the 7900 XTX\n\n- C#\n  - [llm.cs](https:\u002F\u002Fgithub.com\u002Fazret\u002Fllm.cs) by @[azret](https:\u002F\u002Fgithub.com\u002Fazret): a C# port of this project\n  - [Llm.cs](https:\u002F\u002Fgithub.com\u002Fnietras\u002FLlm.cs) by @[nietras](https:\u002F\u002Fgithub.com\u002Fnietras): a C# port of this project with focus on easy to get started on any platform. 
Clone and run ✅\n\n- CUDA C++\n  - [llm.cpp](https:\u002F\u002Fgithub.com\u002Fgevtushenko\u002Fllm.c) by @[gevtushenko](https:\u002F\u002Fgithub.com\u002Fgevtushenko): a port of this project using the [CUDA C++ Core Libraries](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl)\n     - This fork was presented in [this lecture](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=WiB_3Csfj_Q) in the [GPU MODE Discord Server](https:\u002F\u002Fdiscord.gg\u002Fcudamode)\n\n- C++\u002FCUDA\n  - [llm.cpp](https:\u002F\u002Fgithub.com\u002Fzhangpiu\u002Fllm.cpp\u002Ftree\u002Fmaster\u002Fllmcpp) by @[zhangpiu](https:\u002F\u002Fgithub.com\u002Fzhangpiu): a port of this project using [Eigen](https:\u002F\u002Fgitlab.com\u002Flibeigen\u002Feigen), supporting CPU\u002FCUDA.\n\n- WebGPU C++\n  - [gpu.cpp](https:\u002F\u002Fgithub.com\u002FAnswerDotAI\u002Fgpu.cpp) by @[austinvhuang](https:\u002F\u002Fgithub.com\u002Faustinvhuang): a library for portable GPU compute in C++ using native WebGPU. 
Aims to be a general-purpose library, but also porting llm.c kernels to WGSL.\n  \n- C++\n  - [llm.cpp](https:\u002F\u002Fgithub.com\u002FGaoYusong\u002Fllm.cpp) by @[GaoYusong](https:\u002F\u002Fgithub.com\u002FGaoYusong): a port of this project featuring a C++ single-header [tinytorch.hpp](https:\u002F\u002Fgithub.com\u002FGaoYusong\u002Fllm.cpp\u002Fblob\u002Fmain\u002Ftinytorch.hpp) library\n\n- Go\n  - [llm.go](https:\u002F\u002Fgithub.com\u002Fjoshcarp\u002Fllm.go) by @[joshcarp](https:\u002F\u002Fgithub.com\u002Fjoshcarp): a Go port of this project\n\n- Java\n  - [llm.java](https:\u002F\u002Fgithub.com\u002Fharryjackson\u002Fllm.java) by @[harryjackson](https:\u002F\u002Fgithub.com\u002Fharryjackson): a Java port of this project\n\n- Metal\n  - [llm.metal](https:\u002F\u002Fgithub.com\u002Fregrettable-username\u002Fllm.metal) by @[regrettable-username](https:\u002F\u002Fgithub.com\u002Fregrettable-username): LLM training in simple, raw C\u002FMetal Shading Language\n\n- Mojo\n  - [llm.🔥](https:\u002F\u002Fgithub.com\u002Fdorjeduck\u002Fllm.mojo) by @[dorjeduck](https:\u002F\u002Fgithub.com\u002Fdorjeduck): a Mojo port of this project\n\n- OpenCL\n  - [llm.c](https:\u002F\u002Fgithub.com\u002Fkrrishnarraj\u002Fllm.c) by @[krrishnarraj](https:\u002F\u002Fgithub.com\u002Fkrrishnarraj): an OpenCL port of this project\n\n- Rust\n  -  [llm.rs](https:\u002F\u002Fgithub.com\u002Fyijunyu\u002Fllm.rs) by @[Yijun Yu](https:\u002F\u002Fgithub.com\u002Fyijunyu): a Rust rewrite with the aim to have same performance\n  -  [llm.rs](https:\u002F\u002Fgithub.com\u002FToJen\u002Fllm.rs) by @[ToJen](https:\u002F\u002Fgithub.com\u002FToJen): a Rust port of this project\n\n- Swift\n  - [llm.swift](https:\u002F\u002Fgithub.com\u002Fotabuzzman\u002Fllm.swift) by @[otabuzzman](https:\u002F\u002Fgithub.com\u002Fotabuzzman): a Swift port of this project\n\n- Zig\n  - [llm.zig](https:\u002F\u002Fgithub.com\u002FSaimirbaci\u002Fllm.zig) by 
@[saimirbaci](https:\u002F\u002Fgithub.com\u002FSaimirbaci): a Zig port of this project\n \n- Habana Gaudi2\n  - [llm.tpc](https:\u002F\u002Fgithub.com\u002Fabhilash1910\u002Fllm.tpc) by @[abhilash1910](https:\u002F\u002Fgithub.com\u002Fabhilash1910): a Habana Gaudi2 port of this project \n\n- Nim\n  - [llm.nim](https:\u002F\u002Fgithub.com\u002FVindaar\u002Fllm.nim) by @[Vindaar](https:\u002F\u002Fgithub.com\u002FVindaar): a Nim port of this project\n\n## discussions\n\nWays of organizing development:\n\n- Experiencing a concrete issue with the repo? Use [Issues](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fissues).\n- Have some code to contribute? Open a [PR](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fpulls)\n- Chat about the repo, ask questions, etc.? Look at [Discussions](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fdiscussions).\n- Something faster? I created a new `#llmc` channel on my [Zero to Hero Discord channel](https:\u002F\u002Fdiscord.gg\u002F3zy8kqD9Cp).\n\n## license\n\nMIT\n","# llm.c\n\n用简洁、纯粹的 C\u002FCUDA 实现大型语言模型，无需依赖 245MB 的 PyTorch 或 107MB 的 cPython。目前的重点是预训练，尤其是复现 [GPT-2](https:\u002F\u002Fgithub.com\u002Fopenai\u002Fgpt-2) 和 [GPT-3](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.14165) 系列小模型，并提供一个并行的 PyTorch 参考实现，位于 [train_gpt2.py](train_gpt2.py) 中。你可能会发现这个文件与我早期的一个项目 [nanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT) 有些许调整。目前，llm.c 的速度比 PyTorch Nightly 版本快约 7%。除了位于 [train_gpt2.cu](train_gpt2.cu) 中的最前沿主线代码外，我们还提供了一个简单的 CPU fp32 参考实现，包含约 1,000 行干净的代码，全部写在一个文件 [train_gpt2.c](train_gpt2.c) 中。我希望这个仓库只维护 C 和 CUDA 代码。其他语言或仓库的移植非常欢迎，但应放在独立的仓库中，我乐意在下方的“值得关注的分支”部分链接它们。开发者之间的协作通过 [Discussions](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fdiscussions) 和 Discord 进行，可以在 [Zero to Hero](https:\u002F\u002Fdiscord.gg\u002F3zy8kqD9Cp) 频道的 `#llmc` 频道，或者在 [GPU MODE](https:\u002F\u002Fdiscord.gg\u002Fgpumode) Discord 的 `#llmdotc` 频道进行。\n\n## 快速入门\n\n今天了解 llm.c 仓库的最佳方式就是复现 
GPT-2（1.24亿参数）模型。[讨论 #481](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fdiscussions\u002F481) 详细介绍了这一过程。我们可以在 llm.c 和 PyTorch 的并行实现中复现 GPT-2 和 GPT-3 系列的其他模型。请查看 [scripts README](scripts\u002FREADME.md)。\n\n调试提示：当你运行 `make` 命令构建二进制文件时，可以将 `-O3` 替换为 `-g`，以便在你喜欢的 IDE（例如 VS Code）中逐步调试代码。\n\n## 快速入门（单 GPU，仅 fp32）\n\n如果你不需要在多节点上训练，也不关心混合精度，并且对学习 CUDA 感兴趣，那么 fp32（旧版）文件可能更适合你。这些文件是在 llm.c 发展初期“保存”的版本，此后便不再更新。它们更简单、更易移植，也可能更容易理解。你可以这样运行单 GPU 的 fp32 代码：\n\n```bash\nchmod u+x .\u002Fdev\u002Fdownload_starter_pack.sh\n.\u002Fdev\u002Fdownload_starter_pack.sh\nmake train_gpt2fp32cu\n.\u002Ftrain_gpt2fp32cu\n```\n\ndownload_starter_pack.sh 脚本是一个快速简便的入门方式，它会下载一些 .bin 文件帮助你快速上手。这些文件包括：1) 以 fp32 和 bfloat16 格式保存的 GPT-2 1.24亿参数模型；2) 用于单元测试的“调试状态”（一小批数据以及目标激活值和梯度）；3) GPT-2 分词器；以及 3) 经过分词的 [tinyshakespeare](https:\u002F\u002Fraw.githubusercontent.com\u002Fkarpathy\u002Fchar-rnn\u002Fmaster\u002Fdata\u002Ftinyshakespeare\u002Finput.txt) 数据集。或者，你也可以不运行该脚本，而是手动重新创建这些文件和工件，如下所示：\n\n```bash\npip install -r requirements.txt\npython dev\u002Fdata\u002Ftinyshakespeare.py\npython train_gpt2.py\n```\n\n## 快速入门（CPU）\n\n适用于“连一块 GPU 都没有”的情况。即使没有 GPU，你仍然可以体验 llm.c 的训练过程！不过进展不会太远。与上述 fp32 版本一样，CPU 版本是 llm.c 更早时期的快照，那时它还只是一个简单的 C 语言参考实现。例如，与其从头开始训练，不如对 GPT-2 小模型（1.24亿参数）进行微调，使其生成类似莎士比亚风格的文本，示例如下：\n\n```bash\nchmod u+x .\u002Fdev\u002Fdownload_starter_pack.sh\n.\u002Fdev\u002Fdownload_starter_pack.sh\nmake train_gpt2\nOMP_NUM_THREADS=8 .\u002Ftrain_gpt2\n```\n\n如果你不想运行启动包脚本，也可以像前一节提到的那样，通过运行 `python dev\u002Fdata\u002Ftinyshakespeare.py` 和 `python train_gpt2.py` 来重现完全相同的 .bin 文件和工件。\n\n以上步骤（1）下载已经分词的 [tinyshakespeare](https:\u002F\u002Fraw.githubusercontent.com\u002Fkarpathy\u002Fchar-rnn\u002Fmaster\u002Fdata\u002Ftinyshakespeare\u002Finput.txt) 数据集，并下载 GPT-2（1.24亿参数）的权重；（3）使用这些权重在 C 语言中初始化模型，在 tinyshakespeare 数据集上以 AdamW 优化器训练 40 步（批量大小为 4，上下文长度仅为 64），评估验证损失，并采样生成一些文本。老实说，除非你有一台性能强劲的 CPU（并且能够在启动命令中增加 OMP 线程数），否则在 CPU 上训练大型语言模型的效果不会太好，但这或许可以作为一个不错的演示或参考。以下是我 MacBook Pro（Apple Silicon M3 
Max）上的输出结果：\n\n```\n[GPT-2]\nmax_seq_len: 1024\nvocab_size: 50257\nnum_layers: 12\nnum_heads: 12\nchannels: 768\nnum_parameters: 124439808\ntrain dataset num_batches: 1192\nval dataset num_batches: 128\nnum_activations: 73323776\nval loss 5.252026\nstep 0: train loss 5.356189（耗时 1452.121000 ms）\nstep 1: train loss 4.301069（耗时 1288.673000 ms）\nstep 2: train loss 4.623322（耗时 1369.394000 ms）\nstep 3: train loss 4.600470（耗时 1290.761000 ms）\n...（截断）...\nstep 39: train loss 3.970751（耗时 1323.779000 ms）\nval loss 4.107781\n生成：\n---\n奔逃而来，\n更大的征服者，\n带着皇室血脉，\n众神中最强大的军队，\n进入这奇妙的世界之外。\n我不会支持你，因为出生之后是多么甜蜜啊——\nNetflix 对抗 repounder，\n将不会\n繁荣于 Allay 的耳环之下——\n---\n```\n\n## 数据集\n\n位于 `\u002Fdev\u002Fdata\u002F(dataset).py` 中的数据文件负责下载、分词并将分词结果保存为 .bin 文件，以便 C 语言轻松读取。例如，当你运行：\n\n```bash\npython dev\u002Fdata\u002Ftinyshakespeare.py\n```\n\n我们会下载并分词 [tinyshakespeare](https:\u002F\u002Fraw.githubusercontent.com\u002Fkarpathy\u002Fchar-rnn\u002Fmaster\u002Fdata\u002Ftinyshakespeare\u002Finput.txt) 数据集。其输出如下：\n\n```\n将 32,768 个 token 写入 .\u002Fdev\u002Fdata\u002Ftinyshakespeare\u002Ftiny_shakespeare_val.bin\n将 305,260 个 token 写入 .\u002Fdev\u002Fdata\u002Ftinyshakespeare\u002Ftiny_shakespeare_train.bin\n```\n\n这些 .bin 文件包含一个 1024 字节的头部，随后是以 uint16 格式存储的 token 流，表示使用 GPT-2 分词器得到的 token ID。更多数据集可在 `\u002Fdev\u002Fdata` 中找到。\n\n## 测试\n\n我还附带了一个简单的单元测试，用于确保我们的 C 代码与 PyTorch 代码一致。以 CPU 为例，编译并运行如下命令：\n\n```bash\nmake test_gpt2\n.\u002Ftest_gpt2\n```\n\n此测试会加载由 train_gpt2.py 生成的 `gpt2_124M_debug_state.bin` 文件，执行一次前向传播，比较 logits 和损失与 PyTorch 参考实现的结果，然后进行 10 次 Adam 优化迭代，确保损失与 PyTorch 的结果一致。对于 GPU 版本，我们运行：\n\n```bash\n# fp32 测试（不支持 cuDNN）\nmake test_gpt2cu PRECISION=FP32 && .\u002Ftest_gpt2cu\n# 混合精度 cuDNN 测试\nmake test_gpt2cu USE_CUDNN=1 && .\u002Ftest_gpt2cu\n```\n\n这将分别测试 fp32 路径和混合精度路径。测试应该通过，并打印出 `overall okay: 1`。\n\n## 教程\n\n我在这里附上了一个非常简短的教程，位于 [doc\u002Flayernorm\u002Flayernorm.md](doc\u002Flayernorm\u002Flayernorm.md)。它是一个简单、循序渐进的指南，用于实现 GPT-2 模型中的单个层——LayerNorm 层。这是一个很好的起点，可以帮助你理解 C 
语言中各层的具体实现方式。\n\n**Flash Attention**。截至 2024 年 5 月 1 日，我们使用 cuDNN 提供的 Flash Attention。由于 cuDNN 会将编译时间从几秒延长到约一分钟，且该代码路径目前仍处于早期阶段，因此默认情况下已禁用此功能。你可以通过以下方式启用它：\n\n```bash\nmake train_gpt2cu USE_CUDNN=1\n```\n\n这将尝试使用 cuDNN 进行编译并运行。你需要在系统上安装 cuDNN。按照 [cuDNN 安装说明](https:\u002F\u002Fdeveloper.nvidia.com\u002Fcudnn)，使用 apt-get 安装会获取默认的 cuDNN 软件包集。对于最小化配置，仅安装 cuDNN 开发包即可，例如在 Ubuntu 22.04 上针对 CUDA 12.x 的安装命令如下：\n\n```bash\nwget https:\u002F\u002Fdeveloper.download.nvidia.com\u002Fcompute\u002Fcuda\u002Frepos\u002Fubuntu2204\u002Fx86_64\u002Fcuda-keyring_1.1-1_all.deb\nsudo dpkg -i cuda-keyring_1.1-1_all.deb\nsudo apt-get update\nsudo apt-get -y install libcudnn9-dev-cuda-12\n```\n\n此外，你还需要 [cuDNN 前端库](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcudnn-frontend\u002Ftree\u002Fmain)，但它只是头文件。只需将该仓库克隆到你的磁盘上即可。当前的 Makefile 会在你的主目录或当前目录中查找它。如果你将其放置在其他位置，请在 `make` 命令行中添加 `CUDNN_FRONTEND_PATH=\u002Fpath\u002Fto\u002Fyour\u002Fcudnn-frontend\u002Finclude`。\n\n## 多 GPU 训练\n\n请确保已安装 MPI 和 NCCL，例如在 Linux 系统上：\n\n```bash\nsudo apt install openmpi-bin openmpi-doc libopenmpi-dev\n```\n\n对于 NCCL，请遵循 [官方网站](https:\u002F\u002Fdeveloper.nvidia.com\u002Fnccl\u002Fnccl-download) 的说明（例如使用网络安装程序），然后执行以下命令：\n\n```bash\nmake train_gpt2cu\nmpirun -np \u003CGPU数量> .\u002Ftrain_gpt2cu\n```\n\n或者直接运行 `.\u002Fscripts\u002F` 目录下的脚本之一。\n\n## 多节点训练\n\n请确保已按照“多 GPU 训练”部分的说明安装了 `NCCL`。\n\n我们目前支持三种多节点训练的方式：\n\n1) 使用 OpenMPI 交换 NCCL ID 并初始化 NCCL。详情请参阅 `.\u002Fscripts\u002Fmulti_node\u002Frun_gpt2_124M_mpi.sh` 脚本。\n2) 使用共享文件系统初始化 NCCL。详情请参阅 `.\u002Fscripts\u002Fmulti_node\u002Frun_gpt2_124M_fs.sbatch` 脚本。\n3) 使用 TCP 套接字初始化 NCCL。详情请参阅 `.\u002Fscripts\u002Fmulti_node\u002Frun_gpt2_124M_tcp.sbatch` 脚本。\n\n注意：\n* 如果你在 Slurm 环境中运行，而你的 Slurm 不支持 PMIx（考虑到 `slurm-wlm` 已经放弃了对 PMIx 的支持，这种情况很常见），则必须使用文件系统（2）或 TCP（3）方法。要测试你的 Slurm 是否支持 PMIx，请运行：`srun --mpi=list`，查看输出中是否包含 `pmix`。\n* 如果没有设置 Slurm，也可以使用 `mpirun` 来启动多节点运行——即 MPI（1）方法。\n\n这三种方法并无优劣之分，我们只是为你提供多种选择，以便你能够在特定环境中进行训练。\n\n## 实验 \u002F 参数扫描\n\n以一台拥有 4 个 GPU 的机器为例，对 
TinyStories 数据集上的学习率进行扫描。运行一个名为 `sweep.sh` 的 Shell 脚本（当然，在运行之前需要先执行 `chmod u+x sweep.sh`）：\n\n```bash\n#!\u002Fbin\u002Fbash\n\nlearning_rates=(3e-5 1e-4 3e-4 1e-3)\n\nfor i in {0..3}; do\n    export CUDA_VISIBLE_DEVICES=$i\n    screen -dmS \"tr$i\" bash -c \".\u002Ftrain_gpt2cu -i data\u002FTinyStories -v 250 -s 250 -g 144 -l ${learning_rates[$i]} -o stories$i.log\"\ndone\n\n# 可以使用以下命令关闭这些会话：\n# screen -ls | grep -E \"tr[0-3]\" | cut -d. -f1 | xargs -I {} screen -X -S {} quit\n```\n\n此示例会打开 4 个 screen 会话，并分别使用不同的学习率运行命令。日志文件 `stories$i.log` 中会记录所有损失值，你可以根据需要使用 Python 绘制图表。关于如何解析和绘制这些日志文件的快速示例，请参阅 [dev\u002Fvislog.ipynb](dev\u002Fvislog.ipynb)。\n\n## 仓库\n\n关于我希望这个仓库成为什么样子，再补充几句：\n\n首先，我希望 `llm.c` 成为一个教育平台。例如，我们的 `dev\u002Fcuda` 文件夹就是一个库，存放着所有手动编写且文档详尽的层级内核，从最简单的内核开始，逐步过渡到更复杂、更高效的内核。如果你有新的内核，并且在不同权衡之间有所取舍，请随时在此贡献。\n\n与此同时，我也希望 `llm.c` 能够足够高效，甚至在实际应用中可用于训练神经网络。例如，作为起点，我们应该能够复现大型 GPT-2（16 亿参数）的训练过程。这就要求我们整合所有最快的内核，包括使用 cuBLAS、cuBLASLt、CUTLASS、cuDNN 等库。我认为这样做也有助于教育目的，可以建立一个专家级别的上限，作为一种衡量标准——比如，你可以说自己编写的内核达到了 cuBLAS 速度的 80% 等。这样，你可以选择进行超快速的训练，也可以选择“拖放”任意想要使用的手动内核来运行。\n\n然而，出于约束考虑，我希望保持根目录下 `llm.c` 主线代码的简洁和易读性。如果某个 PR 虽然能将性能提升 2%，但却增加了 500 行复杂的 C 代码，还可能引入一些奇特的第三方依赖，那么我可能会拒绝该 PR，因为其复杂性并不值得。举个具体的例子：在根目录的训练循环中将矩阵乘法的 cuBLAS 实现设为默认，这显然是明智之举——它不仅使主线代码运行得更快，而且只有一行可读的代码，同时也是一个非常常见的依赖项。与此同时，我们可以在 `dev\u002Fcuda` 中保留与 cuBLAS 竞争的手动实现。\n\n最后，我对项目根目录的复杂度会更加敏感，因为那里包含了项目的主文件和默认文件。相比之下，`dev\u002F` 文件夹更像是一个试验空间，用于开发内核或类库，并分享有用、相关或具有教育意义的代码，其中一些代码即使局部较为复杂也无妨。\n\n## 著名的分支\n\n- AMD 支持\n  - [llm.c](https:\u002F\u002Fgithub.com\u002Fanthonix\u002Fllm.c) 由 @[anthonix](https:\u002F\u002Fgithub.com\u002Fanthonix) 开发：支持 AMD 设备，例如 7900 XTX\n\n- C#\n  - [llm.cs](https:\u002F\u002Fgithub.com\u002Fazret\u002Fllm.cs) 由 @[azret](https:\u002F\u002Fgithub.com\u002Fazret) 开发：该项目的 C# 移植版本\n  - [Llm.cs](https:\u002F\u002Fgithub.com\u002Fnietras\u002FLlm.cs) 由 @[nietras](https:\u002F\u002Fgithub.com\u002Fnietras) 开发：专注于在任何平台上轻松上手的 C# 移植版本。克隆并运行 ✅\n\n- CUDA C++\n  - 
[llm.cpp](https:\u002F\u002Fgithub.com\u002Fgevtushenko\u002Fllm.c) 由 @[gevtushenko](https:\u002F\u002Fgithub.com\u002Fgevtushenko) 开发：使用 [CUDA C++ 核心库](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fcccl) 的移植版本\n     - 该分支曾在 [GPU MODE Discord 服务器](https:\u002F\u002Fdiscord.gg\u002Fcudamode) 的 [本次讲座](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=WiB_3Csfj_Q) 中被提及。\n\n- C++\u002FCUDA\n  - [llm.cpp](https:\u002F\u002Fgithub.com\u002Fzhangpiu\u002Fllm.cpp\u002Ftree\u002Fmaster\u002Fllmcpp) 由 @[zhangpiu](https:\u002F\u002Fgithub.com\u002Fzhangpiu) 开发：使用 [Eigen](https:\u002F\u002Fgitlab.com\u002Flibeigen\u002Feigen) 的移植版本，支持 CPU\u002FCUDA。\n\n- WebGPU C++\n  - [gpu.cpp](https:\u002F\u002Fgithub.com\u002FAnswerDotAI\u002Fgpu.cpp) 由 @[austinvhuang](https:\u002F\u002Fgithub.com\u002Faustinvhuang) 开发：一个基于原生 WebGPU 的 C++ 可移植 GPU 计算库。旨在成为一个通用库，同时也将 llm.c 内核移植到 WGSL。\n\n- C++\n  - [llm.cpp](https:\u002F\u002Fgithub.com\u002FGaoYusong\u002Fllm.cpp) 由 @[GaoYusong](https:\u002F\u002Fgithub.com\u002FGaoYusong) 开发：该项目的 C++ 移植版本，包含一个 C++ 单头文件 [tinytorch.hpp](https:\u002F\u002Fgithub.com\u002FGaoYusong\u002Fllm.cpp\u002Fblob\u002Fmain\u002Ftinytorch.hpp) 库。\n\n- Go\n  - [llm.go](https:\u002F\u002Fgithub.com\u002Fjoshcarp\u002Fllm.go) 由 @[joshcarp](https:\u002F\u002Fgithub.com\u002Fjoshcarp) 开发：该项目的 Go 移植版本。\n\n- Java\n  - [llm.java](https:\u002F\u002Fgithub.com\u002Fharryjackson\u002Fllm.java) 由 @[harryjackson](https:\u002F\u002Fgithub.com\u002Fharryjackson) 开发：该项目的 Java 移植版本。\n\n- Metal\n  - [llm.metal](https:\u002F\u002Fgithub.com\u002Fregrettable-username\u002Fllm.metal) 由 @[regrettable-username](https:\u002F\u002Fgithub.com\u002Fregrettable-username) 开发：使用简单的原始 C\u002FMetal 着色语言进行 LLM 训练。\n\n- Mojo\n  - [llm.🔥](https:\u002F\u002Fgithub.com\u002Fdorjeduck\u002Fllm.mojo) 由 @[dorjeduck](https:\u002F\u002Fgithub.com\u002Fdorjeduck) 开发：该项目的 Mojo 移植版本。\n\n- OpenCL\n  - [llm.c](https:\u002F\u002Fgithub.com\u002Fkrrishnarraj\u002Fllm.c) 由 
@[krrishnarraj](https:\u002F\u002Fgithub.com\u002Fkrrishnarraj) 开发：该项目的 OpenCL 移植版本。\n\n- Rust\n  - [llm.rs](https:\u002F\u002Fgithub.com\u002Fyijunyu\u002Fllm.rs) 由 @[Yijun Yu](https:\u002F\u002Fgithub.com\u002Fyijunyu) 开发：以达到相同性能为目标的 Rust 重写版本\n  - [llm.rs](https:\u002F\u002Fgithub.com\u002FToJen\u002Fllm.rs) 由 @[ToJen](https:\u002F\u002Fgithub.com\u002FToJen) 开发：该项目的 Rust 移植版本。\n\n- Swift\n  - [llm.swift](https:\u002F\u002Fgithub.com\u002Fotabuzzman\u002Fllm.swift) 由 @[otabuzzman](https:\u002F\u002Fgithub.com\u002Fotabuzzman) 开发：该项目的 Swift 移植版本。\n\n- Zig\n  - [llm.zig](https:\u002F\u002Fgithub.com\u002FSaimirbaci\u002Fllm.zig) 由 @[saimirbaci](https:\u002F\u002Fgithub.com\u002FSaimirbaci) 开发：该项目的 Zig 移植版本。\n\n- Habana Gaudi2\n  - [llm.tpc](https:\u002F\u002Fgithub.com\u002Fabhilash1910\u002Fllm.tpc) 由 @[abhilash1910](https:\u002F\u002Fgithub.com\u002Fabhilash1910) 开发：该项目的 Habana Gaudi2 移植版本。\n\n- Nim\n  - [llm.nim](https:\u002F\u002Fgithub.com\u002FVindaar\u002Fllm.nim) 由 @[Vindaar](https:\u002F\u002Fgithub.com\u002FVindaar) 开发：该项目的 Nim 移植版本。\n\n## 讨论区\n\n开发组织方式：\n\n- 如果你在使用仓库时遇到具体问题，请使用 [Issues](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fissues)。\n- 如果你有代码要贡献，请打开一个 [PR](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fpulls)。\n- 如果你想讨论仓库、提问等，请查看 [Discussions](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fdiscussions)。\n- 如果需要更快速的交流，我在我的 [Zero to Hero Discord 频道](https:\u002F\u002Fdiscord.gg\u002F3zy8kqD9Cp) 创建了一个新的 `#llmc` 频道。\n\n## 许可证\n\nMIT","# llm.c 快速上手指南\n\nllm.c 是一个使用纯 C\u002FCUDA 实现的大型语言模型（LLM）训练项目，无需依赖庞大的 PyTorch 或 Python 环境。本项目专注于复现 GPT-2 和 GPT-3 系列模型的预训练过程，代码简洁高效，适合学习 CUDA 编程和理解 LLM 底层原理。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**: Linux (推荐 Ubuntu 20.04\u002F22.04) 或 macOS (Apple Silicon 支持 CPU 版本)\n- **编译器**: GCC 或 Clang (支持 C99 标准)\n- **GPU 训练**: NVIDIA GPU + CUDA Toolkit (建议 12.x 版本)\n- **多卡\u002F多机训练**: 需安装 OpenMPI 和 NCCL\n\n### 前置依赖\n确保已安装以下基础工具：\n```bash\n# Ubuntu\u002FDebian 示例\nsudo apt-get update\nsudo apt-get 
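install -y build-essential git wget curl\n\n# For GPU acceleration, install the CUDA Toolkit\n# For multi-GPU training, install MPI\nsudo apt-get install -y openmpi-bin libopenmpi-dev\n```\n\n> **Note**: the core logic of this project is C\u002FCUDA, but the data preprocessing scripts depend on Python. To process data yourself, install Python and the dependencies:\n> ```bash\n> pip install torch numpy transformers datasets tiktoken wandb tqdm\n> ```\n\n## Installation\n\n### 1. Clone the repository\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c.git\ncd llm.c\n```\n\n### 2. Download the starter pack (recommended for beginners)\nThe project provides a one-step script that downloads the pretrained weights, the tokenizer, and a test dataset (TinyShakespeare):\n```bash\nchmod u+x .\u002Fdev\u002Fdownload_starter_pack.sh\n.\u002Fdev\u002Fdownload_starter_pack.sh\n```\n*The script produces binary `.bin` files containing the GPT-2 (124M) weights, debug state, and tokenized data.*\n\n### 3. Build the project\nChoose a build target for your hardware:\n\n- **Single-GPU training (FP32 precision, most readable)**:\n  ```bash\n  make train_gpt2fp32cu\n  ```\n\n- **Single-GPU training (mixed precision, faster, requires cuDNN)**:\n  ```bash\n  make train_gpt2cu USE_CUDNN=1\n  ```\n\n- **CPU-only training (for demos or machines without a GPU)**:\n  ```bash\n  make train_gpt2\n  ```\n\n> **Debugging tip**: to set breakpoints in an IDE (e.g. VSCode), replace the optimization flag `-O3` with `-g` in the `make` command.\n\n## Basic usage\n\n### Scenario 1: quick single-GPU training (FP32)\nSuited to beginners learning how the CUDA kernels are implemented. Run the compiled binary to start training:\n```bash\n.\u002Ftrain_gpt2fp32cu\n```\nThe program loads the downloaded GPT-2 weights, fine-tunes on the TinyShakespeare dataset, and prints the training loss and generated text samples.\n\n### Scenario 2: CPU-only demo run\nFor users without an NVIDIA GPU (e.g. a MacBook with an M-series chip). Because it is slow, run only a few steps as a reference:\n```bash\n# Set the thread count (adjust to your CPU core count, e.g. 8)\nOMP_NUM_THREADS=8 .\u002Ftrain_gpt2\n```\n**Example of expected output**:\n```text\n[GPT-2]\nmax_seq_len: 1024\nvocab_size: 50257\n...\nstep 0: train loss 5.356189 (took 1452.121000 ms)\nstep 1: train loss 4.301069 (took 1288.673000 ms)\n...\ngenerating:\n---\nCome Running Away,\nGreater conquer With the Imperial blood...\n---\n```\n\n### Scenario 3: verify correctness (unit tests)\nBuild and run the test programs to check that the C\u002FCUDA implementation matches the results of the PyTorch reference implementation:\n```bash\n# CPU test\nmake test_gpt2\n.\u002Ftest_gpt2\n\n# GPU test (mixed precision)\nmake test_gpt2cu USE_CUDNN=1 && .\u002Ftest_gpt2cu\n```\nIf the output includes `overall okay: 1`, the test passed.\n\n### Advanced: custom datasets\nIf you are not using the starter pack, you can run the Python script manually to process the data:\n```bash\npython dev\u002Fdata\u002Ftinyshakespeare.py\n```\nThis generates
`tiny_shakespeare_train.bin` and `tiny_shakespeare_val.bin`, which the C program reads.","An embedded systems engineer wants to reproduce and fine-tune GPT-2 on a resource-constrained edge device to test whether a lightweight large language model is feasible.\n\n### Without llm.c\n- **Heavy dependency burden**: PyTorch (about 245MB) and a cPython environment must be installed, which is close to prohibitive for edge devices with tight storage and memory.\n- **Hard-to-debug black box**: with so many framework layers, numerical anomalies or performance bottlenecks during training are hard to trace line by line down to the underlying CUDA kernels.\n- **Steep learning curve**: understanding the low-level details of a Transformer implementation means working through high-level API abstractions, so the core algorithm cannot be learned directly from compact code.\n- **Complicated deployment and porting**: moving a trained model from the Python ecosystem into a pure C\u002FC++ production environment involves substantial format conversion and runtime adaptation work.\n\n### With llm.c\n- **Extremely lightweight**: it runs on pure C\u002FCUDA code alone, with no dependency on PyTorch or the Python interpreter, which markedly lowers the setup barrier and resource footprint.\n- **Transparent, controllable code**: the reference CPU implementation is condensed into a single C file of roughly 1,000 lines, and the engineer can step through the code in an IDE to inspect every computation.\n- **Teaching and practice in one**: reading the clear reference implementation is a fast way to understand GPT-2's forward and backward passes, which makes it excellent material for learning how large models work underneath.\n- **Smooth production integration**: because it is a native C\u002FCUDA implementation, the training code can be embedded into an existing high-performance computing pipeline with minor changes, removing the cross-language deployment gap.\n\nBy going back to the most basic programming languages, llm.c offers a minimal, dependency-free, transparent, and easily deployable option for large model research.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fkarpathy_llm.c_3b7423c7.png","karpathy","Andrej","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fkarpathy_75f033eb.jpg","I like to train Deep Neural Nets on large datasets.",null,"Stanford","andrej.karpathy@gmail.com","https:\u002F\u002Ftwitter.com\u002Fkarpathy","https:\u002F\u002Fgithub.com\u002Fkarpathy",[24,28,32,36,40,44,48],{"name":25,"color":26,"percentage":27},"Cuda","#3A4E3A",66.2,{"name":29,"color":30,"percentage":31},"Python","#3572A5",13.5,{"name":33,"color":34,"percentage":35},"C","#555555",12,{"name":37,"color":38,"percentage":39},"C++","#f34b7d",3.9,{"name":41,"color":42,"percentage":43},"Shell","#89e051",2.2,{"name":45,"color":46,"percentage":47},"Makefile","#427819",1.7,{"name":49,"color":50,"percentage":51},"Jupyter Notebook","#DA5B0B",0.5,29388,3491,"2026-04-05T21:18:21","MIT",4,"Linux, macOS","Training requires an NVIDIA GPU (with CUDA support); the CPU version runs without a GPU (e.g. on Apple Silicon). Multi-GPU requires NCCL, multi-node requires MPI. Flash Attention requires cuDNN.","Not specified (a fast multi-core CPU and plenty of RAM are recommended for CPU training)",{"notes":61,"python":62,"dependencies":63},"The core training code is pure C\u002FCUDA and needs no PyTorch runtime, but Python is required for data preprocessing and validation tests. Multi-GPU training requires OpenMPI and NCCL; enabling cuDNN acceleration requires separately installing the cuDNN library and
frontend headers. macOS users can run the CPU version as a demo, but it is slow.","Not specified (Python is needed to run the data preprocessing scripts and the reference implementation)",[64,65,66,67,68,69],"CUDA Toolkit","cuDNN (optional, for Flash Attention)","NCCL (multi-GPU training)","OpenMPI (multi-node training)","Make","GCC\u002FClang",[71,72],"语言模型","开发框架",2,"ready","2026-03-27T02:49:30.150509","2026-04-06T09:24:05.212705",[78,83,88,93,98,103],{"id":79,"question_zh":80,"answer_zh":81,"source_url":82},18671,"What should I do when running train_gpt2.cu fails with \"CUDA ERROR: an illegal memory access was encountered\"?","This error is usually caused by an array initialization problem. Try changing line 1205 of the code so that the array of generated tokens is initialized as:\nint gen_tokens[gen_max_length] = { 0, };\nAlso make sure you have pulled the latest code (in particular PR #122, which contains the fix); rebuilding afterwards usually resolves the problem.","https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fissues\u002F114",{"id":84,"question_zh":85,"answer_zh":86,"source_url":87},18672,"What should I do when multi-GPU training launched with mpirun hangs?","This is a known issue: the `common_start` function always hard-coded the GPU index to 0 and did not handle multi-GPU configurations correctly. Update to the latest code, which includes the fix. If the problem persists, check whether the program blocks on the first call to `malloc_and_point` when allocating activation memory; that usually points to a multi-GPU environment misconfiguration or a code version mismatch.","https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fissues\u002F369",{"id":89,"question_zh":90,"answer_zh":91,"source_url":92},18673,"How do I fix undefined \"__ushort_as_bfloat16\" and \"atomicAdd\" errors when running make train_gpt2cu on Ubuntu?","These errors usually mean the CUDA version is too old or lacks the BF16 (BFloat16) operations involved. Make sure the installed CUDA version is recent enough (CUDA 12.x is recommended) and that the nvcc compiler supports the `__nv_bfloat162` type and the related atomic operations. On an older CUDA version, you may need to disable the BF16 option or upgrade the CUDA Toolkit.","https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fissues\u002F359",{"id":94,"question_zh":95,"answer_zh":96,"source_url":97},18674,"Why does running .\u002Ftrain_gpt2.c report \"Error: must forward with targets before backward\"?","The error means that a forward pass with targets was not performed before the backward pass. It usually comes from a logic error in the training flow or a data loading problem. Check that the input data loads correctly and that the training script runs the forward pass first. If you have modified the code, confirm that the forward function is called with the correct target argument.","https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fissues\u002F19",{"id":99,"question_zh":100,"answer_zh":101,"source_url":102},18675,"How do I build the project with CUDA support on Windows?","The Windows build has been verified with CUDA 12.4 through GitHub Actions. You can build with MSBuild using the Solution\u002FProject files. Note that the initial
CUDA installation can be slow, and make sure the latest CUDA Toolkit (12.4) is installed. The Makefile has also been updated to support Windows builds, so you can use the make command directly (this requires a compatible make tool such as mingw-make or nmake).","https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllm.c\u002Fissues\u002F229",{"id":104,"question_zh":105,"answer_zh":106,"source_url":97},18676,"How can I improve training performance on the CPU?","CPU performance can be improved in the following ways:\n1. Use compiler optimization flags such as -Ofast or -O3.\n2. Apply dedicated LLVM optimization passes that target the functions and operations in train_gpt2.c directly.\n3. See the related PRs (such as #200 and #168) for concrete CPU optimization details.\nThe community is discussing merging more optimizations into the main branch.",[],[109,120,128,136,144,157],{"id":110,"name":111,"github_repo":112,"description_zh":113,"stars":114,"difficulty_score":115,"last_commit_at":116,"category_tags":117,"status":74},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui is a web interface built on Gradio that lets users run and use the Stable Diffusion image generation models locally with little effort. It addresses the pain points of the original models, which depend on the command line, have a high barrier to entry, and spread their features across separate tools, by folding the AI image workflow into one intuitive graphical platform.\n\nCasual creators who want a quick start, designers who need fine control over image details, and developers and researchers exploring what the models can do can all benefit. Its main strength is breadth of features: besides the basic modes of text-to-image, image-to-image, inpainting, and outpainting, it offers advanced features such as attention adjustment, prompt matrix, negative prompts, and 'Highres. fix'. It also ships with face restoration tools such as GFPGAN and CodeFormer, supports several neural network upscalers, and can be extended through its extension system. For devices with limited VRAM, stable-diffusion-webui provides matching optimization options, putting high-quality AI art within reach.",162132,3,"2026-04-05T11:01:52",[72,118,119],"图像","Agent",{"id":121,"name":122,"github_repo":123,"description_zh":124,"stars":125,"difficulty_score":73,"last_commit_at":126,"category_tags":127,"status":74},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code is a performance optimization system built for AI coding assistants (such as Claude Code, Codex, and Cursor). More than a set of configuration files, it is a complete framework refined through long practical use, aimed at the core problems AI agents face in real development: low efficiency, lost memory, security risks, and the lack of continuous learning.\n\nBy introducing modular skills, instinct enhancement, memory persistence, and built-in security scanning, everything-claude-code can noticeably improve how AI performs on complex tasks and helps developers build more stable, more capable production-grade AI agents. Its 'research-first' development approach and its token-usage optimizations make model responses faster and cheaper while defending against potential attack vectors.\n\nThe toolkit is especially suited to software developers, AI researchers, and technical teams that want to customize their AI workflows in depth. Whether you are building a large codebase or need AI help with security audits and automated testing, everything-claude-code provides solid underlying support. As an open-source project that won an Anthropic hackathon award, it combines multi-language support with a rich set of field-tested hooks, so that the AI
truly grows into one that understands",140436,"2026-04-05T23:32:43",[72,119,71],{"id":129,"name":130,"github_repo":131,"description_zh":132,"stars":133,"difficulty_score":73,"last_commit_at":134,"category_tags":135,"status":74},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI is a powerful, highly modular visual AI engine built for designing and running complex Stable Diffusion image generation pipelines. Instead of writing code, users work in an intuitive node-based graph interface and build a custom generation pipeline by connecting functional modules.\n\nThis design addresses the pain points of advanced AI image workflows, which are complex to configure and inflexible. Without a programming background, users can combine models freely, adjust parameters, and preview results in real time, covering tasks from basic text-to-image to multi-step high-resolution refinement. ComfyUI is broadly compatible: it runs on Windows, macOS, and Linux, supports hardware from NVIDIA, AMD, Intel, and Apple Silicon, and was early to support recent models such as SDXL, Flux, and SD3.\n\nResearchers and developers who want to explore what the algorithms can do, as well as designers and experienced AI art enthusiasts who want maximum creative freedom, are all well served by ComfyUI. Its modular architecture lets the community keep adding features, making it one of the most flexible open-source diffusion model tools with one of the richest ecosystems, and helping users turn ideas into results efficiently.",107662,"2026-04-03T11:11:01",[72,118,119],{"id":137,"name":138,"github_repo":139,"description_zh":140,"stars":141,"difficulty_score":73,"last_commit_at":142,"category_tags":143,"status":74},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat is a lightweight, fast AI assistant that gives users a smooth, cross-platform way to interact with large models. It addresses two pain points: keeping a conversation going when switching between devices, and managing many AI models in one place. For everyday work, study support, or creative brainstorming, NextChat lets users reach AI services anywhere through the web, iOS, Android, Windows, MacOS, or Linux.\n\nThe tool suits general users, students, professionals, and enterprise teams that need private deployment. For developers, it also offers a convenient self-hosting path with one-click deployment to platforms such as Vercel or Zeabur.\n\nNextChat's main strength is broad model compatibility: it natively supports mainstream large models such as Claude, DeepSeek, GPT-4, and Gemini Pro, so users can switch between AI capabilities in a single interface. It was also early to support the MCP (Model Context Protocol) protocol, which strengthens context handling. For enterprise users, NextChat offers a professional edition with brand customization, fine-grained permission control, internal knowledge base integration, and security auditing, meeting strict requirements for data privacy and custom management.",87618,"2026-04-05T07:20:52",[72,71],{"id":145,"name":146,"github_repo":147,"description_zh":148,"stars":149,"difficulty_score":73,"last_commit_at":150,"category_tags":151,"status":74},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners is a structured introductory machine learning course from Microsoft that helps complete beginners learn classic machine learning. The course is planned as a 12-week path with 26 concise lessons and 52 accompanying quizzes, covering the full flow from basic concepts to practical application, and it addresses the beginner's problem of facing a large body of knowledge with no structured guidance.\n\nDevelopers looking to change direction, researchers who need more background in algorithms, and enthusiasts who are simply curious about AI can all benefit. The course offers clear explanations of theory and also emphasizes hands-on practice, so learners build solid skills step by step. A distinctive feature is its strong multilingual support: an automated process provides versions in more than 50
languages, including Simplified Chinese, which greatly lowers the barrier for learners around the world. The project is also developed as an open-source collaboration, with an active community and continuously updated content, so learners get current and accurate technical material. If you are looking for a clear, friendly, and professional way into machine learning, ML-For-Beginners is an ideal starting point.",84991,"2026-04-05T10:45:23",[118,152,153,154,119,155,71,72,156],"数据工具","视频","插件","其他","音频",{"id":158,"name":159,"github_repo":160,"description_zh":161,"stars":162,"difficulty_score":115,"last_commit_at":163,"category_tags":164,"status":74},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow is a leading open-source retrieval-augmented generation (RAG) engine that aims to build a more accurate and reliable context layer for large language models. It combines current RAG techniques with agent capabilities, so it can extract knowledge efficiently from many kinds of documents and also let the model reason and carry out tasks based on that knowledge.\n\nHallucination and stale knowledge are common pain points in large model applications. By deeply parsing complex document structures (such as tables, charts, and mixed layouts), RAGFlow markedly improves retrieval accuracy, which reduces fabricated answers and keeps responses grounded and current. Its built-in agent mechanism goes further: the system can answer questions and also plan steps on its own to solve complex problems.\n\nThe tool is especially suited to developers, enterprise technical teams, and AI researchers. Those who want to stand up a private knowledge base Q&A system quickly, and those exploring how to apply large models in vertical domains, can both benefit. RAGFlow provides a visual workflow orchestration interface and flexible APIs, which lowers the barrier for users without an algorithms background and also meets professional developers' needs for deep customization. As a project open-sourced under the Apache 2.0 license, it is becoming an important bridge between general-purpose large models and industry-specific knowledge.",77062,"2026-04-04T04:44:48",[119,118,72,71,155]]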