[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-karpathy--llama2.c":3,"tool-karpathy--llama2.c":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",155373,2,"2026-04-14T11:34:08",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 
人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":77,"owner_email":78,"owner_twitter":76,"owner_website":79,"owner_url":80,"languages":81,"stars":102,"forks":103,"last_commit_at":104,"license":105,"difficulty_score":106,"env_os":107,"env_gpu":108,"env_ram":109,"env_deps":110,"category_tags":117,"github_topics":76,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":118,"updated_at":119,"faqs":120,"releases":156},7448,"karpathy\u002Fllama2.c","llama2.c","Inference Llama 2 in one file of pure C","llama2.c 是一个极简的开源项目，旨在用纯 C 语言实现 Llama 2 大模型的推理功能。它解决了传统大模型部署依赖复杂、环境配置繁琐的痛点，将原本需要庞大框架支持的推理过程浓缩为一个仅约 700 行代码的单文件（run.c），无需任何外部依赖即可编译运行。\n\n该项目特别适合开发者、研究人员以及希望深入理解大模型底层原理的教育者使用。通过它，用户可以在资源受限的环境甚至嵌入式设备上轻松运行小型 Llama 2 模型，快速验证想法或进行教学演示。虽然目前主要支持浮点精度（fp32）的小型模型（如 15M 或 42M 参数版本），但其架构与 Meta 官方的 Llama 2 完全一致，具备极高的参考价值。\n\nllama2.c 的核心亮点在于“极致简约”与“教育意义”。作者受 llama.cpp 启发，但更进一步硬编码了模型架构，摒弃了所有冗余库，让使用者能直接透过代码看清 Transformer 模型的运作机制。配合提供的训练脚本，它还构成了一个完整的“训练 + 推理”全栈解决方案，让用户不仅能跑通模型，还能亲手训练专属的小型故事生成模型，体验从零构建大模型的乐趣。","## llama2.c\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fkarpathy_llama2.c_readme_9cb9c66abd3d.jpg\" width=\"300\" height=\"300\" alt=\"Cute Llama\">\n\u003C\u002Fp>\n\nHave you ever wanted to inference a baby [Llama 2](https:\u002F\u002Fai.meta.com\u002Fllama\u002F) model in pure C? No? Well, now you can!\n\nTrain the Llama 2 LLM architecture in PyTorch then inference it with one simple 700-line C file ([run.c](run.c)). You might think that you need many billion parameter LLMs to do anything useful, but in fact very small LLMs can have surprisingly strong performance if you make the domain narrow enough (ref: [TinyStories](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Froneneldan\u002FTinyStories) paper). This repo is a \"fullstack\" train + inference solution for Llama 2 LLM, with focus on minimalism and simplicity.\n\nAs the architecture is identical, you can also load and inference Meta's Llama 2 models. However, the current code only inferences models in fp32, so you will most likely not be able to productively load models larger than 7B. 
Work on model quantization is currently ongoing.\n\nPlease note that this repo started recently as a fun weekend project: I took my earlier [nanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT), tuned it to implement the Llama-2 architecture instead of GPT-2, and the meat of it was writing the C inference engine in [run.c](run.c). So the project is young and moving quickly. Hat tip to the awesome [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp) for inspiring this project. Compared to llama.cpp, I wanted something super simple, minimal, and educational so I chose to hard-code the Llama 2 architecture and just roll one inference file of pure C with no dependencies.\n\n## feel the magic\n\n[![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fkarpathy\u002Fllama2.c\u002Fblob\u002Fmaster\u002Frun.ipynb)\n\nFirst, navigate to the folder where you keep your projects and clone this repository to this folder:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllama2.c.git\n```\n\nThen, open the repository folder:\n\n```bash\ncd llama2.c\n```\n\nNow, let's just run a baby Llama 2 model in C. You need a model checkpoint. Download this 15M parameter model I trained on the [TinyStories](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Froneneldan\u002FTinyStories) dataset (~60MB download):\n\n```bash\nwget https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Fresolve\u002Fmain\u002Fstories15M.bin\n```\n\nCompile and run the C code:\n\n```bash\nmake run\n.\u002Frun stories15M.bin\n```\n\nYou'll see the text stream a sample. On my M1 MacBook Air this runs at ~110 tokens\u002Fs. See [performance](#performance) or the Makefile for compile flags that can significantly speed this up. We can also try a bit bigger 42M parameter model:\n\n```bash\nwget https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Fresolve\u002Fmain\u002Fstories42M.bin\n.\u002Frun stories42M.bin\n```\n\nThis still runs at interactive rates and samples more coherent and diverse stories:\n\n> Once upon a time, there was a little girl named Lily. She loved playing with her toys on top of her bed. One day, she decided to have a tea party with her stuffed animals. She poured some tea into a tiny teapot and put it on top of the teapot. Suddenly, her little brother Max came into the room and wanted to join the tea party too. Lily didn't want to share her tea and she told Max to go away. Max started to cry and Lily felt bad. She decided to yield her tea party to Max and they both shared the teapot. But then, something unexpected happened. The teapot started to shake and wiggle. Lily and Max were scared and didn't know what to do. Suddenly, the teapot started to fly towards the ceiling and landed on the top of the bed. Lily and Max were amazed and they hugged each other. They realized that sharing was much more fun than being selfish. From that day on, they always shared their tea parties and toys.\n\nYou can also prompt the model with a prefix or a number of additional command line arguments, e.g. to sample at temperature 0.8 for 256 steps and with a prompt:\n\n```bash\n.\u002Frun stories42M.bin -t 0.8 -n 256 -i \"One day, Lily met a Shoggoth\"\n```\n\n> One day, Lily met a Shoggoth. He was very shy, but was also very generous. Lily said “Hello Shoggy! 
Can I be your friend?” Shoggy was happy to have a friend and said “Yes, let’s explore the universe together!” So they set off on a journey to explore the universe. As they travelled, Shoggy was happy to explain to Lily about all the wonderful things in the universe. At the end of the day, Lily and Shoggy had gathered lots of wonderful things from the universe, and they both felt very proud. They promised to explore the universe as one big pair and to never stop being generous to each other.\n\nThere is also an even better 110M param model available, see [models](#models).\n\nQuick note on sampling, the recommendation for ~best results is to sample with `-t 1.0 -p 0.9`, i.e. temperature 1.0 (default) but also top-p sampling at 0.9 (default). Intuitively, top-p ensures that tokens with tiny probabilities do not get sampled, so we can't get \"unlucky\" during sampling, and we are less likely to go \"off the rails\" afterwards. More generally, to control the diversity of samples use either the temperature (i.e. vary `-t` between 0 and 1 and keep top-p off with `-p 0`) or the top-p value (i.e. vary `-p` between 0 and 1 and keep `-t 1`), but not both. Nice explainers on LLM sampling strategies include [this](https:\u002F\u002Fpeterchng.com\u002Fblog\u002F2023\u002F05\u002F02\u002Ftoken-selection-strategies-top-k-top-p-and-temperature\u002F), [this](https:\u002F\u002Fdocs.cohere.com\u002Fdocs\u002Fcontrolling-generation-with-top-k-top-p) or [this](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fhow-to-generate).\n\n## Meta's Llama 2 models\n\nAs the neural net architecture is identical, we can also inference the Llama 2 models released by Meta. Sadly there is a bit of friction here due to licensing (I can't directly upload the checkpoints, I think). So Step 1, get the Llama 2 checkpoints by following the [Meta instructions](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fllama). Once we have those checkpoints, we have to convert them into the llama2.c format.\nFor this we need to install the python dependencies (`pip install -r requirements.txt`) and then use the `export.py` file, e.g. for 7B model:\n\n```bash\npython export.py llama2_7b.bin --meta-llama path\u002Fto\u002Fllama\u002Fmodel\u002F7B\n```\n\nThe export will take ~10 minutes or so and generate a 26GB file (the weights of the 7B model in float32) called `llama2_7b.bin` in the current directory. It has been [reported](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllama2.c\u002Fpull\u002F85) that despite efforts. I would not attempt to run anything above 7B right now for two reasons: first, 13B+ currently doesn't work because of integer flow in pointer arithmetic, which is yet to be fixed, and second, even if it were fixed, this repo is doing float32 inference right now, so it would be fairly unusably slow. Once the export is done, we can run it:\n\n```bash\n.\u002Frun llama2_7b.bin\n```\n\nThis ran at about 4 tokens\u002Fs compiled with [OpenMP](#OpenMP) on 96 threads on my CPU Linux box in the cloud. (On my MacBook Air M1, currently it's closer to 30 seconds per token if you just build with `make runfast`.) Example output:\n\n> The purpose of this document is to highlight the state-of-the-art of CoO generation technologies, both recent developments and those in commercial use. The focus is on the technologies with the highest merit to become the dominating processes of the future and therefore to be technologies of interest to S&amp;T ... R&amp;D. 
As such, CoO generation technologies developed in Russia, Japan and Europe are described in some depth. The document starts with an introduction to cobalt oxides as complex products and a short view on cobalt as an essential material. The document continues with the discussion of the available CoO generation processes with respect to energy and capital consumption as well as to environmental damage.\n\nbase models... ¯\\\\_(ツ)_\u002F¯. Since we can inference the base model, it should be possible to also inference the chat model quite easily, and have a conversation with it. And if we can find a way to run 7B more efficiently, we can start adding LoRA to our training script, and going wild with finetunes all within the repo!\n\nYou can also chat with the Llama Chat models. Export the chat model exactly as above:\n\n```bash\npython export.py llama2_7b_chat.bin --meta-llama \u002Fpath\u002Fto\u002F7B-chat\n```\n\nThen chat with it by specifying the chat mode using the `-m` flag, e.g.:\n\n```bash\n.\u002Frun llama2_7b_chat.bin -m chat\n```\n\nYou can also try Meta's Code Llama models even if support for them is incomplete. In particular, some hyperparameters changed (e.g. the constant in RoPE layer), so the inference is not exactly correct and a bit buggy right now. Looking into fixes. Make sure to build the tokenizer for the plain and instruct variants and pass it when doing inference.\n\n```bash\npython export.py codellama2_7b.bin --meta-llama \u002Fpath\u002Fto\u002FCodeLlama-7b\npython tokenizer.py --tokenizer-model=\u002Fpath\u002Fto\u002FCodeLlama-7b\u002Ftokenizer.model\n.\u002Frun codellama2_7b.bin -z \u002Fpath\u002Fto\u002FCodeLlama-7b\u002Ftokenizer.bin\n```\n\nChat with Code Llama Instruct:\n\n```bash\npython export.py codellama2_7b_instruct.bin --meta-llama \u002Fpath\u002Fto\u002FCodeLlama-7b-Instruct\npython tokenizer.py --tokenizer-model=\u002Fpath\u002Fto\u002FCodeLlama-7b-Instruct\u002Ftokenizer.model\n.\u002Frun codellama2_7b_instruct.bin -m chat -z \u002Fpath\u002Fto\u002FCodeLlama-7b-Instruct\u002Ftokenizer.bin\n```\n\n## int8 quantization\n\nThe (default) script [run.c](run.c), above, uses a float32 forward pass, where the entire calculation of the forward pass is kept in fp32. This is very easy to understand as far as reference code goes, but it has the following downsides: the model checkpoint files are very large (it takes 4 bytes per every individual weight), and the forward pass is relatively slow. The (very) common inference optimization employed in practice is to quantize the model parameters to lower precision, giving up a little bit of correctness in return for smaller checkpoint sizes and faster forward passes (as most of the inference uses integer arithmetic). Empirically, LLMs can tolerate precisions as low as 4-bit (or even lower), but we use int8 here because it is a \"safe\" setting that gets us the benefits but doesn't sacrifice too much of the model accuracy. Only the weights that participate in matmuls are quantized. All the other parameters (e.g. especially the scale and bias in RMSNorm) are kept in float32, because these layers are very sensitive. Now, if all you're after is reduction in checkpoint sizes, you could quantize the weights, save the checkpoint, and then dequantize them in run.c, and do float32 inference as normal and call it a day. This is totally fine. But here, we go one step further (as is standard practice) and additionally quantize the activations in the forward pass. 
This requires us to dynamically quantize and dequantize between float32 and int8 at runtime, which adds overhead. But the benefit is that now the majority of the calculations (the matmuls especially!) are using pure integer arithmetic, where both weights and activations enter as int8. This is where the speedups can fundamentally come from. The version we use is the \"Q8_0\" quantization (llama.cpp terminology), where the 0 means that the weight quantization is symmetric around 0, quantizing to the range [-127, 127].\n\nThe quantized forward pass is implemented in [runq.c](runq.c). To use it, we have to export the model in the quantized format. For example, the float32 version of Llama 2 7B was exported as:\n\n```\npython export.py llama2_7b.bin --meta-llama path\u002Fto\u002Fllama\u002Fmodel\u002F7B\n```\n\nThis creates a 26GB file, because each one of 7B parameters is 4 bytes (fp32). To export it quantized, we instead use version 2 export:\n\n```\npython export.py llama2_7b_q80.bin --version 2 --meta-llama path\u002Fto\u002Fllama\u002Fmodel\u002F7B\n```\n\nThis runs for a few minutes, but now creates only a 6.7GB file. For exporting non-meta checkpoints you would use the --checkpoint arg instead of --meta-llama arg (more docs on this later, below). Now let's inference them. I like to use OMP here because these are big models, so e.g. on my Linux box:\n\n```\nmake runomp\nOMP_NUM_THREADS=64 .\u002Frun llama2_7b.bin -n 40\nOMP_NUM_THREADS=64 .\u002Frunq llama2_7b_q80.bin -n 40\n```\n\nThis runs 40 steps just to get a timing. The float32 version for me runs at 4.6 tok\u002Fs, and the int8 version at 14 tok\u002Fs. So we achieved a 3X speedup while reducing the checkpoint size by 4X. However, the forward pass is quantized to int8, and therefore silently very slightly lower quality.\n\n## huggingface models\n\nWe can load any huggingface models that use the Llama 2 architecture. See the script [export.py](export.py) and the `--hf` flag to export the model .bin file.\n\n## models\n\nFor the sake of examples of smaller, from-scratch models, I trained a small model series on TinyStories. All of these trained in a few hours on my training setup (4X A100 40GB GPUs). The 110M took around 24 hours. I am hosting them on huggingface hub [tinyllamas](https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas), both in the original PyTorch .pt, and also in the llama2.c format .bin:\n\n| model | dim | n_layers | n_heads | n_kv_heads | max context length | parameters | val loss | download\n| --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| 260K | 64 | 5 | 8 | 4 | 512 | 260K | 1.297 | [stories260K](https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Ftree\u002Fmain\u002Fstories260K)\n| OG | 288 | 6 | 6 | 6 | 256 | 15M | 1.072 | [stories15M.bin](https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Fresolve\u002Fmain\u002Fstories15M.bin) |\n| 42M| 512 | 8 | 8 | 8 | 1024 | 42M | 0.847 | [stories42M.bin](https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Fresolve\u002Fmain\u002Fstories42M.bin) |\n| 110M| 768 | 12 | 12 | 12 | 1024 | 110M | 0.760 | [stories110M.bin](https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Fresolve\u002Fmain\u002Fstories110M.bin) |\n\nYou'll notice that the 110M model is equivalent to GPT-1 in size. Alternatively, this is also the smallest model in the GPT-2 series (`GPT-2 small`), except the max context length is only 1024 instead of 2048. 
The only notable changes from GPT-1\u002F2 architecture is that Llama uses RoPE relatively positional embeddings instead of absolute\u002Flearned positional embeddings, a bit more fancy SwiGLU non-linearity in the MLP, RMSNorm instead of LayerNorm, bias=False on all Linear layers, and is optionally multiquery.\n\n## training\n\nLet's see how we can train a baby Llama 2 from scratch using the code in this repo. First let's download and pretokenize some source dataset, e.g. I like [TinyStories](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Froneneldan\u002FTinyStories) so this is the only example currently available in this repo. But it should be very easy to add datasets, see the code.\n\n```bash\npython tinystories.py download\npython tinystories.py pretokenize\n```\n\nThen train our model:\n\n```bash\npython train.py\n```\n\n**brief training guide**. See the train.py script for more exotic launches and hyperparameter overrides. Here is a brief guide to how to set the parameters. Look at the table at the very end of the [Chinchilla paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.15556) to get a sense of how the Transformer parameters (dim, n_layers, n_heads) grow or shrink together. Extrapolate\u002Finterpolate this pattern to get bigger or smaller transformers. Set the max context length however you wish, depending on the problem: this should be the max number of tokens that matter to predict the next token. E.g. Llama 2 uses 2048. Next, you want the _total_ batch size per update (printed by the script as \"tokens per iteration will be:\") to be somewhere around 100K tokens for medium-sized applications. For tiny applications it could be lower, for large training (e.g. GPTs\u002FLLamas) it is usually ~0.5M, or even more. You get there by first maxing out the batch_size to whatever your system allows (e.g. mine was 16 in a recent run because after that my GPU runs out of memory), and then you want to increase gradient_accumulation_steps to be as high as necessary to reach the total batch size of ~100K. Finally, you want to tune your learning_rate (LR). You want this to be as high as your training allows. Very small networks can get away with a large LR (e.g. 1e-3 or even higher). Large networks need lower LRs. 3e-4 is a safe choice in most medium-sized applications, but can be too low for small networks, so try to increase it! Finally, max_iters is the length of training. Play with different settings. I mostly only ever tune these parameters and leave most of the others unchanged. Here is an example of how I trained the 110M model, which I don't think is anywhere near optimal, but looked sensible to me: dim 768, n_layers 12, n_heads 12 (so size of each head is 768 \u002F 12 = 64 channels), seq len of 1024, batch size 16 (this is the most that fit my A100 40GB GPU), gradient_accumulation_steps = 8 was needed to get total tokens batch size to be 16 batch size * 1024 tokens in sequence * 8 grad_accum = 131,072 tokens per update. Good. Learning rate 4e-4 (probably a little too low). max_iters 200K (probably a bit too high). Dropout 0.1, as that usually helps a bit at medium size. That was it. 
I ran using Distributed Data Parallel (DDP) on 4 GPUs on my cloud machine, training took ~day or so.\n\nTotally understand if you want to skip model training, for simple demo just download one of the pretrained models (see [models](#models) section), e.g.:\n\n```bash\nwget https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Fresolve\u002Fmain\u002Fstories15M.bin\n```\n\nOnce we have the model.bin file, we can inference in C. Compile the C code first:\n\n```bash\nmake run\n```\n\nYou can now run it simply as\n\n```bash\n.\u002Frun stories15M.bin\n```\n\nWatch the tokens stream by, fun! We can also run the PyTorch inference script for a comparison. Download one of the models again from huggingface hub and point the `sample.py` script at it:\n\n```bash\nwget https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Fresolve\u002Fmain\u002Fstories15M.pt -P out15M\npython sample.py --checkpoint=out15M\u002Fstories15M.pt\n```\n\nWhich gives the same results.\n\n## custom tokenizers\n\nIn everything above, we've assumed the custom Lllama 2 tokenizer with 32,000 tokens. However, in many boutique LLMs, using vocabulary this big might be an overkill. If you have a small application you have in mind, you might be much better off training your own tokenizers. This can make everything nicer - with smaller vocabs your model has fewer parameters (because the token embedding table is a lot smaller), the inference is faster (because there are fewer tokens to predict), and your average sequence length per example could also get smaller (because the compression is a lot more efficient on your data). So let's see how we train a custom tokenizer.\n\nBy default, to pretokenize the tinystories dataset we had to run, in order:\n\n```\npython tinystories.py download\npython tinystories.py pretokenize\n```\n\nThe `pretokenize` stage here loads the Llama 2 tokenizer (vocab size 32,000) and uses it to convert the downloaded text into integers, and saves that to file. We now change this as follows, to train an example 4096-token tokenizer:\n\n```\npython tinystories.py download\npython tinystories.py train_vocab --vocab_size=4096\npython tinystories.py pretokenize --vocab_size=4096\n```\n\nThe `train_vocab` stage will call the `sentencepiece` library to train the tokenizer, storing it in a new file `data\u002Ftok4096.model`. I tried to reproduce as well as I could the settings that (I think) Meta used to train their vocabulary. This uses the Byte Pair Encoding algorithm that starts out with raw utf8 byte sequences of the text data and then iteratively merges the most common consecutive pairs of tokens to form the vocabulary. Inspect the `tinystories.py` file - the custom tokenizers are stored in a special directory structure indexed by the vocab size.\n\nA quick note of interest is that vocab size of 4096 trained specifically on tinystories creates integer sequences with about the same sequence length per example as the default Llama 2 tokenizer of 32000 tokens! This means that our custom, tailored tokenizer is a lot better adapted to our specific text, and can compress it very effectively. So our trained models are smaller and faster.\n\nNow that we have pretokenized the dataset with our custom tokenizer, we can train the model. The training script `train.py` doesn't care about the exact tokens, it only cares about the vocabulary size so it can correctly initialize the model. 
So when training your model, make sure to pass in\n\n```\npython train.py --vocab_source=custom --vocab_size=4096\n```\n\n(The defaults are `llama2` and `32000` respectively, which indicates the default Llama 2 tokenizer). This trains the model. Finally we are ready to run inference with our `run.c` script. For that we need two things. Number one, we have to export our tokenizer in the `.bin` format, do that with:\n\n```\npython tokenizer.py --tokenizer-model=data\u002Ftok4096.model\n```\n\nThis writes the tokenizer to `data\u002Ftok4096.bin`. Now we can run inference, pointing it to this tokenizer using the `-z` flag:\n\n```\n.\u002Frun out\u002Fmodel.bin -z data\u002Ftok4096.bin\n```\n\nThis should print the samples. If you leave out the `-z` flag, it will use the default Llama 2 tokenizer, which would generate a good sequence of integers, but they would get translated using a different vocabulary to text, so it would look like gibberish.\n\n## performance\n\nThere are many ways to potentially speed up this code depending on your system. Have a look at the [Makefile](Makefile), which contains a lot of notes. The `make run` command currently uses the `-O3` optimization by default, i.e.:\n\n```bash\ngcc -O3 -o run run.c -lm\n```\n\n-O3 includes optimizations that are expensive in terms of compile time and memory usage. Including vectorization, loop unrolling, and predicting branches.\n\nTo get a much better performance, try to compile with `make runfast`. This turns on the `-Ofast` flag, which includes additional optimizations that may break compliance with the C\u002FIEEE specifications, in addition to `-O3`. See [the GCC docs](https:\u002F\u002Fgcc.gnu.org\u002Fonlinedocs\u002Fgcc\u002FOptimize-Options.html) for more information.\n\nTry `-march=native` to compile the program to use the architecture of the machine you're compiling on rather than a more generic CPU. This may enable additional optimizations and hardware-specific tuning such as improved vector instructions\u002Fwidth.\n\nThe fastest throughput I saw so far on my MacBook Air (M1) so far is with `make runfast`.\n\nYou can also experiment with replacing `gcc` with `clang`.\n\nIf compiling with gcc, try experimenting with `-funroll-all-loops`, see PR [#183](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllama2.c\u002Fpull\u002F183)\n\n**OpenMP**. Big improvements can also be achieved by compiling with OpenMP, which \"activates\" the `#pragma omp parallel for` inside the matmul and attention, allowing the work in the loops to be split up over multiple processors.\nYou'll need to install the OpenMP library and the clang compiler first (e.g. `apt install clang libomp-dev` on ubuntu). Then you can compile with `make runomp`, which does:\n\n```bash\nclang -Ofast -fopenmp -march=native run.c  -lm  -o run\n```\n\nWhen you run inference make sure to use OpenMP flags to set the number of threads, e.g.:\n\n```bash\nOMP_NUM_THREADS=4 .\u002Frun out\u002Fmodel.bin\n```\n\nDepending on your system resources you may want to tweak these hyperparameters and use more threads. But more is not always better, usually this is a bit U shaped. In particular, if your CPU has SMT (multithreading), try setting the number of threads to the number of physical cores rather than logical cores. The performance difference can be large due to cache thrashing and communication overhead. 
The PyTorch documentation [CPU specific optimizations\n](https:\u002F\u002Fpytorch.org\u002Ftutorials\u002Frecipes\u002Frecipes\u002Ftuning_guide.html#cpu-specific-optimizations) has some good information that applies here too.\n\n## platforms\n\nOn **Windows**, use `build_msvc.bat` in a Visual Studio Command Prompt to build with msvc, or you can use `make win64` to use mingw compiler toolchain from linux or windows to build the windows target. MSVC build will automatically use openmp and max threads appropriate for your CPU unless you set `OMP_NUM_THREADS` env.\n\nOn **Centos 7**, **Amazon Linux 2018** use `rungnu` Makefile target: `make rungnu` or `make runompgnu` to use openmp.\n\nOn **Mac**, use clang from brew for openmp build. Install clang as `brew install llvm` and use the installed clang binary to compile with openmp: `make runomp CC=\u002Fopt\u002Fhomebrew\u002Fopt\u002Fllvm\u002Fbin\u002Fclang`\n\n## tests\n\nYou can run tests simply with pytest:\n\n```bash\n$ pip install pytest\n$ pytest\n```\n\nThis will currently invoke two tests inside `test_all.py`, which forward the model in both C and Python for 200 steps and check the output against a known good expected output. The tests currently run in only a few seconds, but will have to download and cache the stories260K models in a temporary `test` directory (only ~2MB download).\n\nThere are also some tests in C, in the file [test.c](test.c). You can run these with `make testcc`, or to see more stuff printed:\n\n```\nmake testcc VERBOSITY=1\n```\n\nCall for help: help add more tests.\n\n## ack\n\nI trained the llama2.c storyteller models on a 4X A100 40GB box graciously provided by the excellent [Lambda labs](https:\u002F\u002Flambdalabs.com\u002Fservice\u002Fgpu-cloud), thank you.\n\n## discord\n\nFigured it's possible to reuse my existing discord channel (that I use for my [zero to hero youtube series](https:\u002F\u002Fkarpathy.ai\u002Fzero-to-hero.html)), see #llama2c channel on [discord](https:\u002F\u002Fdiscord.gg\u002F3zy8kqD9Cp), for any quick questions, related discussions, etc.\n\n## contributing\n\nA few words on this repo and the kinds of PRs that are likely to be accepted. What is the goal of this repo? Basically I think there will be a lot of interest in training or finetuning custom micro-LLMs (think ~100M - ~1B params, but let's say up to ~10B params) across a large diversity of applications, and deploying them in edge-adjacent environments (think MCUs, phones, web browsers, laptops, etc.). I'd like this repo to be the simplest, smallest, most hackable repo to support this workflow, both training and inference. In particular, this repo is not a complex framework with a 1000 knobs controlling inscrutible code across a nested directory structure of hundreds of files. Instead, I expect most applications will wish to create a fork of this repo and hack it to their specific needs and deployment platforms.\n\nPeople who care about deployment efficiency above all else should look at [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp). This repo still cares about efficiency, but not at the cost of simplicity, readability or portability. Basically, I expect that a lot of people come to this repo because the training code is 2 readable .py files and the inference code is 500 lines of C. So I'd like this to continue to be a kind of simplest \"reference implementation\" that can be easily hacked in a separate fork into whatever downstream application people are excited about. It shouldn't be full-featured. 
It shouldn't take 100 different options or settings. It shouldn't be the most efficient. A few examples:\n\n- someone re-ordered two loops to improve data locality for a small efficieny win => instant merge.\n- someone added the one line \"pragma omp parallel for\", which allows you to compile with OpenMP and dramatically speed up the code, or acts as just a comment if you don't compile it that way => instant merge.\n- bug fixes and touchups etc. => happy to merge\n\nA few examples of PRs are that are not an excellent fit:\n\n- adding more than several #ifdefs all over the place in code. If they are localized \u002F few, might be okay.\n- adding a lot of code that is very specific to some specific platform (e.g. MCUs, or some special version of linux or processor). These may be a better fit for forks of the project, and I am very happy to maintain a list of these forks in section below.\n- adding hundreds of lines of code to run.c that are only active in specific scenarios or platforms.\n\nIf your candidate PRs have elements of these it doesn't mean they won't get merged, it just means they will make it into the gray territory. TLDR: I am eager to merge any mostly small, mostly localized, broadly applicable, clean changes that improve the efficiency and portability of the repo, while keep its hackability and readability. I appreciate all PRs seeking to help me improve the project, thank you! \u003C3.\n\n## notable forks\n\n- Rust\n  - [llama2.rs](https:\u002F\u002Fgithub.com\u002Fgaxler\u002Fllama2.rs) by @[gaxler](https:\u002F\u002Fgithub.com\u002Fgaxler): a Rust port of this project\n  - [llama2.rs](https:\u002F\u002Fgithub.com\u002Fleo-du\u002Fllama2.rs) by @[leo-du](https:\u002F\u002Fgithub.com\u002Fleo-du): A Rust port of this project\n  - [llama2-rs](https:\u002F\u002Fgithub.com\u002Fdanielgrittner\u002Fllama2-rs) by @[danielgrittner](https:\u002F\u002Fgithub.com\u002Fdanielgrittner): a Rust port of this project\n  - [llama2.rs](https:\u002F\u002Fgithub.com\u002Flintian06\u002Fllama2.rs) by @[lintian06](https:\u002F\u002Fgithub.com\u002Flintian06): A Rust port of this project\n  - [pecca.rs](https:\u002F\u002Fgithub.com\u002Frahoua\u002Fpecca-rs) by @[rahoua](https:\u002F\u002Fgithub.com\u002Frahoua): A Rust port leveraging [ndarray](https:\u002F\u002Fgithub.com\u002Frust-ndarray\u002Fndarray), supports BLAS.\n  - [llama2.rs](https:\u002F\u002Fgithub.com\u002Fflaneur2020\u002Fllama2.rs) by @[flaneur2020](https:\u002F\u002Fgithub.com\u002Fflaneur2020): A Rust port of this project.\n  - [llama2-burn](https:\u002F\u002Fgithub.com\u002Fcode-cp\u002Fllama2-burn): A Rust port of this project leveraging [Burn](https:\u002F\u002Fgithub.com\u002Ftracel-ai\u002Fburn)\n- Go\n  - [go-llama2](https:\u002F\u002Fgithub.com\u002Ftmc\u002Fgo-llama2) by @[tmc](https:\u002F\u002Fgithub.com\u002Ftmc): a Go port of this project\n  - [llama2.go](https:\u002F\u002Fgithub.com\u002Fnikolaydubina\u002Fllama2.go) by @[nikolaydubina](https:\u002F\u002Fgithub.com\u002Fnikolaydubina): a Go port of this project\n  - [llama2.go](https:\u002F\u002Fgithub.com\u002Fhaormj\u002Fllama2.go) by @[haormj](https:\u002F\u002Fgithub.com\u002Fhaormj): a Go port of this project\n  - [llama2.go](https:\u002F\u002Fgithub.com\u002Fsaracen\u002Fllama2.go) by @[saracen](https:\u002F\u002Fgithub.com\u002Fsaracen): a Go port of this project\n- Android\n  - [llama2.c-android](https:\u002F\u002Fgithub.com\u002FManuel030\u002Fllama2.c-android): by @[Manuel030](https:\u002F\u002Fgithub.com\u002FManuel030): adds Android binaries of this 
project\n  - [llama2.c-android-wrapper](https:\u002F\u002Fgithub.com\u002Fcelikin\u002Fllama2.c-android-wrapper): by @[celikin](https:\u002F\u002Fgithub.com\u002Fcelikin): added JNI wrapper, PoC\n- C\n  - [llama3.c](https:\u002F\u002Fgithub.com\u002Fjameswdelancey\u002Fllama3.c): by @[jameswdelancey](https:\u002F\u002Fgithub.com\u002Fjameswdelancey): a LLaMA 3 8B Base and Instruct port of this project\n- C++\n  - [llama2.cpp](https:\u002F\u002Fgithub.com\u002Fleloykun\u002Fllama2.cpp) by @[leloykun](https:\u002F\u002Fgithub.com\u002Fleloykun): a C++ port of this project\n  - [llama2.cpp](https:\u002F\u002Fgithub.com\u002Fcoldlarry\u002Fllama2.cpp) by @[coldlarry](https:\u002F\u002Fgithub.com\u002Fcoldlarry): a C++ port of this project\n- JavaScript\n  - [llama2.js](https:\u002F\u002Fgithub.com\u002Fepicure\u002Fllama2.js) by @[epicure](https:\u002F\u002Fgithub.com\u002Fepicure): a JavaScript port of this project\n  - [llamajs](https:\u002F\u002Fgithub.com\u002Fagershun\u002Fllamajs) by @[agershun](https:\u002F\u002Fgithub.com\u002Fagershun): a JavaScript port of this project\n  - [llama2.ts](https:\u002F\u002Fgithub.com\u002Fwizzard0\u002Fllama2.ts) by @[oleksandr_now](https:\u002F\u002Ftwitter.com\u002Foleksandr_now): a TypeScript port of this project. Full Llama2-7B capable.\n  - [llama2.c-emscripten](https:\u002F\u002Fgithub.com\u002Fgohai\u002Fllama2.c-emscripten) by @[gohai](https:\u002F\u002Fgithub.com\u002Fgohai): Emscripten (JavaScript) port, based on @ggerganov's initial prototype\n- Zig\n  - [llama2.zig](https:\u002F\u002Fgithub.com\u002Fcgbur\u002Fllama2.zig) by @[cgbur](https:\u002F\u002Fgithub.com\u002Fcgbur): A Zig port of this project\n  - [llama2.zig](https:\u002F\u002Fgithub.com\u002Fvodkaslime\u002Fllama2.zig) by @[vodkaslime](https:\u002F\u002Fgithub.com\u002Fvodkaslime): a Zig port of this project\n  - [llama2.zig](https:\u002F\u002Fgithub.com\u002Fclebert\u002Fllama2.zig) by @[clebert](https:\u002F\u002Fgithub.com\u002Fclebert): a Zig port of this project\n- Julia\n  - [llama2.jl](https:\u002F\u002Fgithub.com\u002Fjuvi21\u002Fllama2.jl) by @[juvi21](https:\u002F\u002Fgithub.com\u002Fjuvi21): a Julia port of this project\n- Scala\n  - [llama2.scala](https:\u002F\u002Fgithub.com\u002Fjrudolph\u002Fllama2.scala) by @[jrudolph](https:\u002F\u002Fgithub.com\u002Fjrudolph): a Scala port of this project\n- Java\n  - [llama2.java](https:\u002F\u002Fgithub.com\u002Fmukel\u002Fllama2.java) by @[mukel](https:\u002F\u002Fgithub.com\u002Fmukel): a Java port of this project\n  - [llama2.java](https:\u002F\u002Fgithub.com\u002Fneoremind\u002Fllama2.java) by @[neoremind](https:\u002F\u002Fgithub.com\u002Fneoremind): a Java port of this project\n  - [llama2.tornadovm.java](https:\u002F\u002Fgithub.com\u002Fmikepapadim\u002Fllama2.tornadovm.java) by @[mikepapadim](https:\u002F\u002Fgithub.com\u002Fmikepapadim): an extension of the llama2.java with GPU-support through [TornadoVM](https:\u002F\u002Fgithub.com\u002Fbeehive-lab\u002FTornadoVM).\n- Kotlin\n  - [llama2.kt](https:\u002F\u002Fgithub.com\u002Fmadroidmaq\u002Fllama2.kt) by @[madroidmaq](https:\u002F\u002Fgithub.com\u002Fmadroidmaq): a Kotlin port of this project\n  - [llama2-kmp](https:\u002F\u002Fgithub.com\u002Fstepango\u002Fllama2-kmp) by @[stepango](https:\u002F\u002Fgithub.com\u002Fstepango): a Kotlin multiplatform(KMP) port of this project \n- Python\n  - [llama2.py](https:\u002F\u002Fgithub.com\u002Ftairov\u002Fllama2.py) by @[tairov](https:\u002F\u002Fgithub.com\u002Ftairov): a simple one file pure Python port of this 
project with zero dependencies\n- C#\n  - [llama2.cs](https:\u002F\u002Fgithub.com\u002Ftrrahul\u002Fllama2.cs) by @[trrahul](https:\u002F\u002Fgithub.com\u002Ftrrahul): a C# port of this project\n- F#\n  - [llama2.fs](https:\u002F\u002Fgithub.com\u002Fmicsh\u002Fllama2.fs) by @[micsh](https:\u002F\u002Fgithub.com\u002Fmicsh): a F# port of this project\n- Dart\n  - [llama2.dart](https:\u002F\u002Fgithub.com\u002Fyiminghan\u002Fllama2.dart) by @[yiminghan](https:\u002F\u002Fgithub.com\u002Fyiminghan\u002Fllama2.dart): one-file dart port of this project, works with Flutter!\n- Web\n  - [llama2c-web](https:\u002F\u002Fgithub.com\u002Fdmarcos\u002Fllama2.c-web) by @[dmarcos](https:\u002F\u002Fgithub.com\u002Fdmarcos): Super simple way to build unmodified llama2.c to WASM and run it in the browser. [Demo](https:\u002F\u002Fdiegomarcos.com\u002Fllama2.c-web\u002F)\n  - [llama2.rs.wasm](https:\u002F\u002Fgithub.com\u002Fmtb0x1\u002Fllama2.rs.wasm) by @[mtb0x1](https:\u002F\u002Fgithub.com\u002Fmtb0x1\u002F) : a [Demo](https:\u002F\u002Fmtb0x1.github.io\u002Fllama2.rs.wasm\u002F) of all listed rust ports to WASM, all in one web page.\n- WebAssembly\n  - [icpp-llm](https:\u002F\u002Fgithub.com\u002FicppWorld\u002Ficpp-llm): LLMs for the Internet Computer\n- Fortran\n  - [llama2.f90](https:\u002F\u002Fgithub.com\u002Frbitr\u002Fllama2.f90): a Fortran port of this project\n- Mojo\n  - [llama2.🔥](https:\u002F\u002Fgithub.com\u002Ftairov\u002Fllama2.mojo) by @[tairov](https:\u002F\u002Fgithub.com\u002Ftairov): pure Mojo port of this project\n- OCaml\n  - [llama2.ml](https:\u002F\u002Fgithub.com\u002Fjackpeck\u002Fllama2.ml) by @[jackpeck](https:\u002F\u002Fgithub.com\u002Fjackpeck): an OCaml port of this project\n- Hare\n  - [llama2.ha](https:\u002F\u002Fsr.ht\u002F~dvshkn\u002Fllama2.ha) by @[dvshkn](https:\u002F\u002Fgit.sr.ht\u002F~dvshkn): a Hare port of this project\n- [llama2.c - Llama 2 Everywhere](https:\u002F\u002Fgithub.com\u002Ftrholding\u002Fllama2.c) by @[trholding](https:\u002F\u002Fgithub.com\u002Ftrholding): Standalone, Bootable & Portable Binary Llama 2\n- [llama2.c-zh - Bilingual Chinese and English](https:\u002F\u002Fgithub.com\u002FchenyangMl\u002Fllama2.c-zh) by @[chenyangMl](https:\u002F\u002Fgithub.com\u002FchenyangMl): Expand tokenizer to support training and inference in both Chinese and English\n- Haskell\n  - [llama2.hs](https:\u002F\u002Fgithub.com\u002Fchris-ch\u002Fllama2.hs) by @[chris-ch](https:\u002F\u002Fgithub.com\u002Fchris-ch): an Haskell port of this project\n\n## unsorted todos\n\n- add support in run.c of reading version 1+ files from export, later deprecate \"version 0\"\n- run.cu (CUDA) investigate and merge\n- add more tests inside [test.c](test.c)\n- add Engine class for use in sample.py that does efficient inference in PyTorch, e.g. 
KV cache keeping\n- make it easier to add a new dataset with not too much pain\n- (LoRA) finetuning and export of Llama 2 models\n\n## License\n\nMIT\n","## llama2.c\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fkarpathy_llama2.c_readme_9cb9c66abd3d.jpg\" width=\"300\" height=\"300\" alt=\"可爱的小羊驼\">\n\u003C\u002Fp>\n\n你有没有想过用纯 C 语言来推理一个小型的 [Llama 2](https:\u002F\u002Fai.meta.com\u002Fllama\u002F) 模型？没有吗？那现在就可以啦！\n\n你可以先用 PyTorch 训练 Llama 2 的大模型架构，然后用一个简单的、只有 700 行的 C 文件（[run.c](run.c)）来进行推理。你可能会觉得只有拥有数十亿参数的大模型才能做点有用的事情，但实际上，如果你把任务领域限定得足够窄，非常小的模型也能表现出令人惊讶的强大能力（参考：[TinyStories](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Froneneldan\u002FTinyStories) 论文）。这个仓库提供了一个“全栈”的 Llama 2 大模型训练与推理解决方案，重点在于极简和易用。\n\n由于架构完全一致，你也可以加载并推理 Meta 官方的 Llama 2 模型。不过，目前的代码仅支持以 fp32 精度进行推理，因此对于超过 70 亿参数的模型，可能很难高效地运行。我们正在开发模型量化功能。\n\n请注意，这个仓库最初只是一个有趣的周末项目：我基于之前做的 [nanoGPT](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002FnanoGPT)，将其调整为实现 Llama-2 架构而非 GPT-2，而其中的核心工作就是用 [run.c](run.c) 编写纯 C 语言的推理引擎。所以这个项目还很年轻，进展迅速。特别感谢优秀的 [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp)，它给了我很多启发。相比 llama.cpp，我希望做到超级简单、极简且具有教育意义，因此我选择将 Llama 2 架构硬编码进去，并只用一个没有任何依赖的纯 C 文件来完成推理。\n\n## 感受魔法\n\n[![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fkarpathy\u002Fllama2.c\u002Fblob\u002Fmaster\u002Frun.ipynb)\n\n首先，导航到你存放项目的文件夹，并将这个仓库克隆到该文件夹中：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllama2.c.git\n```\n\n然后进入仓库目录：\n\n```bash\ncd llama2.c\n```\n\n现在，让我们用 C 语言运行一个小型的 Llama 2 模型。你需要一个模型检查点。下载我在这个 [TinyStories](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Froneneldan\u002FTinyStories) 数据集上训练的 1500 万参数模型（约 60MB 下载）：\n\n```bash\nwget https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Fresolve\u002Fmain\u002Fstories15M.bin\n```\n\n编译并运行 C 代码：\n\n```bash\nmake run\n.\u002Frun stories15M.bin\n```\n\n你会看到一段文本被逐步生成。在我使用的 M1 MacBook Air 上，它的速度约为 110 个 token\u002F秒。更多关于性能的信息以及可以显著提升速度的编译选项，请参阅 [性能](#performance) 部分或 Makefile。\n\n我们还可以尝试一个稍大的 4200 万参数模型：\n\n```bash\nwget https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Fresolve\u002Fmain\u002Fstories42M.bin\n.\u002Frun stories42M.bin\n```\n\n这个模型仍然能够以交互式的速度运行，并生成更加连贯和多样的故事：\n\n> 从前，有一个叫莉莉的小女孩。她最喜欢在床上玩她的玩具。有一天，她决定和她的毛绒玩具们一起开个茶话会。她倒了一些茶进一个小茶壶里，然后把它放在茶壶上。突然，她的小弟弟马克也跑进了房间，想加入茶话会。莉莉不想和马克分享她的茶，于是她让马克走开。马克开始哭起来，莉莉心里很难过。她最终决定把茶话会让给马克，两人一起分享了茶壶。然而，就在这时，一件意想不到的事情发生了。茶壶突然开始摇晃起来，莉莉和马克都吓坏了，不知道该怎么办。突然，茶壶飞了起来，一直飞到床的顶端才停了下来。莉莉和马克都很惊讶，他们紧紧抱在一起。他们意识到，分享比自私要有趣得多。从那天起，他们总是会一起分享他们的茶话会和玩具。\n\n你还可以通过前缀或额外的命令行参数来引导模型生成，比如以温度 0.8 采样 256 步，并添加一个提示词：\n\n```bash\n.\u002Frun stories42M.bin -t 0.8 -n 256 -i \"有一天，莉莉遇到了一只修格斯\"\n```\n\n> 有一天，莉莉遇到了一只修格斯。他非常害羞，但也很慷慨。莉莉说：“你好，修格斯！我可以做你的朋友吗？”修格斯很高兴能交到朋友，于是回答道：“当然可以，让我们一起探索宇宙吧！”于是，他们踏上了一段旅程，去探索宇宙。在旅途中，修格斯耐心地向莉莉介绍着宇宙中的各种奇妙事物。一天结束时，莉莉和修格斯收集了许多来自宇宙的美好东西，他们都感到无比自豪。他们承诺今后会永远携手探索宇宙，并且始终彼此慷慨相待。\n\n此外，还有一个更好的 1.1 亿参数模型可供使用，详情请参阅 [模型](#models) 部分。\n\n关于采样的几点说明：为了获得最佳效果，建议使用 `-t 1.0 -p 0.9` 进行采样，即默认的温度 1.0，同时启用 top-p 采样，值为 0.9。直观来说，top-p 可以确保那些概率极低的 token 不会被选中，从而避免在采样过程中出现“运气不好”的情况，也减少了后续生成内容偏离主题的风险。更一般地，如果你想控制生成结果的多样性，可以选择调节温度（即在 0 到 1 之间调整 `-t` 参数，同时关闭 top-p 采样，设置为 `-p 0`），或者直接调整 top-p 值（即在 0 到 1 之间调整 `-p` 参数，同时保持 `-t 1`），但不要同时调整两者。关于大模型采样策略的优秀解释文章包括 
[这篇](https:\u002F\u002Fpeterchng.com\u002Fblog\u002F2023\u002F05\u002F02\u002Ftoken-selection-strategies-top-k-top-p-and-temperature\u002F)、[这篇](https:\u002F\u002Fdocs.cohere.com\u002Fdocs\u002Fcontrolling-generation-with-top-k-top-p) 和 [这篇](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fhow-to-generate)。\n\n## Meta的Llama 2模型\n\n由于神经网络架构完全相同，我们也可以对Meta发布的Llama 2模型进行推理。遗憾的是，这里存在一些许可方面的限制（我认为我无法直接上传检查点文件）。因此，第一步是按照[Meta的说明](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fllama)获取Llama 2的检查点文件。一旦有了这些检查点，我们需要将其转换为llama2.c格式。\n\n为此，我们需要安装Python依赖项（`pip install -r requirements.txt`），然后使用`export.py`脚本，例如对于7B模型：\n\n```bash\npython export.py llama2_7b.bin --meta-llama path\u002Fto\u002Fllama\u002Fmodel\u002F7B\n```\n\n导出过程大约需要10分钟左右，并会在当前目录下生成一个大小为26GB的文件（即7B模型的float32权重），名为`llama2_7b.bin`。有报告指出，尽管已经做出了努力，但目前仍存在问题。基于以下两点原因，我现在不会尝试运行超过7B的模型：首先，13B及以上版本由于指针运算中的整数溢出问题尚未修复而无法正常工作；其次，即使修复了该问题，这个仓库目前仍然采用float32精度进行推理，速度会非常慢，实用性不高。导出完成后，我们可以运行它：\n\n```bash\n.\u002Frun llama2_7b.bin\n```\n\n在我的云服务器上，使用96线程的OpenMP编译后，该模型在CPU上的运行速度约为每秒4个token。（而在我的MacBook Air M1上，如果仅使用`make runfast`进行编译，每个token则需要近30秒。）示例输出如下：\n\n> 本文旨在突出钴氧化物生成技术的最新进展，包括近期的发展以及已投入商业应用的技术。重点放在那些最有潜力成为未来主流工艺、因而值得科学技术研发部门关注的技术上。因此，文中较为深入地介绍了俄罗斯、日本和欧洲开发的钴氧化物生成技术。文章首先简要介绍了钴氧化物这一复杂产品，并概述了钴作为关键材料的重要性。随后，文章讨论了现有钴氧化物生成工艺在能源和资本消耗以及环境影响方面的表现。\n\n基础模型……¯\\\\_(ツ)_\u002F¯。既然我们可以对基础模型进行推理，那么也应该能够相当容易地对聊天模型进行推理，并与之对话。如果我们能找到更高效地运行7B模型的方法，就可以开始将LoRA添加到我们的训练脚本中，在这个仓库内尽情地进行微调！\n\n你也可以与Llama聊天模型进行对话。导出聊天模型的方式与上述相同：\n\n```bash\npython export.py llama2_7b_chat.bin --meta-llama \u002Fpath\u002Fto\u002F7B-chat\n```\n\n然后通过指定`-m chat`标志来启用聊天模式，例如：\n\n```bash\n.\u002Frun llama2_7b_chat.bin -m chat\n```\n\n你还可以尝试Meta的Code Llama模型，尽管对其支持尚不完善。特别是，一些超参数发生了变化（例如RoPE层中的常数），因此当前的推理结果并不完全准确，且存在一定bug。目前正在寻找解决方案。请确保为普通版和指令版分别构建分词器，并在推理时传入相应的分词器。\n\n```bash\npython export.py codellama2_7b.bin --meta-llama \u002Fpath\u002Fto\u002FCodeLlama-7b\npython tokenizer.py --tokenizer-model=\u002Fpath\u002Fto\u002FCodeLlama-7b\u002Ftokenizer.model\n.\u002Frun codellama2_7b.bin -z \u002Fpath\u002Fto\u002FCodeLlama-7b\u002Ftokenizer.bin\n```\n\n与Code Llama Instruct对话：\n\n```bash\npython export.py codellama2_7b_instruct.bin --meta-llama \u002Fpath\u002Fto\u002FCodeLlama-7b-Instruct\npython tokenizer.py --tokenizer-model=\u002Fpath\u002Fto\u002FCodeLlama-7b-Instruct\u002Ftokenizer.model\n.\u002Frun codellama2_7b_instruct.bin -m chat -z \u002Fpath\u002Fto\u002FCodeLlama-7b-Instruct\u002Ftokenizer.bin\n```\n\n## int8量化\n\n上述默认脚本[run.c](run.c)采用的是float32前向传播，整个前向计算过程都以fp32精度进行。就参考代码而言，这种方式非常易于理解，但也存在以下缺点：模型检查点文件体积庞大（每个权重占用4字节），且前向传播速度相对较慢。实践中常用的优化方法是将模型参数量化为较低精度，以牺牲少量精度为代价，换取更小的检查点文件和更快的前向传播速度（因为大多数推理操作都使用整数运算）。经验表明，LLM可以容忍低至4位甚至更低的精度，但我们在这里选择int8，因为它是一种“安全”的设置，能够在获得优势的同时不过度牺牲模型的准确性。只有参与矩阵乘法的权重会被量化。其他参数（尤其是RMSNorm中的缩放和偏置）则保持为float32，因为这些层对精度非常敏感。如果你仅仅是为了减小检查点文件的大小，也可以只量化权重并保存检查点，然后在run.c中将其反量化回float32，再像往常一样进行推理即可，这样做并无不可。然而，在这里，我们更进一步（这也是行业标准做法），在前向传播过程中也对激活值进行量化。这需要我们在运行时动态地在float32和int8之间进行量化和反量化，从而增加了一定的开销。但好处在于，现在大部分计算（尤其是矩阵乘法！）都使用纯整数运算，权重和激活值均以int8形式输入。正是这一点带来了根本性的速度提升。我们使用的版本是“Q8_0”量化（llama.cpp术语），其中的0表示权重量化是对称于0的，量化范围为[-127, 127]。\n\n量化后的前向传播实现于[runq.c](runq.c)中。要使用它，我们必须以量化格式导出模型。例如，Llama 2 7B的float32版本导出命令如下：\n\n```bash\npython export.py llama2_7b.bin --meta-llama path\u002Fto\u002Fllama\u002Fmodel\u002F7B\n```\n\n这会生成一个26GB的文件，因为7B模型中的每个参数都是4字节（fp32）。若要以量化格式导出，则需使用版本2的导出命令：\n\n```bash\npython export.py llama2_7b_q80.bin --version 2 --meta-llama 
path\u002Fto\u002Fllama\u002Fmodel\u002F7B\n```\n\n这个过程只需几分钟，但最终生成的文件仅为6.7GB。对于非Meta的检查点，应使用`--checkpoint`参数而非`--meta-llama`参数（更多相关说明将在下文介绍）。接下来让我们对这些模型进行推理。我喜欢在这里使用OMP，因为这些都是大模型，例如在我的Linux服务器上：\n\n```bash\nmake runomp\nOMP_NUM_THREADS=64 .\u002Frun llama2_7b.bin -n 40\nOMP_NUM_THREADS=64 .\u002Frunq llama2_7b_q80.bin -n 40\n```\n\n这里运行40步只是为了测量速度。对我而言，float32版本的运行速度为每秒4.6个token，而int8版本则为每秒14个token。因此，我们在将检查点文件大小缩小4倍的同时，实现了3倍的速度提升。然而，由于前向传播被量化为int8，其质量也会相应地略有下降。\n\n## Hugging Face 模型\n\n我们可以加载任何使用 Llama 2 架构的 Hugging Face 模型。请参阅脚本 [export.py](export.py) 和 `--hf` 标志，以导出模型的 .bin 文件。\n\n## 模型\n\n为了提供一些较小、从头开始训练的示例模型，我在 TinyStories 数据集上训练了一系列小型模型。这些模型都在我的训练环境中（4 张 A100 40GB GPU）花费了几小时便完成了训练，其中 1.1 亿参数的模型大约用了 24 小时。我将这些模型托管在 Hugging Face Hub 的 [tinyllamas](https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas) 仓库中，既有原始的 PyTorch .pt 格式，也有 llama2.c 格式的 .bin 文件：\n\n| 模型 | 模型维度 | 层数 | 注意力头数 | 用于键值的注意力头数 | 最大上下文长度 | 参数量 | 验证损失 | 下载链接 |\n| --- | --- | --- | --- | --- | --- | --- | --- | --- |\n| 26万 | 64 | 5 | 8 | 4 | 512 | 26万 | 1.297 | [stories260K](https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Ftree\u002Fmain\u002Fstories260K) |\n| OG | 288 | 6 | 6 | 6 | 256 | 1500万 | 1.072 | [stories15M.bin](https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Fresolve\u002Fmain\u002Fstories15M.bin) |\n| 4200万 | 512 | 8 | 8 | 8 | 1024 | 4200万 | 0.847 | [stories42M.bin](https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Fresolve\u002Fmain\u002Fstories42M.bin) |\n| 1.1亿 | 768 | 12 | 12 | 12 | 1024 | 1.1亿 | 0.760 | [stories110M.bin](https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Fresolve\u002Fmain\u002Fstories110M.bin) |\n\n你会发现，1.1 亿参数的模型在规模上与 GPT-1 相当。或者也可以将其视为 GPT-2 系列中最小的模型（GPT-2 small），只是最大上下文长度只有 1024 而非 2048。Llama 相较于 GPT-1\u002F2 架构的主要区别在于：Llama 使用 RoPE 相对位置编码而非绝对或可学习的位置编码，在 MLP 中采用了更为复杂的 SwiGLU 非线性激活函数，使用 RMSNorm 而不是 LayerNorm，所有 Linear 层均不使用偏置项，并且可以选择多查询注意力机制。\n\n## 训练\n\n让我们看看如何使用本仓库中的代码从零开始训练一个小型 Llama 2 模型。首先，我们需要下载并预处理一些源数据集。例如，我喜欢 [TinyStories](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Froneneldan\u002FTinyStories) 数据集，因此目前本仓库中仅提供了该数据集的示例。不过，添加其他数据集应该非常容易，请参考代码。\n\n```bash\npython tinystories.py download\npython tinystories.py pretokenize\n```\n\n然后就可以开始训练我们的模型了：\n\n```bash\npython train.py\n```\n\n**简要训练指南**。更多复杂的启动方式和超参数覆盖，请参阅 train.py 脚本。以下是一个关于如何设置参数的简要指南。请参考 [Chinchilla 论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.15556) 末尾的表格，了解 Transformer 的各个参数（模型维度、层数、注意力头数）是如何协同变化的。根据这一规律外推或插值，即可得到更大或更小的 Transformer 模型。至于最大上下文长度，则可以根据具体任务自行设定：它应为预测下一个 token 时需要考虑的最大 token 数量。例如，Llama 2 使用 2048。接下来，你需要确保每次更新的总批量大小（脚本会打印“每次迭代将处理的 token 数量为：”）在中等规模应用中保持在约 10 万个 token 左右。对于小型应用可以更低，而对于大规模训练（如 GPT 或 Llama）通常约为 50 万个，甚至更多。实现方法是先将 batch_size 设置到系统允许的最大值（例如，我最近的运行中设置为 16，因为再高就会导致 GPU 内存不足），然后再通过增加 gradient_accumulation_steps 来达到约 10 万个 token 的总批量大小。最后，你需要调整学习率（LR）。尽量将其设置到训练允许的最高值。非常小的网络可以承受较高的学习率（如 1e-3 甚至更高）。而大型网络则需要较低的学习率。3e-4 是大多数中等规模应用的安全选择，但对于小型网络可能过低，因此可以尝试适当提高。max_iters 则决定了训练的持续时间。你可以尝试不同的设置。我通常只调整这些参数，而其他大部分参数都保持不变。以下是我训练 1.1 亿参数模型的一个例子，虽然我认为并不算最优，但对我来说还算合理：模型维度 768，层数 12，注意力头数 12（每个头的大小为 768 \u002F 12 = 64 个通道），序列长度 1024，batch_size 16（这是我 A100 40GB GPU 能够容纳的最大值），gradient_accumulation_steps 设为 8，这样每次更新的总 token 数量就能达到 16 × 1024 × 8 = 131,072 个 token。很好。学习率设为 4e-4（可能还是有点低）。max_iters 设为 20 万（可能又有点高）。dropout 设为 0.1，因为在中等规模下这通常会有一定帮助。就这样。我在云服务器上的 4 张 GPU 上使用分布式数据并行（DDP）进行训练，整个过程大约花费了一天左右的时间。\n\n当然，如果你只想进行简单的演示，完全可以跳过模型训练，直接下载其中一个预训练好的模型（见 [模型](#models) 部分），例如：\n\n```bash\nwget 
https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Fresolve\u002Fmain\u002Fstories15M.bin\n```\n\n一旦我们有了 model.bin 文件，就可以用 C 语言进行推理。首先编译 C 代码：\n\n```bash\nmake run\n```\n\n然后就可以简单地运行：\n\n```bash\n.\u002Frun stories15M.bin\n```\n\n看着 token 流水般地输出，真是有趣！我们也可以运行 PyTorch 推理脚本进行对比。再次从 Hugging Face Hub 下载一个模型，并将其路径指向 sample.py 脚本：\n\n```bash\nwget https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Fresolve\u002Fmain\u002Fstories15M.pt -P out15M\npython sample.py --checkpoint=out15M\u002Fstories15M.pt\n```\n\n这样得到的结果是一样的。\n\n## 自定义分词器\n\n在上述内容中，我们一直假设使用的是拥有32,000个词汇的自定义 Llama 2 分词器。然而，在许多小型专精的 LLM 中，使用如此庞大的词汇表可能显得过于冗余。如果你有一个小型的应用场景，训练一个属于自己的分词器可能会更加合适。这样做有许多好处：更小的词汇表意味着模型参数量减少（因为词嵌入表会小很多），推理速度也会更快（因为需要预测的词元数量更少），而且每个样本的平均序列长度也可能缩短（因为在你的数据上压缩效率更高）。那么，接下来我们就来看看如何训练一个自定义分词器。\n\n默认情况下，为了对 tinystories 数据集进行预分词处理，我们需要依次运行以下命令：\n\n```\npython tinystories.py download\npython tinystories.py pretokenize\n```\n\n这里的 `pretokenize` 阶段会加载 Llama 2 分词器（词汇表大小为32,000），并用它将下载的文本转换为整数序列，然后保存到文件中。现在我们将按照如下方式修改，以训练一个4096个词元的示例分词器：\n\n```\npython tinystories.py download\npython tinystories.py train_vocab --vocab_size=4096\npython tinystories.py pretokenize --vocab_size=4096\n```\n\n`train_vocab` 阶段会调用 `sentencepiece` 库来训练分词器，并将其保存到一个新的文件 `data\u002Ftok4096.model` 中。我尽可能地复现了 Meta 训练其词汇表时所采用的设置。该过程使用字节对编码算法，从原始的 UTF-8 字节序列开始，逐步合并最常见的连续词元对，从而构建出词汇表。请查看 `tinystories.py` 文件——自定义分词器会被存储在一个特殊的目录结构中，该结构按词汇表大小进行索引。\n\n值得一提的是，专门针对 tinystories 数据训练的4096个词元的词汇表，生成的整数序列的平均长度与默认的32,000个词元的 Llama 2 分词器几乎相同！这意味着我们的自定义分词器更能适应特定的文本数据，能够非常高效地对其进行压缩。因此，我们训练出的模型体积更小、运行速度更快。\n\n现在我们已经使用自定义分词器对数据集进行了预分词处理，接下来就可以开始训练模型了。训练脚本 `train.py` 并不关心具体的词元内容，它只关注词汇表大小，以便正确初始化模型。所以在训练模型时，请确保传递以下参数：\n\n```\npython train.py --vocab_source=custom --vocab_size=4096\n```\n\n（默认值分别为 `llama2` 和 `32000`，即默认的 Llama 2 分词器。）这样就可以开始训练模型了。最后，我们就可以使用 `run.c` 脚本进行推理了。为此我们需要两样东西。第一，我们需要将分词器导出为 `.bin` 格式，可以通过以下命令完成：\n\n```\npython tokenizer.py --tokenizer-model=data\u002Ftok4096.model\n```\n\n这会将分词器写入 `data\u002Ftok4096.bin` 文件中。现在我们可以运行推理程序，并通过 `-z` 标志指向这个分词器：\n\n```\n.\u002Frun out\u002Fmodel.bin -z data\u002Ftok4096.bin\n```\n\n这样应该就能打印出生成的样本。如果省略 `-z` 标志，程序将会使用默认的 Llama 2 分词器，虽然它会生成一串不错的整数序列，但这些整数会使用不同的词汇表被翻译成文本，因此看起来会像乱码。\n\n## 性能\n\n根据你的系统配置，有多种方法可以进一步提升代码的性能。请查看 [Makefile](Makefile)，其中包含大量注释。目前，`make run` 命令默认使用 `-O3` 优化选项，即：\n\n```bash\ngcc -O3 -o run run.c -lm\n```\n\n-O3 包含一些编译时间和内存消耗较高的优化措施，例如向量化、循环展开以及分支预测等。若想获得更好的性能，可以尝试使用 `make runfast` 进行编译。该选项启用了 `-Ofast` 标志，除了包含 `-O3` 的所有优化外，还增加了一些可能不符合 C\u002FIEEE 标准的优化。更多信息请参阅 [GCC 文档](https:\u002F\u002Fgcc.gnu.org\u002Fonlinedocs\u002Fgcc\u002FOptimize-Options.html)。\n\n此外，还可以尝试使用 `-march=native` 编译选项，使程序针对当前编译机器的架构进行优化，而不是使用通用的 CPU 指令集。这可能会启用额外的优化和硬件相关的调优，比如改进的向量指令宽度等。我在 MacBook Air (M1) 上测试过的最快吞吐量就是使用 `make runfast` 得到的。\n\n你也可以尝试用 `clang` 替代 `gcc` 来编译。\n\n如果继续使用 gcc 编译，可以尝试实验 `-funroll-all-loops` 选项，详情请参见 PR [#183](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllama2.c\u002Fpull\u002F183)。\n\n**OpenMP**。通过 OpenMP 编译也能带来显著的性能提升，它会“激活”矩阵乘法和注意力机制中的 `#pragma omp parallel for` 指令，从而将循环中的计算任务分配到多个处理器上。你需要先安装 OpenMP 库和 clang 编译器（例如，在 Ubuntu 系统上可以运行 `apt install clang libomp-dev`）。之后，你可以使用 `make runomp` 进行编译，该命令会执行以下操作：\n\n```bash\nclang -Ofast -fopenmp -march=native run.c  -lm  -o run\n```\n\n在运行推理时，请务必设置 OpenMP 相关的线程数标志，例如：\n\n```bash\nOMP_NUM_THREADS=4 .\u002Frun out\u002Fmodel.bin\n```\n\n根据你的系统资源情况，你可能需要调整这些超参数，使用更多的线程。不过，线程数并不是越多越好，通常呈现 U 形曲线。特别是如果你的 CPU 支持 SMT（多线程技术），建议将线程数设置为物理核心的数量，而非逻辑核心的数量。这是因为过多的线程会导致缓存抖动和通信开销，从而影响性能。PyTorch 官方文档中的 [CPU 
特定优化指南](https:\u002F\u002Fpytorch.org\u002Ftutorials\u002Frecipes\u002Frecipes\u002Ftuning_guide.html#cpu-specific-optimizations) 也提供了与此相关的一些有用信息。\n\n## 平台支持\n\n在 **Windows** 系统上，可以在 Visual Studio 命令提示符中使用 `build_msvc.bat` 脚本通过 MSVC 编译器进行构建；或者使用 `make win64` 命令，利用来自 Linux 或 Windows 的 MinGW 编译工具链来构建适用于 Windows 的目标文件。MSVC 构建会自动启用 OpenMP，并根据你的 CPU 设置合适的线程数，除非你手动设置了 `OMP_NUM_THREADS` 环境变量。\n\n在 **CentOS 7** 和 **Amazon Linux 2018** 系统上，可以使用 `rungnu` Makefile 目标：`make rungnu` 或 `make runompgnu` 来启用 OpenMP。\n\n在 **Mac** 系统上，可以使用 Homebrew 安装的 clang 工具链来进行 OpenMP 编译。首先通过 `brew install llvm` 安装 clang，然后使用已安装的 clang 二进制文件进行编译：`make runomp CC=\u002Fopt\u002Fhomebrew\u002Fopt\u002Fllvm\u002Fbin\u002Fclang`。\n\n## 测试\n\n你可以使用 `pytest` 轻松运行测试：\n\n```bash\n$ pip install pytest\n$ pytest\n```\n\n目前这将调用 `test_all.py` 中的两个测试，它们分别以 C 和 Python 语言对模型进行 200 步前向传播，并将输出与已知正确的预期输出进行比对。这些测试通常只需几秒钟即可完成，不过在首次运行时会下载并缓存 stories260K 模型到一个临时的 `test` 目录中（下载量仅约 2MB）。\n\n此外，C 语言中也有部分测试，位于 [test.c](test.c) 文件中。你可以通过运行 `make testcc` 来执行这些测试，或者使用以下命令查看更多输出信息：\n\n```\nmake testcc VERBOSITY=1\n```\n\n诚邀大家帮忙添加更多测试！\n\n## 致谢\n\n我是在 Lambda Labs 提供的一台配备 4 张 A100 40GB 显卡的机器上训练了 llama2.c 故事生成模型，在此特别感谢他们提供的支持。\n\n## Discord\n\n我想可以复用我现有的 Discord 频道（用于我的 [零到英雄 YouTube 系列](https:\u002F\u002Fkarpathy.ai\u002Fzero-to-hero.html)），欢迎加入 [Discord](https:\u002F\u002Fdiscord.gg\u002F3zy8kqD9Cp) 上的 #llama2c 频道，讨论相关问题或进行交流。\n\n## 贡献说明\n\n关于本仓库以及可能被接受的 PR 类型，我想简单说明几点。这个仓库的目标是什么？我认为未来将会涌现出大量对训练或微调小型自定义大模型（参数量大约在 1 亿到 10 亿之间，最多可达 100 亿）的兴趣，这些模型将应用于各种不同的场景，并部署到边缘计算环境中（例如微控制器、手机、浏览器、笔记本电脑等）。我希望这个仓库能够成为最简单、最小、最具可 hack 性的工具库，同时支持训练和推理流程。具体来说，它并不是一个拥有上千个配置项、代码晦涩难懂且目录结构复杂、包含数百个文件的大型框架。相反，我预计大多数用户会选择基于本仓库创建分支，并根据自身需求和部署平台对其进行定制化改造。\n\n如果有人最关心部署效率，那么应该关注 [llama.cpp](https:\u002F\u002Fgithub.com\u002Fggerganov\u002Fllama.cpp)。本仓库同样重视效率，但不会以牺牲简洁性、可读性和可移植性为代价。实际上，许多人选择本仓库正是因为其训练代码仅有两份易于阅读的 Python 文件，而推理代码则是一段约 500 行的 C 代码。因此，我希望它继续作为最简单的“参考实现”，方便用户在其基础上创建分支，快速适配到自己感兴趣的下游应用中。它不需要具备完整功能，也不需要支持上百种选项或设置，更不必追求极致的效率。以下是一些例子：\n\n- 如果有人调整了两个循环的顺序以提升数据局部性从而获得小幅性能提升，这样的 PR 将会被立即合并。\n- 如果有人添加了一行 `#pragma omp parallel for`，这样就可以启用 OpenMP 编译来显著加速代码；如果不启用 OpenMP，则这一行代码也会被当作注释而不会产生影响——这种改动同样会被立即合并。\n- 对于 bug 修复和小的优化等，我也非常乐意合并。\n\n然而，以下类型的 PR 则不太适合：\n\n- 在代码中到处添加过多的 `#ifdef` 宏定义。如果这些宏定义是局部性的且数量不多，或许还可以接受。\n- 添加大量针对特定平台（如微控制器、某些特殊版本的 Linux 或处理器）的专用代码。这类内容更适合放在项目的分支中，我也很乐意在下方列出这些分支的链接。\n- 向 `run.c` 文件中添加数百行仅在特定场景或平台上生效的代码。\n\n需要注意的是，即使你的 PR 包含上述内容，也并不意味着一定无法被合并，只是可能会进入一个较为模糊的范畴。简而言之：我非常期待合并那些规模较小、改动局部化、适用范围广、代码整洁，并且能够在保持易用性和可读性的同时提升仓库效率和可移植性的更改。感谢所有致力于改进本项目的人！\u003C3\n\n## 值得关注的分支\n\n- Rust\n  - [llama2.rs](https:\u002F\u002Fgithub.com\u002Fgaxler\u002Fllama2.rs) 由 @[gaxler](https:\u002F\u002Fgithub.com\u002Fgaxler)：该项目的 Rust 移植版\n  - [llama2.rs](https:\u002F\u002Fgithub.com\u002Fleo-du\u002Fllama2.rs) 由 @[leo-du](https:\u002F\u002Fgithub.com\u002Fleo-du)：该项目的 Rust 移植版\n  - [llama2-rs](https:\u002F\u002Fgithub.com\u002Fdanielgrittner\u002Fllama2-rs) 由 @[danielgrittner](https:\u002F\u002Fgithub.com\u002Fdanielgrittner)：该项目的 Rust 移植版\n  - [llama2.rs](https:\u002F\u002Fgithub.com\u002Flintian06\u002Fllama2.rs) 由 @[lintian06](https:\u002F\u002Fgithub.com\u002Flintian06)：该项目的 Rust 移植版\n  - [pecca.rs](https:\u002F\u002Fgithub.com\u002Frahoua\u002Fpecca-rs) 由 @[rahoua](https:\u002F\u002Fgithub.com\u002Frahoua)：利用 [ndarray](https:\u002F\u002Fgithub.com\u002Frust-ndarray\u002Fndarray) 的 Rust 移植版，支持 BLAS。\n  - [llama2.rs](https:\u002F\u002Fgithub.com\u002Fflaneur2020\u002Fllama2.rs) 由 @[flaneur2020](https:\u002F\u002Fgithub.com\u002Fflaneur2020)：该项目的 Rust 移植版。\n  - 
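补充一个独立的小例子，说明上面贡献指南中“调整两个循环的顺序以提升数据局部性”指的是哪类改动（函数与命名均为假设，并非 run.c 中的实际代码）：

```c
#include <stddef.h>

// 行主序方阵乘法 C += A * B（假设 C 已预先清零，仅作演示）。
// 教科书式的 i-j-k 写法会让最内层循环按列跨步访问 B，缓存命中率差；
// 交换成 i-k-j 顺序后，最内层对 B 和 C 都是连续的行内访问，
// 计算结果完全相同，但在大多数 CPU 上明显更快。
void matmul_ikj(const float* A, const float* B, float* C, size_t n) {
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            float a = A[i * n + k];
            for (size_t j = 0; j < n; j++) {
                C[i * n + j] += a * B[k * n + j];
            }
        }
    }
}
```

这类改动不新增任何选项、不影响可读性，却能带来实际的性能收益，正是前文所说“会被立即合并”的 PR 类型。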
[llama2-burn](https:\u002F\u002Fgithub.com\u002Fcode-cp\u002Fllama2-burn)：利用 [Burn](https:\u002F\u002Fgithub.com\u002Ftracel-ai\u002Fburn) 的该项目的 Rust 移植版\n- Go\n  - [go-llama2](https:\u002F\u002Fgithub.com\u002Ftmc\u002Fgo-llama2) 由 @[tmc](https:\u002F\u002Fgithub.com\u002Ftmc)：该项目的 Go 移植版\n  - [llama2.go](https:\u002F\u002Fgithub.com\u002Fnikolaydubina\u002Fllama2.go) 由 @[nikolaydubina](https:\u002F\u002Fgithub.com\u002Fnikolaydubina)：该项目的 Go 移植版\n  - [llama2.go](https:\u002F\u002Fgithub.com\u002Fhaormj\u002Fllama2.go) 由 @[haormj](https:\u002F\u002Fgithub.com\u002Fhaormj)：该项目的 Go 移植版\n  - [llama2.go](https:\u002F\u002Fgithub.com\u002Fsaracen\u002Fllama2.go) 由 @[saracen](https:\u002F\u002Fgithub.com\u002Fsaracen)：该项目的 Go 移植版\n- Android\n  - [llama2.c-android](https:\u002F\u002Fgithub.com\u002FManuel030\u002Fllama2.c-android)：由 @[Manuel030](https:\u002F\u002Fgithub.com\u002FManuel030)：添加了该项目的 Android 二进制文件\n  - [llama2.c-android-wrapper](https:\u002F\u002Fgithub.com\u002Fcelikin\u002Fllama2.c-android-wrapper)：由 @[celikin](https:\u002F\u002Fgithub.com\u002Fcelikin)：添加了 JNI 封装，PoC\n- C\n  - [llama3.c](https:\u002F\u002Fgithub.com\u002Fjameswdelancey\u002Fllama3.c)：由 @[jameswdelancey](https:\u002F\u002Fgithub.com\u002Fjameswdelancey)：该项目的 LLaMA 3 8B Base 和 Instruct 移植版\n- C++\n  - [llama2.cpp](https:\u002F\u002Fgithub.com\u002Fleloykun\u002Fllama2.cpp) 由 @[leloykun](https:\u002F\u002Fgithub.com\u002Fleloykun)：该项目的 C++ 移植版\n  - [llama2.cpp](https:\u002F\u002Fgithub.com\u002Fcoldlarry\u002Fllama2.cpp) 由 @[coldlarry](https:\u002F\u002Fgithub.com\u002Fcoldlarry)：该项目的 C++ 移植版\n- JavaScript\n  - [llama2.js](https:\u002F\u002Fgithub.com\u002Fepicure\u002Fllama2.js) 由 @[epicure](https:\u002F\u002Fgithub.com\u002Fepicure)：该项目的 JavaScript 移植版\n  - [llamajs](https:\u002F\u002Fgithub.com\u002Fagershun\u002Fllamajs) 由 @[agershun](https:\u002F\u002Fgithub.com\u002Fagershun)：该项目的 JavaScript 移植版\n  - [llama2.ts](https:\u002F\u002Fgithub.com\u002Fwizzard0\u002Fllama2.ts) 由 @[oleksandr_now](https:\u002F\u002Ftwitter.com\u002Foleksandr_now)：该项目的 TypeScript 移植版。完全支持 Llama2-7B。\n  - [llama2.c-emscripten](https:\u002F\u002Fgithub.com\u002Fgohai\u002Fllama2.c-emscripten) 由 @[gohai](https:\u002F\u002Fgithub.com\u002Fgohai)：基于 @ggerganov 最初原型的 Emscripten (JavaScript) 移植版\n- Zig\n  - [llama2.zig](https:\u002F\u002Fgithub.com\u002Fcgbur\u002Fllama2.zig) 由 @[cgbur](https:\u002F\u002Fgithub.com\u002Fcgbur)：该项目的 Zig 移植版\n  - [llama2.zig](https:\u002F\u002Fgithub.com\u002Fvodkaslime\u002Fllama2.zig) 由 @[vodkaslime](https:\u002F\u002Fgithub.com\u002Fvodkaslime)：该项目的 Zig 移植版\n  - [llama2.zig](https:\u002F\u002Fgithub.com\u002Fclebert\u002Fllama2.zig) 由 @[clebert](https:\u002F\u002Fgithub.com\u002Fclebert)：该项目的 Zig 移植版\n- Julia\n  - [llama2.jl](https:\u002F\u002Fgithub.com\u002Fjuvi21\u002Fllama2.jl) 由 @[juvi21](https:\u002F\u002Fgithub.com\u002Fjuvi21)：该项目的 Julia 移植版\n- Scala\n  - [llama2.scala](https:\u002F\u002Fgithub.com\u002Fjrudolph\u002Fllama2.scala) 由 @[jrudolph](https:\u002F\u002Fgithub.com\u002Fjrudolph)：该项目的 Scala 移植版\n- Java\n  - [llama2.java](https:\u002F\u002Fgithub.com\u002Fmukel\u002Fllama2.java) 由 @[mukel](https:\u002F\u002Fgithub.com\u002Fmukel)：该项目的 Java 移植版\n  - [llama2.java](https:\u002F\u002Fgithub.com\u002Fneoremind\u002Fllama2.java) 由 @[neoremind](https:\u002F\u002Fgithub.com\u002Fneoremind)：该项目的 Java 移植版\n  - [llama2.tornadovm.java](https:\u002F\u002Fgithub.com\u002Fmikepapadim\u002Fllama2.tornadovm.java) 由 @[mikepapadim](https:\u002F\u002Fgithub.com\u002Fmikepapadim)：通过 
[TornadoVM](https:\u002F\u002Fgithub.com\u002Fbeehive-lab\u002FTornadoVM) 为 llama2.java 添加 GPU 支持的扩展版本。\n- Kotlin\n  - [llama2.kt](https:\u002F\u002Fgithub.com\u002Fmadroidmaq\u002Fllama2.kt) 由 @[madroidmaq](https:\u002F\u002Fgithub.com\u002Fmadroidmaq)：该项目的 Kotlin 移植版\n  - [llama2-kmp](https:\u002F\u002Fgithub.com\u002Fstepango\u002Fllama2-kmp) 由 @[stepango](https:\u002F\u002Fgithub.com\u002Fstepango)：该项目的 Kotlin 多平台（KMP）移植版\n- Python\n  - [llama2.py](https:\u002F\u002Fgithub.com\u002Ftairov\u002Fllama2.py) 由 @[tairov](https:\u002F\u002Fgithub.com\u002Ftairov)：一个简单的单文件纯 Python 移植版，无任何依赖\n- C#\n  - [llama2.cs](https:\u002F\u002Fgithub.com\u002Ftrrahul\u002Fllama2.cs) 由 @[trrahul](https:\u002F\u002Fgithub.com\u002Ftrrahul)：该项目的 C# 移植版\n- F#\n  - [llama2.fs](https:\u002F\u002Fgithub.com\u002Fmicsh\u002Fllama2.fs) 由 @[micsh](https:\u002F\u002Fgithub.com\u002Fmicsh)：该项目的 F# 移植版\n- Dart\n  - [llama2.dart](https:\u002F\u002Fgithub.com\u002Fyiminghan\u002Fllama2.dart) 由 @[yiminghan](https:\u002F\u002Fgithub.com\u002Fyiminghan\u002Fllama2.dart)：该项目的一键式 Dart 移植版，可与 Flutter 配合使用！\n- Web\n  - [llama2c-web](https:\u002F\u002Fgithub.com\u002Fdmarcos\u002Fllama2.c-web) 由 @[dmarcos](https:\u002F\u002Fgithub.com\u002Fdmarcos)：将未修改的 llama2.c 轻松编译为 WASM 并在浏览器中运行。[演示](https:\u002F\u002Fdiegomarcos.com\u002Fllama2.c-web\u002F)\n  - [llama2.rs.wasm](https:\u002F\u002Fgithub.com\u002Fmtb0x1\u002Fllama2.rs.wasm) 由 @[mtb0x1](https:\u002F\u002Fgithub.com\u002Fmtb0x1\u002F)：所有列出的 Rust 移植版都被编译为 WASM，并整合到一个网页中，提供[演示](https:\u002F\u002Fmtb0x1.github.io\u002Fllama2.rs.wasm\u002F)。\n- WebAssembly\n  - [icpp-llm](https:\u002F\u002Fgithub.com\u002FicppWorld\u002Ficpp-llm)：适用于 Internet Computer 的 LLM\n- Fortran\n  - [llama2.f90](https:\u002F\u002Fgithub.com\u002Frbitr\u002Fllama2.f90)：该项目的 Fortran 移植版\n- Mojo\n  - [llama2.🔥](https:\u002F\u002Fgithub.com\u002Ftairov\u002Fllama2.mojo) 由 @[tairov](https:\u002F\u002Fgithub.com\u002Ftairov)：该项目的纯 Mojo 移植版\n- OCaml\n  - [llama2.ml](https:\u002F\u002Fgithub.com\u002Fjackpeck\u002Fllama2.ml) 由 @[jackpeck](https:\u002F\u002Fgithub.com\u002Fjackpeck)：该项目的 OCaml 移植版\n- Hare\n  - [llama2.ha](https:\u002F\u002Fsr.ht\u002F~dvshkn\u002Fllama2.ha) 由 @[dvshkn](https:\u002F\u002Fgit.sr.ht\u002F~dvshkn)：该项目的 Hare 移植版\n- [llama2.c - Llama 2 无处不在](https:\u002F\u002Fgithub.com\u002Ftrholding\u002Fllama2.c) 由 @[trholding](https:\u002F\u002Fgithub.com\u002Ftrholding)：独立、可引导且便携的二进制 Llama 2\n- [llama2.c-zh - 中英双语](https:\u002F\u002Fgithub.com\u002FchenyangMl\u002Fllama2.c-zh) 由 @[chenyangMl](https:\u002F\u002Fgithub.com\u002FchenyangMl)：扩展分词器，以支持中文和英文的训练与推理\n- Haskell\n  - [llama2.hs](https:\u002F\u002Fgithub.com\u002Fchris-ch\u002Fllama2.hs) 由 @[chris-ch](https:\u002F\u002Fgithub.com\u002Fchris-ch)：该项目的 Haskell 移植版\n\n## 未分类待办事项\n\n- 在 run.c 中添加支持从导出文件读取版本 1 及以上文件的功能，随后弃用“版本 0”\n- 对 run.cu (CUDA) 进行研究并合并\n- 在 [test.c](test.c) 中添加更多测试\n- 在 sample.py 中添加 Engine 类，用于在 PyTorch 中进行高效推理，例如保持 KV 缓存\n- 使添加新数据集变得更加容易，减少复杂性\n- （LoRA）对 Llama 2 模型进行微调并导出\n\n## 许可证\n\nMIT","# llama2.c 快速上手指南\n\nllama2.c 是一个极简的 Llama 2 大语言模型训练与推理项目。它允许你使用纯 C 语言（仅约 700 行代码）来推理 Llama 2 架构的模型，无需任何外部依赖，非常适合学习和轻量级部署。\n\n## 环境准备\n\n*   **操作系统**: Linux, macOS (推荐), 或 Windows (需配置 MinGW\u002FWSL)。\n*   **编译器**: 需要安装 `make` 和 C 编译器 (如 `gcc` 或 `clang`)。\n    *   Ubuntu\u002FDebian: `sudo apt-get install build-essential`\n    *   macOS: 安装 Xcode Command Line Tools (`xcode-select --install`)\n*   **Python (可选)**: 仅当你需要转换 Meta 官方模型或自定义训练时需要。如需转换官方模型，需安装依赖：`pip install -r requirements.txt`。\n*   **硬件建议**:\n    *   运行示例小模型 (15M\u002F42M): 
任意现代 CPU 即可。\n    *   运行 7B 模型: 需要较大内存 (FP32 版本约需 26GB RAM)，建议使用多核 CPU 并开启 OpenMP 加速。\n\n## 安装步骤\n\n1.  **克隆仓库**\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllama2.c.git\n    cd llama2.c\n    ```\n\n2.  **下载预训练模型**\n    为了快速体验，首先下载作者提供的基于 TinyStories 数据集训练的小型模型（无需注册 Meta 账号）。\n    \n    *   下载 15M 参数模型 (~60MB):\n        ```bash\n        wget https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Fresolve\u002Fmain\u002Fstories15M.bin\n        ```\n    *   或者下载 42M 参数模型 (故事更连贯):\n        ```bash\n        wget https:\u002F\u002Fhuggingface.co\u002Fkarpathy\u002Ftinyllamas\u002Fresolve\u002Fmain\u002Fstories42M.bin\n        ```\n    \n    > **国内加速提示**: 如果 HuggingFace 下载速度慢，可使用国内镜像源（如 ModelScope 或 hf-mirror）代理下载，或在终端设置代理。\n\n3.  **编译项目**\n    使用 `make` 命令编译推理引擎。\n    ```bash\n    make run\n    ```\n    *(注：如果需要极致性能且编译器支持 OpenMP，可尝试 `make runfast`)*\n\n## 基本使用\n\n### 1. 运行最小模型\n直接运行编译好的程序并加载模型文件，即可看到模型生成故事。\n```bash\n.\u002Frun stories15M.bin\n```\n在 M1 MacBook Air 上，该模型推理速度约为 110 tokens\u002Fs。\n\n### 2. 运行更大模型并指定参数\n使用 42M 模型，并通过命令行参数控制温度 (`-t`)、生成长度 (`-n`) 和提示词 (`-i`)。\n```bash\n.\u002Frun stories42M.bin -t 0.8 -n 256 -i \"One day, Lily met a Shoggoth\"\n```\n*   `-t 0.8`: 采样温度为 0.8。\n*   `-n 256`: 生成 256 个 token。\n*   `-i \"...\"`: 输入提示词。\n\n**采样建议**: 为了获得最佳效果，推荐使用 `-t 1.0 -p 0.9`（默认配置），即保持温度为 1.0 并开启 Top-p 采样 (0.9)，以避免生成低概率的混乱文本。\n\n### 3. (进阶) 运行 Meta 官方 Llama 2 模型\n如果你已获取 Meta 官方的 Llama 2 权重，需先将其转换为 `llama2.c` 格式。\n\n1.  **导出模型** (需先 `pip install -r requirements.txt`):\n    ```bash\n    python export.py llama2_7b.bin --meta-llama path\u002Fto\u002Fllama\u002Fmodel\u002F7B\n    ```\n    *注意：当前版本主要支持 FP32 推理，7B 模型生成的文件约为 26GB，且推理速度较慢。暂不建议尝试 13B+ 模型。*\n\n2.  **运行模型**:\n    ```bash\n    .\u002Frun llama2_7b.bin\n    ```\n\n3.  
**对话模式** (针对 Chat 模型):\n    ```bash\n    # 先导出 chat 模型\n    python export.py llama2_7b_chat.bin --meta-llama \u002Fpath\u002Fto\u002F7B-chat\n    \n    # 以对话模式运行\n    .\u002Frun llama2_7b_chat.bin -m chat\n    ```","一位嵌入式开发者需要在资源受限的物联网网关上部署一个能生成简单设备故障报告的本地 AI 助手，且无法依赖庞大的 Python 环境或联网服务。\n\n### 没有 llama2.c 时\n- **环境依赖沉重**：必须移植完整的 Python 解释器及 PyTorch 库到嵌入式 Linux，占用数百 MB 存储空间，远超硬件限制。\n- **推理延迟过高**：通用框架在低算力 CPU 上运行缓慢，生成一条简短报告需数秒，无法满足实时交互需求。\n- **部署流程复杂**：需要配置复杂的虚拟环境、处理版本兼容性问题，维护成本极高。\n- **内存开销巨大**：加载标准模型往往需要 GB 级内存，导致设备频繁触发 OOM（内存溢出）崩溃。\n\n### 使用 llama2.c 后\n- **极致轻量部署**：仅需一个约 700 行的纯 C 文件（run.c）和编译后的模型二进制文件，无需任何外部依赖，总占用仅几十 MB。\n- **原生高性能推理**：利用纯 C 直接调用硬件指令，在同样的低功耗芯片上实现了每秒上百 token 的流畅生成速度。\n- **一键编译运行**：通过简单的 `make` 命令即可完成编译，直接嵌入现有 C\u002FC++ 固件工程，大幅简化集成流程。\n- **可控内存占用**：支持加载 15M 至 42M 参数量的小型模型，将运行内存控制在几 MB 以内，确保系统稳定运行。\n\nllama2.c 通过将大模型推理浓缩为单文件纯 C 实现，让资源受限的边缘设备也能轻松拥有本地化的智能文本生成能力。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fkarpathy_llama2.c_89ceaf42.png","karpathy","Andrej","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fkarpathy_75f033eb.jpg","I like to train Deep Neural Nets on large datasets.",null,"Stanford","andrej.karpathy@gmail.com","https:\u002F\u002Ftwitter.com\u002Fkarpathy","https:\u002F\u002Fgithub.com\u002Fkarpathy",[82,86,90,94,98],{"name":83,"color":84,"percentage":85},"C","#555555",52.3,{"name":87,"color":88,"percentage":89},"Python","#3572A5",44.1,{"name":91,"color":92,"percentage":93},"Jupyter Notebook","#DA5B0B",2.3,{"name":95,"color":96,"percentage":97},"Makefile","#427819",1.3,{"name":99,"color":100,"percentage":101},"Batchfile","#C1F12E",0,19392,2503,"2026-04-14T02:16:48","MIT",4,"Linux, macOS","不需要 GPU，纯 CPU 运行（基于 C 语言实现）","运行 7B 模型需约 26GB+ RAM（fp32），小模型（15M-110M）仅需少量内存",{"notes":111,"python":112,"dependencies":113},"该项目核心推理引擎 (run.c) 为纯 C 编写，无外部依赖，可直接编译运行。若需使用 Meta 官方 Llama 2 模型，需先通过 Python 脚本 (export.py) 将权重转换为二进制格式，此过程需安装 PyTorch 等依赖。默认使用 fp32 精度，运行 7B 以上模型速度较慢且受整数指针算术限制可能无法运行；支持 int8 量化 (runq.c) 以减少内存占用并提升速度。在 M1 Mac 上小模型可达 110 tokens\u002Fs，7B 模型在多线程 CPU 上约 4 tokens\u002Fs。","未说明（导出脚本需 Python 环境）",[114,115,116],"PyTorch (用于训练\u002F导出)","numpy (隐含依赖)","OpenMP (可选，用于加速)",[35,14],"2026-03-27T02:49:30.150509","2026-04-14T20:44:53.080698",[121,126,131,136,141,146,151],{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},33416,"如何在 Windows 上编译运行（遇到 sys\u002Fmman.h 缺失错误）？","`sys\u002Fmman.h` 是 Unix 特有的头文件，Windows 原生不支持。请拉取最新代码，维护者已合并了使 `run.c` 兼容 Windows 的 PR（如 #96），现在可以直接在 Windows 上构建和运行。","https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllama2.c\u002Fissues\u002F80",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},33414,"如何在没有 GPU 的机器上运行代码（遇到 ProcessGroupNCCL 错误）？","该错误通常是因为使用了旧版本的代码。请尝试拉取最新的代码更新，维护者已修复了在没有 GPU 的环境下运行的问题。此外，相关 PR 已禁用了 MPS 后端以提高兼容性。","https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllama2.c\u002Fissues\u002F70",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},33415,"设置 --compile=True 进行训练时崩溃（KeyError: torch.complex64）怎么办？","此问题与 PyTorch Inductor 对复数操作的支持有关。解决方案是使用 PyTorch 的最新开发版本（tip-of-tree），它会自动为不支持的操作提供回退机制并输出警告，从而允许训练继续进行。","https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllama2.c\u002Fissues\u002F53",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},33417,"使用中文数据集训练后推理出现乱码如何解决？","乱码通常由编码或分词器引起。建议检查以下几点：1. 确保终端编码正确（Windows CMD 可尝试 `chcp 65001`，但 PowerShell 可能无效）；2. 核心原因可能是 `sentencepiece` 分词器对多语言支持的问题，建议参考相关视频教程确认分词器配置；3. 
确保数据预处理和训练代码正确处理了 UTF-8 字符。","https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllama2.c\u002Fissues\u002F158",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},33418,"运行新训练的模型时出现段错误（Segmentation fault）是什么原因？","这通常不是代码本身的问题，而是生成的权重文件（.bin）损坏或保存不正确导致的。建议重新生成模型文件，或者重新安装项目环境和 Python 虚拟环境来排除环境问题。","https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllama2.c\u002Fissues\u002F237",{"id":147,"question_zh":148,"answer_zh":149,"source_url":150},33419,"如何使用提示词（Prompt）时避免输出奇怪的结果（如重复箭头符号）？","如果带提示词的效果不如不带提示词，可能是因为示例长度不一致或填充（padding）策略不当。虽然尝试用 eos_id 填充可能无明显改善，但该问题已在主分支（master）中通过代码合并得到修复，请确保使用最新版本的代码进行测试。","https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllama2.c\u002Fissues\u002F204",{"id":152,"question_zh":153,"answer_zh":154,"source_url":155},33420,"如何运行 Llama-2-7b 模型？","项目支持运行 Llama-2-7b 模型。首先需要将模型转换为项目所需的 .bin 格式，然后使用 `.\u002Frun` 命令加载。例如：`OMP_NUM_THREADS=4 .\u002Frun ..\u002Fllama\u002Fllama2_7b.bin 0.6 128`。注意早期版本中 tok\u002Fs 的统计可能存在误报，已后续修复。","https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fllama2.c\u002Fissues\u002F46",[]]
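附注：前文快速上手部分提到的采样参数 `-t`（温度）与 `-p`（top-p，核采样），其原理可以用下面这段独立的 C 示意代码来理解（命名与实现细节均为示意，并非 run.c 的逐字摘录）：

```c
#include <math.h>
#include <stdlib.h>

// 先对 logits 应用温度再做 softmax：温度越低分布越尖锐（输出越保守），越高越随机。
void softmax_with_temperature(const float* logits, float* probs, int n, float t) {
    float maxv = logits[0];
    for (int i = 1; i < n; i++) if (logits[i] > maxv) maxv = logits[i];
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        probs[i] = expf((logits[i] - maxv) / t);
        sum += probs[i];
    }
    for (int i = 0; i < n; i++) probs[i] /= sum;
}

typedef struct { float p; int idx; } ProbIndex;

static int cmp_desc(const void* a, const void* b) {
    float pa = ((const ProbIndex*)a)->p, pb = ((const ProbIndex*)b)->p;
    return (pa < pb) - (pa > pb);
}

// top-p（核采样）：只在累计概率刚好达到 p 的最小 token 集合内采样，
// 从而剔除长尾中的低概率 token，避免偶尔蹦出“乱码”。r 为 [0,1) 的均匀随机数。
int sample_top_p(const float* probs, int n, float p, float r) {
    ProbIndex* buf = malloc(n * sizeof(ProbIndex));
    for (int i = 0; i < n; i++) { buf[i].p = probs[i]; buf[i].idx = i; }
    qsort(buf, n, sizeof(ProbIndex), cmp_desc);
    float cum = 0.0f;
    int last = n - 1;
    for (int i = 0; i < n; i++) {
        cum += buf[i].p;
        if (cum >= p) { last = i; break; }
    }
    float target = r * cum, acc = 0.0f;
    int choice = buf[last].idx;
    for (int i = 0; i <= last; i++) {
        acc += buf[i].p;
        if (acc >= target) { choice = buf[i].idx; break; }
    }
    free(buf);
    return choice;
}
```

温度与 top-p 配合使用（例如默认的 `-t 1.0 -p 0.9`）可以在多样性与连贯性之间取得平衡，这也是前文推荐该组合的原因。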