[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-therealoliver--Deepdive-llama3-from-scratch":3,"tool-therealoliver--Deepdive-llama3-from-scratch":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
<p align="center">
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_c0e3fb6c4846.png" width="600px"/>
</p>

<h1 align="center">Deepdive-llama3-from-scratch</h1>

<p align="center">
    <a href="https://github.com/therealoliver/Deepdive-llama3-from-scratch/blob/main/LICENSE"><img src="https://img.shields.io/github/license/therealoliver/Deepdive-llama3-from-scratch" alt="License"></a>
    <a href="https://github.com/therealoliver/Deepdive-llama3-from-scratch/stargazers"><img src="https://img.shields.io/github/stars/therealoliver/Deepdive-llama3-from-scratch" alt="GitHub stars"></a>
    <a href="#from_me"><img src="https://img.shields.io/badge/☕%20Buy%20me%20a%20coffee-ff69b4" alt="Buy me a coffee"></a>
</p>

<h3 align="center">
    <p>
        <b>[ View in English | <a href="https://github.com/therealoliver/Deepdive-llama3-from-scratch/blob/main/README_zh.md">中文版文档点这里</a> ]</b>
    </p>
</h3>

---

This project is an enhanced version based on [naklecha/llama3-from-scratch](https://github.com/naklecha/llama3-from-scratch). It has been comprehensively improved and optimized on the basis of the original project, aiming to help everyone more easily understand and master the implementation principles and the detailed reasoning process of the Llama3 model. Thanks to the contributions of the original author :)
<br><br>
<h3>
The following are the core improvements of this project:
</h3>

1. **Structural Optimization**  
   The presentation sequence of the content has been rearranged, and the directory structure has been adjusted to make the learning process clearer and more reasonable, making it easier to understand the code step by step.

2. **Code Annotations**  
   A large number of detailed code annotations have been added to explain what each piece of code does. Even beginners can get started easily.

3. **Dimension Tracking**  
   The changes in the matrix dimensions in each step of the calculation are fully annotated, making it easier for you to understand the entire process.

4. **Principle Explanation**  
   Abundant principle-related explanations and detailed derivations have been added. It not only tells you "what to do" but also explains "why to do it", helping you fundamentally master the design concepts of the model.

5. **KV-Cache Insights**  
   An additional derivation chapter on KV-Cache has been added, covering the core concepts, principle derivations, and the application process in the attention mechanism, allowing you to understand every detail and design philosophy of KV-Cache from its roots.

6. **Bilingual Documents**  
   Code files in both Chinese and English are provided. The native Chinese translation avoids the inaccurate expressions caused by machine translation.
<br><br>

---

<h2 align="center">Table of Contents</h2>

- [Loading the model](#loading-the-model)
  - [Loading the tokenizer](#loading-the-tokenizer)
  - [Reading model files and configuration files](#reading-model-files-and-configuration-files)
    - [Inferring model details using the configuration file](#inferring-model-details-using-the-configuration-file)
- [Convert the input text into embeddings](#convert-the-input-text-into-embeddings)
  - [Convert the text into a sequence of token ids](#convert-the-text-into-a-sequence-of-token-ids)
  - [Convert the sequence of token ids into embeddings](#convert-the-sequence-of-token-ids-into-embeddings)
- [Build the first Transformer block](#build-the-first-transformer-block)
  - [Normalization](#normalization)
    - [Using RMS normalization for embeddings](#using-rms-normalization-for-embeddings)
  - [Implementing the single-head attention mechanism from scratch](#implementing-the-single-head-attention-mechanism-from-scratch)
    - [Obtain the QKV vectors corresponding to the input tokens](#obtain-the-qkv-vectors-corresponding-to-the-input-tokens)
      - [Obtain the query vector](#obtain-the-query-vector)
        - [Unfold the query weight matrix](#unfold-the-query-weight-matrix)
        - [Obtain the first head](#obtain-the-first-head)
        - [Multiply the token embeddings by the query weights to obtain the query vectors corresponding to the tokens](#multiply-the-token-embeddings-by-the-query-weights-to-obtain-the-query-vectors-corresponding-to-the-tokens)
      - [Obtain the key vector (almost the same as the query vector)](#obtain-the-key-vector-almost-the-same-as-the-query-vector)
      - [Obtain the value vector (almost the same as the key vector)](#obtain-the-value-vector-almost-the-same-as-the-key-vector)
    - [Add positional information to the query and key vectors](#add-positional-information-to-the-query-and-key-vectors)
      - [Rotary Position Encoding (RoPE)](#rotary-position-encoding-rope)
      - [Add positional information to the query vectors](#add-positional-information-to-the-query-vectors)
      - [Add positional information to the key vectors (same as the query)](#add-positional-information-to-the-key-vectors-same-as-the-query)
    - [Everything's ready. Let's start calculating the attention weights between tokens.](#everythings-ready-lets-start-calculating-the-attention-weights-between-tokens)
      - [Multiply the query and key vectors to obtain the attention scores.](#multiply-the-query-and-key-vectors-to-obtain-the-attention-scores)
      - [Now we must mask the future query-key scores.](#now-we-must-mask-the-future-query-key-scores)
      - [Calculate the final attention weights, that is, softmax(score).](#calculate-the-final-attention-weights-that-is-softmaxscore)
    - [Finally! Calculate the final result of the single-head attention mechanism!](#finally-calculate-the-final-result-of-the-single-head-attention-mechanism)
  - [Calculate the multi-head attention mechanism (a simple loop to repeat the above process)](#calculate-the-multi-head-attention-mechanism-a-simple-loop-to-repeat-the-above-process)
    - [Calculate the result for each head](#calculate-the-result-for-each-head)
    - [Merge the results of each head into a large matrix](#merge-the-results-of-each-head-into-a-large-matrix)
    - [Head-to-head information interaction (linear mapping), the final step of the self-attention layer!](#head-to-head-information-interaction-linear-mapping-the-final-step-of-the-self-attention-layer)
  - [Perform the residual operation (add)](#perform-the-residual-operation-add)
  - [Perform the second normalization operation](#perform-the-second-normalization-operation)
  - [Perform the calculation of the FFN (Feed-Forward Neural Network) layer](#perform-the-calculation-of-the-ffn-feed-forward-neural-network-layer)
  - [Perform the residual operation again (Finally, we get the final output of the Transformer block!)](#perform-the-residual-operation-again-finally-we-get-the-final-output-of-the-transformer-block)
- [Everything is here. Let's complete the calculation of all 32 Transformer blocks. Happy reading :)](#everything-is-here-lets-complete-the-calculation-of-all-32-transformer-blocks-happy-reading-)
- [Let's complete the last step and predict the next token](#lets-complete-the-last-step-and-predict-the-next-token)
  - [First, perform one last normalization on the output of the last Transformer layer](#first-perform-one-last-normalization-on-the-output-of-the-last-transformer-layer)
  - [Then, make the prediction based on the embedding corresponding to the last token (perform a linear mapping to the vocabulary dimension)](#then-make-the-prediction-based-on-the-embedding-corresponding-to-the-last-token-perform-a-linear-mapping-to-the-vocabulary-dimension)
  - [Here's the prediction result!](#heres-the-prediction-result)
- [Let's dive deeper and see how different embeddings or token masking strategies might affect the prediction results :)](#lets-dive-deeper-and-see-how-different-embeddings-or-token-masking-strategies-might-affect-the-prediction-results-)
- [Need to predict multiple tokens? Just using KV-Cache! (It really took me a lot of effort to sort this out. Orz)](#need-to-predict-multiple-tokens-just-using-kv-cache-it-really-took-me-a-lot-of-effort-to-sort-this-out-orz)
- [Thank you all. Thanks for your continuous learning. Love you all :)](#thank-you-all-thanks-for-your-continuous-learning-love-you-all-)
  - [From Me](#from-me)
  - [From the author of predecessor project](#from-the-author-of-predecessor-project)
- [LICENSE](#license)

---

<h3>
Now, let's start the formal learning process!
</h3>
<br>

In this file, I implemented Llama3 from scratch, one tensor and matrix multiplication at a time.
<br>
Also, I'm going to load tensors directly from the model file that Meta provided for Llama3 (Meta-Llama-3-8B), so you need to download the weights before running this file. Here is the official link to download the weights: https://llama.meta.com/llama-downloads/
<br><br>
Note 1: This project has adopted the model file download method based on Huggingface. You will see it in the loading model section below (a minimal sketch also follows right after these notes). Similarly, you can also download the models directly from the official website, ModelScope, or other model download sources without running the model download code below.
<br>
Note 2: This project uses the original model files, that is, the models in the "original" folder of the downloaded model files.
<br><br>
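The actual download cell appears in the loading section below; as a quick reference only, here is a minimal sketch of the Huggingface route mentioned in Note 1. It assumes the gated `meta-llama/Meta-Llama-3-8B` repository (you must be granted access and logged in first) and pulls only the `original/` weights that this project uses:

```python
# Minimal sketch (not the project's own download cell): fetch only the
# original-format weights from Huggingface. Assumes access to the gated
# meta-llama repo and a prior `huggingface-cli login`.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="meta-llama/Meta-Llama-3-8B",
    allow_patterns=["original/*"],   # only the files in the "original" folder
    local_dir="Meta-Llama-3-8B",     # matches the paths used in the code below
)
```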
<h3>
    Please Note! There is a small mistake in the figure:<br>
    <h4>
        In each Transformer block, the input of the second "add" operation should be the output of the feed-forward layer and the output of the first "add" operation, instead of the result after normalization.
        <br>
        If we consider multi-head self-attention and feed-forward as the same type of operation (both perform feature transformation), then the forms and processes of the two "normalization - feature transformation - residual connection (add)" stages are exactly the same.
    </h4>
</h3>

<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_a048f83904eb.png"/>
</div>

# Loading the model
## Loading the tokenizer

The tokenizer is used to split the input text string into a sequence of sub-words, making it easier to input to the model.
<br>
I'm not going to implement a BPE tokenizer (but Andrej Karpathy has a really clean implementation); here is the link to his implementation: https://github.com/karpathy/minbpe

<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_c52ffe6786ea.png" width="600"/>
</div>
<br><br>

<h3>Summary of the steps to load the BPE-based tokenizer:</h3>

1. Loading regular words: Load the local tokenizer model dictionary (which only contains regular subwords and no special tokens).
2. Definition of the special words: Manually define special tokens (using ready-made ones or modifying based on the ready-made ones).
3. Definition of the text rough-splitting rule: Define the regular expression for rough-splitting the text (just using a ready-made one). The input goes through two steps, rough-splitting (based on the regular expression) and fine-splitting (based on BPE), to obtain the final tokenization result.
4. Create tokenizer: Create a text encoder-decoder object based on the open-sourced tiktoken library by OpenAI (which further splits the rough-splitting result based on the BPE algorithm).


```python
# Loading the BPE-based Tokenizer

# Import related libraries
from pathlib import Path  # Used to obtain the file name/model name from the file path
import tiktoken  # An open-source library developed by OpenAI for text encoding and decoding (mutual conversion between text and token ids)
from tiktoken.load import load_tiktoken_bpe  # Load the BPE model
import torch  # Used for building models and matrix calculations
import json  # Used for loading configuration files
import matplotlib.pyplot as plt  # Used for plotting graphs


tokenizer_path = "Meta-Llama-3-8B/original/tokenizer.model"  # Path to the tokenizer model

# Special tokens outside the regular dictionary.
# These special tokens are present in the 'added_tokens' field of both 'tokenizer.json' and 'tokenizer_config.json' in the "Meta-Llama-3-8B/" path
special_tokens = [
            "<|begin_of_text|>",
            "<|end_of_text|>",
            "<|reserved_special_token_0|>",  # Reserved special tokens from 0 to 250
            "<|reserved_special_token_1|>",
            "<|reserved_special_token_2|>",
            "<|reserved_special_token_3|>",
            "<|start_header_id|>",  # Start of header information, used to mark the header information that wraps structured data, such as metadata
            "<|end_header_id|>",  # End of header information
            "<|reserved_special_token_4|>",
            "<|eot_id|>",  # End of turn, used to mark the end of the current turn in multi-turn conversations
        ] + [f"<|reserved_special_token_{i}|>" for i in range(5, 256 - 5)]


# Load the BPE model (actually a dictionary)
# A dictionary of subword(bytes type, decoded with utf-8)-rank(id) pairs, with 128000 words, not including the 256 special tokens above,
# so the total size of the model's dictionary becomes 128256 after the special tokens are added (but not yet at this point)
# The rank values are an increasing sequence starting from 0, used to determine the priority order of subword unit merging;
# the higher the priority, the earlier the merging. Therefore, the variable name here is "mergeable_ranks" instead of something like BPE or word dictionary
# The special tokens are not added to the dictionary, probably for flexibility:
# it makes it easy to add specific tokens for different model architectures or tasks with different special tokens, while keeping the dictionary size unchanged
mergeable_ranks = load_tiktoken_bpe(tokenizer_path)


# Create a text encoder-decoder object
# The pat_str roughly matches three types of segments: words with abbreviations & words, Chinese segments, and 1-3-digit numbers & other special characters
tokenizer = tiktoken.Encoding(
    name=Path(tokenizer_path).name,  # Name of the encoder, which is convenient for using different encoders when debugging and logging
    pat_str=r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+",  # Regular expression for initially roughly splitting the text into a token sequence
    mergeable_ranks=mergeable_ranks,  # Pass in the loaded BPE model
    special_tokens={token: len(mergeable_ranks) + i for i, token in enumerate(special_tokens)},  # Dictionary for adding special token-id pairs
)


# Test whether the creation is successful, that is, whether the encoder-decoder runs correctly
print(tokenizer.decode(tokenizer.encode("create tokenizer succeeded!")))


# The following is a test case to show the effects of, and differences between, the rough splitting of pat_str and the fine splitting of the tokenizer
# The regular expression of pat_str only provides a preliminary splitting;
# some long sentences or Chinese text will not be split and will be further refined based on the BPE algorithm in the tokenizer
import regex  # Since some Unicode syntax such as \p{L} is used in pat_str, the re library cannot be used

## Create a regular expression
pat_str = r"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\r\n\p{L}\p{N}]?\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
pattern = regex.compile(pat_str)

## Text segmentation
text = "Hello world! It's a test. 这是一个测试. alongwords. a long words. 123 456 789."  # Testing string
re_tokens = pattern.findall(text)  # Split the string using the regular expression
merge_tokens_id = tokenizer.encode(text)  # Split the string using the tokenizer
merge_tokens = [tokenizer.decode([i]) for i in merge_tokens_id]  # Convert the id sequence of the tokenizer's splitting result into an actual subword sequence

## Output result
print("Original string: ", text)
print("Regular expression splitting result: ", re_tokens)
print("Tokenizer splitting result: ", merge_tokens)
print("Tokenizer splitting result ids: ", list(zip(merge_tokens, merge_tokens_id)))

## From the results, it can be seen that the leading spaces of all words are retained, rather than being merged into a single space token or deleted.
## This is beneficial for the model to correctly understand the boundary information between words, such as 'alongwords' in the example.
```

    create tokenizer succeeded!
    Original string:  Hello world! It's a test. 这是一个测试. alongwords. a long words. 123 456 789.
    Regular expression splitting result:  ['Hello', ' world', '!', ' It', "'s", ' a', ' test', '.', ' 这是一个测试', '.', ' alongwords', '.', ' a', ' long', ' words', '.', ' ', '123', ' ', '456', ' ', '789', '.']
    Tokenizer splitting result:  ['Hello', ' world', '!', ' It', "'s", ' a', ' test', '.', ' 这', '是一个', '测试', '.', ' along', 'words', '.', ' a', ' long', ' words', '.', ' ', '123', ' ', '456', ' ', '789', '.']
    Tokenizer splitting result ids:  [('Hello', 9906), (' world', 1917), ('!', 0), (' It', 1102), ("'s", 596), (' a', 264), (' test', 1296), ('.', 13), (' 这', 122255), ('是一个', 122503), ('测试', 82805), ('.', 13), (' along', 3235), ('words', 5880), ('.', 13), (' a', 264), (' long', 1317), (' words', 4339), ('.', 13), (' ', 220), ('123', 4513), (' ', 220), ('456', 10961), (' ', 220), ('789', 16474), ('.', 13)]
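One detail worth knowing: by default, tiktoken refuses to encode text that contains special-token strings (it raises a `ValueError`), which is why the prompt cell later prepends the id 128000 manually instead of writing `<|begin_of_text|>` in the string. A small sketch of this behavior (not a cell from the original notebook):

```python
# Special-token strings must be explicitly allowed when encoding,
# otherwise tiktoken raises a ValueError to guard against accidental injection.
ids = tokenizer.encode("<|begin_of_text|>Hello", allowed_special={"<|begin_of_text|>"})
print(ids[0])  # 128000, i.e., len(mergeable_ranks) + 0, matching the special_tokens mapping above
```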
## Reading model files and configuration files

Generally, reading a model file depends on how its model class is written and the variable names within it.
<br>
However, since we are implementing Llama3 from scratch, we will read one tensor file at a time.
<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_76842a6c1f45.png" width="600"/>
</div>


```python
# Load the model, a dictionary such as {"network-layer-name": tensor-type parameters}
model = torch.load("Meta-Llama-3-8B/original/consolidated.00.pth")

# Print the names of the first 20 network layers to verify that the model is loaded correctly.
print(json.dumps(list(model.keys())[:20], indent=4))
```

    [
        "tok_embeddings.weight",
        "layers.0.attention.wq.weight",
        "layers.0.attention.wk.weight",
        "layers.0.attention.wv.weight",
        "layers.0.attention.wo.weight",
        "layers.0.feed_forward.w1.weight",
        "layers.0.feed_forward.w3.weight",
        "layers.0.feed_forward.w2.weight",
        "layers.0.attention_norm.weight",
        "layers.0.ffn_norm.weight",
        "layers.1.attention.wq.weight",
        "layers.1.attention.wk.weight",
        "layers.1.attention.wv.weight",
        "layers.1.attention.wo.weight",
        "layers.1.feed_forward.w1.weight",
        "layers.1.feed_forward.w3.weight",
        "layers.1.feed_forward.w2.weight",
        "layers.1.attention_norm.weight",
        "layers.1.ffn_norm.weight",
        "layers.2.attention.wq.weight"
    ]



```python
# Load the configuration file.
# The specific meaning of each configuration is described in the next section.
with open("Meta-Llama-3-8B/original/params.json", "r") as f:
    config = json.load(f)
config
```




    {'dim': 4096,
     'n_layers': 32,
     'n_heads': 32,
     'n_kv_heads': 8,
     'vocab_size': 128256,
     'multiple_of': 1024,
     'ffn_dim_multiplier': 1.3,
     'norm_eps': 1e-05,
     'rope_theta': 500000.0}



### Inferring model details using the configuration file

| Configuration Item | Configuration Value | Meaning |
| ---- | ---- | ---- |
| dim | 4096 | Dimension of the hidden layer, i.e., the vector representation of each token has a dimension of 4096. |
| n_layers | 32 | Number of model layers, i.e., the model has 32 Transformer layers, or say Transformer blocks. |
| n_heads | 32 | Number of heads in multi-head attention, i.e., each multi-head attention block has 32 heads. The so-called multi-head means that multiple independent attention mechanisms are used simultaneously to capture different features or information of the input data. |
| n_kv_heads | 8 | Number of key-value heads, used for Grouped Query Attention (GQA). That is, the key and value projections have 8 heads, while the query has n_heads=32 heads, so every 4 query heads share a set of key-value pairs. |
| vocab_size | 128256 | Size of the vocabulary, including 128000 ordinary tokens and 256 special tokens. |
| multiple_of | 1024 | Multiple constraint on the hidden dimension of the feed-forward layer. That is, the FFN hidden dimension should be a multiple of 1024 to optimize computational efficiency. |
| ffn_dim_multiplier | 1.3 | Multiplier for the hidden dimension of the feed-forward network layer, used to calculate the hidden dimension of the FFN. The calculation process can be seen in the corresponding section (and in the sketch right after this table). |
| norm_eps | 1e-05 | Constant added to the denominator in layer normalization to prevent division by zero and ensure numerical stability. |
| rope_theta | 500000.0 | Base frequency scaling factor in Rotary Position Encoding (RoPE), which controls the periodicity and resolution of the position encoding, thus affecting the model's ability to capture different sequence lengths and positional relationships. |
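As a quick sanity check of the last three rows, here is a sketch of how these values produce the 14336 that shows up in the `feed_forward.w1` weight shape later. The rounding scheme follows Meta's Llama reference implementation, which is an assumption worth verifying against the FFN section of this walkthrough:

```python
# Sketch: derive the FFN hidden dimension from the config values,
# mirroring the rounding used in Meta's Llama reference code.
dim, ffn_dim_multiplier, multiple_of = 4096, 1.3, 1024

hidden_dim = 4 * dim                               # 16384, the conventional 4x expansion
hidden_dim = int(2 * hidden_dim / 3)               # 10922, compensates for SwiGLU's third matrix
hidden_dim = int(ffn_dim_multiplier * hidden_dim)  # 14198
hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)  # round up to a multiple of 1024
print(hidden_dim)  # 14336, matching layers.0.feed_forward.w1.weight: [14336, 4096]
```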
<br>

<h3>
Based on the configuration details, the internal calculation process of attention given an input can be inferred as follows:
</h3>

<pre>
input(L, 4096) -> query_proj(L, 128, 32)
               -> key_proj(L, 128, 8)
               -> value_proj(L, 128, 8)
                                           -> group_query_attention(L, 128, 32)
                                           -> output_proj(L, 4096)
                                                                                   -> output(L, 4096)
</pre>
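To make the grouped-query step concrete, here is a small sketch (random tensors with the shapes from the table, not code from this project) of how 8 key heads serve 32 query heads: each key/value head is simply reused 4 times so that the per-head products line up. The walkthrough below achieves the same sharing by reusing each kv head across its 4 query heads in a loop.

```python
import torch

L, n_heads, n_kv_heads, head_dim = 17, 32, 8, 128

q = torch.randn(n_heads, L, head_dim)     # 32 query heads
k = torch.randn(n_kv_heads, L, head_dim)  # only 8 key heads

# Each kv head is shared by n_heads // n_kv_heads = 4 query heads:
# repeat_interleave copies head 0 for query heads 0-3, head 1 for heads 4-7, etc.
k_expanded = k.repeat_interleave(n_heads // n_kv_heads, dim=0)  # [8, L, 128] -> [32, L, 128]

scores = q @ k_expanded.transpose(-2, -1) / head_dim ** 0.5  # one [L, L] score matrix per query head
print(scores.shape)  # torch.Size([32, 17, 17])
```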
```python
# Record these configurations, which will be used gradually later.
dim = config["dim"]
n_layers = config["n_layers"]
n_heads = config["n_heads"]
n_kv_heads = config["n_kv_heads"]
vocab_size = config["vocab_size"]
multiple_of = config["multiple_of"]
ffn_dim_multiplier = config["ffn_dim_multiplier"]
norm_eps = config["norm_eps"]
rope_theta = torch.tensor(config["rope_theta"])
```

# Convert the input text into embeddings

Before inputting the text in string form to the network layers, it needs to be converted into vector form for mathematical calculations.
<br>
The required process is: use the tokenizer to split the input text into a subword sequence -> convert the subwords into vector representations.

## Convert the text into a sequence of token ids
Here, we use tiktoken (a library from OpenAI) as the tokenizer.
<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_c4ace96c20b0.png" width="600"/>
</div>


```python
# Convert the input prompt into a sequence of token ids
prompt = "the answer to the ultimate question of life, the universe, and everything is "  # Input text
tokens = [128000] + tokenizer.encode(prompt)  # Perform subword segmentation and prepend the special token <|begin_of_text|> indicating the start of the text. Dimension: [17]
print(tokens)  # Check the segmentation result
tokens = torch.tensor(tokens)  # Convert to tensor type for subsequent matrix calculations. [17]

# Convert the token ids back into the corresponding sequence of token subwords, which is only for display purposes and not actually needed
prompt_split_as_tokens = [tokenizer.decode([token]) for token in tokens]
print(prompt_split_as_tokens)
```

    [128000, 1820, 4320, 311, 279, 17139, 3488, 315, 2324, 11, 279, 15861, 11, 323, 4395, 374, 220]
    ['<|begin_of_text|>', 'the', ' answer', ' to', ' the', ' ultimate', ' question', ' of', ' life', ',', ' the', ' universe', ',', ' and', ' everything', ' is', ' ']


## Convert the sequence of token ids into embeddings

Sorry, this is the only part in this codebase where I use built-in neural network modules.
<br>
In short, our original [17x1] token sequence is now [17x4096], that is, 17 embeddings of length 4096 (one for each token).
<br><br>
Note: Pay attention to the change in the shape of this tensor; it will make it easier for you to understand the entire process (and I will annotate the shape changes in all steps).

<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_533946884ac0.png" width="600"/>
</div>


```python
# Create an embedding network layer to map discrete token ids to a continuous vector space
embedding_layer = torch.nn.Embedding(vocab_size, dim)

# Update the parameters of the embedding network with the pre-trained parameter values in Llama3
embedding_layer.weight.data.copy_(model["tok_embeddings.weight"])

# Use the embedding network to convert the input sequence of token ids into vector representations
# The embedding network only looks up the corresponding vectors based on the ids, like a dictionary lookup, and does not involve interactions between tokens.
# [17] -> [17x4096]
token_embeddings_unnormalized = embedding_layer(tokens).to(torch.bfloat16)  # By default, it is in full-precision float32. Here, we change it to the half-precision format to reduce memory usage.

token_embeddings_unnormalized.shape
```




    torch.Size([17, 4096])



# Build the first Transformer block

From the pre-trained parameters involved in the first Transformer block shown below, it includes:
1. Two normalizations (attention_norm and ffn_norm)
2. Implementation of the attention mechanism (4 attention.w)
3. Implementation of the feed-forward network layer (3 feed_forward.w)
4. (Of course, it also includes two residual connection operations that do not require pre-trained parameters)

In general, the operation process in a Transformer block is as follows:
<br>
Normalization -> Multi-head self-attention -> Residual connection -> Normalization -> Feed-forward neural network -> Residual connection


```python
# Display all the weight parameters and their shapes of the first Transformer block
for k, v in model.items():
    if not k.startswith('layers'):
        continue
    if k.startswith('layers.1'):
        break
    print(k, v.shape)
```

    layers.0.attention.wq.weight torch.Size([4096, 4096])
    layers.0.attention.wk.weight torch.Size([1024, 4096])
    layers.0.attention.wv.weight torch.Size([1024, 4096])
    layers.0.attention.wo.weight torch.Size([4096, 4096])
    layers.0.feed_forward.w1.weight torch.Size([14336, 4096])
    layers.0.feed_forward.w3.weight torch.Size([14336, 4096])
    layers.0.feed_forward.w2.weight torch.Size([4096, 14336])
    layers.0.attention_norm.weight torch.Size([4096])
    layers.0.ffn_norm.weight torch.Size([4096])


There are two points to note here:
1. The shape of a neural network weight matrix is (output dimension, input dimension). During the calculation, the parameter matrix W is transposed to (input dimension, output dimension) and then multiplied by the input X, i.e., the output Y = XW.T. You will see this in the subsequent calculations; a small demo also follows below.
2. Since Llama3 uses the grouped attention mechanism, every 4 query heads share a set of kv vectors (for details, see the configuration file section above). Therefore, the kv weight matrices have shape [1024, 4096], which is 1/4 of that of q ([4096, 4096]).
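Point 1 trips up many readers, so here is a tiny sketch (not from the original project) showing that `torch.nn.Linear` stores its weight as (out_features, in_features) and computes exactly `x @ W.T`:

```python
import torch

linear = torch.nn.Linear(4096, 1024, bias=False)  # in_features=4096, out_features=1024
print(linear.weight.shape)  # torch.Size([1024, 4096]), i.e., (output dim, input dim)

x = torch.randn(17, 4096)
# The module's forward pass and the explicit X @ W.T formulation agree.
assert torch.allclose(linear(x), x @ linear.weight.T)
```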
## Normalization

The normalization operation aims to constrain the scale differences in the data, avoiding issues such as an unstable training process caused by excessive differences in vector values.
<br><br>
After normalization, the shape of the tensor remains [17x4096], the same as that of the embedding.
<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_b6286bc7e59d.png" width="600"/>
</div>

### Using RMS normalization for embeddings

Llama3 uses the Root Mean Square (RMS) normalization method, and its calculation formula is shown in the figure below.
<br>
It should be noted that we need a norm_eps parameter (from the configurations) because we don't want to accidentally set the RMS to 0 and perform a division by zero.
<br>
The formula is as follows:
<div>
<img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_2f575a0b6e6f.png" width="600"/>
</div>
<br>

In addition, you may have noticed the gi parameter in the formula. This is a scaling factor learned during the model training process, used to rescale the normalization result of each dimension to enhance the model's expressive ability. Its dimension is the same as the feature dimension of the embedding, i.e., [4096].
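Since the formula above is embedded as an image, here is a LaTeX transcription that matches the `rms_norm` code defined next (with $n = 4096$ the feature dimension and $\epsilon$ being norm_eps):

$$
\mathrm{RMSNorm}(a)_i = \frac{a_i}{\sqrt{\frac{1}{n}\sum_{j=1}^{n} a_j^2 + \epsilon}} \cdot g_i
$$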
```python
# Define the calculation function for RMS normalization
# Each token will be normalized independently
# norm_weights is the pre-trained scaling factor (i.e., gi in the formula) used to enhance the model's representational ability. It can be loaded from the model file and has 4096 dimensions
# torch.rsqrt calculates the reciprocal of the square root of a tensor, i.e., 1/RMS(a)
def rms_norm(tensor, norm_weights):
    return (tensor * torch.rsqrt(tensor.pow(2).mean(-1, keepdim=True) + norm_eps)) * norm_weights
```


```python
# Normalize the input
token_embeddings = rms_norm(token_embeddings_unnormalized, model["layers.0.attention_norm.weight"])  # [17x4096] & [4096] -> [17x4096]
model["layers.0.attention_norm.weight"].shape, token_embeddings.shape
```




    (torch.Size([4096]), torch.Size([17, 4096]))



## Implementing the single-head attention mechanism from scratch

The multi-head attention calculation in each layer involves 32 heads. However, the calculation processes of these heads are completely identical and independent of each other. Therefore, in this section, we will first implement the single-head attention calculation process, and expand it to the multi-head calculation in the next section.
<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_08965382b82c.png" width="600"/>
</div>
<br><br>

<h3>
The core calculation of the attention mechanism is the formula shown in the following figure.
</h3>

1. We need to obtain the query, key, and value vectors by performing a linear mapping on the input embeddings.
2. Subsequently, based on the QK vectors, we obtain the attention weights between tokens, that is, for each token, the scores of the importance or relevance of the other tokens to it.
3. Finally, based on the attention weights, we weight the value vectors to obtain the attention result corresponding to each token.

<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_8574b10f57e4.png" width="600px"/>
</div>
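The figure shows the standard scaled dot-product attention formula that the three steps above describe; in LaTeX:

$$
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V
$$

Here $d_k = 128$ is the per-head dimension; the $\sqrt{d_k}$ scaling and the causal mask are discussed when the scores are computed below.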
Back to the point. Let's first load the attention heads of the first-layer Transformer.
<br>
&gt; When we load the query, key, value, and output weight matrices from the model (the output weight is used for information fusion among the multiple heads to generate the final attention output), we will notice that their shapes are: [4096x4096], [1024x4096], [1024x4096], [4096x4096].
<br>
&gt; At first glance, this seems strange because ideally, we would like the q, k, v of each head to be independent of each other (in which case their shapes would be: 32x[128x4096], 8x[128x4096], 8x[128x4096]).
<br>
&gt; The author of the code binds them together because this helps parallelize the multiplication calculation of the attention heads.
<br>
&gt; But we will unfold everything...


```python
# Show the shapes of the attention weight matrices of the current q, k, v and o.
print(
    model["layers.0.attention.wq.weight"].shape,  # [4096x4096]
    model["layers.0.attention.wk.weight"].shape,  # [1024x4096]
    model["layers.0.attention.wv.weight"].shape,  # [1024x4096]
    model["layers.0.attention.wo.weight"].shape   # [4096x4096]
)
```

    torch.Size([4096, 4096]) torch.Size([1024, 4096]) torch.Size([1024, 4096]) torch.Size([4096, 4096])


### Obtain the QKV vectors corresponding to the input tokens

In this section, we will convert the input token embeddings into the query, key, and value vectors used in the attention mechanism computation.

#### Obtain the query vector

##### Unfold the query weight matrix

We will first unfold the queries from the multiple attention heads, and the final shape will be [32x128x4096].
<br>
Here, 32 is the number of attention heads in Llama3, 128 is the vector dimension of each query head, and 4096 is the dimension of the token embedding (the embedding dimension sits in the last dimension because the input is multiplied by the transpose of the weight, i.e., Y = XW.T).


```python
# Load and modify the shape of the query weight matrix of layers.0 to unfold it in the form of multiple heads
q_layer0 = model["layers.0.attention.wq.weight"]  # Default shape is [4096x4096]
head_dim = q_layer0.shape[0] // n_heads  # Dimension of each attention head, 4096/32 = 128
q_layer0 = q_layer0.view(n_heads, head_dim, dim)  # Unfolded dimension, [32x128x4096]
q_layer0.shape
```




    torch.Size([32, 128, 4096])



##### Obtain the first head
Here, I access the first head of the query weight matrix of the first layer. The shape of this query weight matrix is [128x4096].


```python
# Extract the weights of the first head
q_layer0_head0 = q_layer0[0]  # [32x128x4096] -> [128x4096]
q_layer0_head0.shape
```




    torch.Size([128, 4096])



##### Multiply the token embeddings by the query weights to obtain the query vectors corresponding to the tokens

Here, you can see that the shape of the result is [17x128]. This is because we have 17 tokens, and for each token there is a query vector of length 128.

<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_832fd0cbc155.png" width="600"/>
</div>


```python
# Calculate the query values of the inputs on the first query head
# Q0_head0 = XW0_Q_head0.T
q_per_token = torch.matmul(token_embeddings, q_layer0_head0.T)  # [17x4096] x [4096x128] = [17x128]
q_per_token.shape
```




    torch.Size([17, 128])



#### Obtain the key vector (almost the same as the query vector)

<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_fb63a357c607.png" width="600px"/>
</div>

I want to be lazy here, so I won't elaborate on the calculation process of the key vectors again. Orz. The only things you need to remember are:
<br>
&gt; The key also generates a 128-dimensional vector.
<br>
&gt; The weight matrix of the key has only 1/4 the number of parameters of that of the query, because each key weight is shared by 4 heads simultaneously to reduce the required amount of calculation.


```python
# Load and modify the shape of the key weight matrix of layers.0 to expand it in a multi-head form
# Different from the query weight matrix, the key has 8 attention heads, so the number of parameters is 1/4 of that of the query matrix
k_layer0 = model["layers.0.attention.wk.weight"]  # [1024x4096]
k_layer0 = k_layer0.view(n_kv_heads, k_layer0.shape[0] // n_kv_heads, dim)  # [8x128x4096]
k_layer0.shape
```




    torch.Size([8, 128, 4096])




```python
# Extract the weights of the first head
k_layer0_head0 = k_layer0[0]  # [8x128x4096] -> [128x4096]
k_layer0_head0.shape
```




    torch.Size([128, 4096])




```python
# Calculate the key vectors corresponding to the inputs of the first head
# K0_head0 = XW0_K_head0.T
k_per_token = torch.matmul(token_embeddings, k_layer0_head0.T)  # [17x4096] x [4096x128] = [17x128]
k_per_token.shape
```




    torch.Size([17, 128])



#### Obtain the value vector (almost the same as the key vector)

<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_8fc2f0557634.png" width="600px"/>
</div>

&gt; Similar to the key weights, the value weights are also shared by every 4 attention heads (to save computation).
<br>
&gt; Therefore, the shape of the value weight matrix is [8x128x4096].


```python
# Load and modify the shape of the value weight matrix of layers.0 to expand it in a multi-head form
# Similar to the key weight matrix, the value also has 8 attention heads, so the number of parameters is also 1/4 of that of the query matrix
v_layer0 = model["layers.0.attention.wv.weight"]  # [1024x4096]
v_layer0 = v_layer0.view(n_kv_heads, v_layer0.shape[0] // n_kv_heads, dim)  # [1024x4096] -> [8x128x4096]
v_layer0.shape
```




    torch.Size([8, 128, 4096])




```python
# Extract the weights of the first head
v_layer0_head0 = v_layer0[0]  # [8x128x4096] -> [128x4096]
v_layer0_head0.shape
```




    torch.Size([128, 4096])




```python
# Calculate the value vectors corresponding to the inputs of the first head
# V0_head0 = XW0_V_head0.T
v_per_token = torch.matmul(token_embeddings, v_layer0_head0.T)  # [17x4096] x [4096x128] = [17x128]
v_per_token.shape
```




    torch.Size([17, 128])
### Add positional information to the query and key vectors

- For natural language, the sequential relationships and relative positions between words are extremely important. For example, "The dog bites the man" and "The man bites the dog" carry completely different semantic information. Moreover, our intuition also tells us that the correlation between nearby words is usually greater than that between distant words.
- Therefore, we need to provide positional information between tokens during the attention calculation process, so that the model can better capture the dependencies in the sequence.
- Why add it to the query and key vectors? Because the query and key vectors are used to calculate the attention weights, i.e., the importance of each token to the other tokens. This requires both of them to know the positions and relative positional relationships of any two tokens when calculating the similarity between them.
- Why not add it to the value vectors? Because the value vectors are only used for the weighted summation, and the positional information has already been taken into account in the interaction between the query and the key. Therefore, the value vectors only need to provide content information.

We will use RoPE (Rotary Position Encoding) to add positional information to these vectors.

#### Rotary Position Encoding (RoPE)

You can watch this video to understand its mathematical principles in detail (this is also the one I watched):
https://www.youtube.com/watch?v=o29P0Kpobz0&t=530s
<br><br>
The general idea of RoPE is to regard each vector as living in a complex space, and then generate a specific rotation matrix based on the position. By multiplying the vector with the rotation matrix, a rotation in the complex space is achieved, thereby adding the relative position information to the vector. (That is, the positional relationship of the input vectors is expressed as rotations by different angles in a complex space.)
<br>
(Similar to the rotation of planar position coordinates around an axis through the multiplication of trigonometric-function-based matrices in robot kinematics.)
<br><br>
RoPE is usually applied to the query and key vectors in the self-attention mechanism. When calculating the attention scores, the query and key vectors are first rotated based on the corresponding RoPE rotation matrix. Then, operations such as the dot-product calculation and softmax normalization are performed. In this way, the Transformer can take positional information into account when calculating attention and better capture the dependencies in the text.

<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_fde9bd425df6.png" width="600"/>
</div>
<br>

<h3>
The specific calculation process of RoPE is as follows:
</h3>

1. Divide the dimensions of each vector into pairs (because the derivation of high-dimensional rotation matrices is complex and excessively high dimensions significantly increase the computational cost, while the formulas for two-dimensional rotation are mature and simple, making them easy to calculate).
2. For each pair, obtain $\Large \theta=\frac{1}{rope\_theta^{i/D}}$, where $i$ is the $i$-th pair and $D$ is the total number of pairs. That is, the positional information of the current dimension pair within the vector.
3. For each vector, obtain $\Large m$, which indicates that the vector corresponds to the $m$-th token. That is, the positional information of the current vector within the entire input.
4. For each pair, apply ![png](https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_a12f47598060.png), where $res$ is the result of rotating the vector pair by $m\theta$ degrees in the complex space (a LaTeX transcription follows right after this list).
5. Perform the above calculations on all dimension pairs of all vectors to obtain the final RoPE result.

<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_8c5c472c7eb2.png" width="600"/>
</div>
<br>
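The step-4 equation above is embedded as an image; what it denotes, per the surrounding text, is the standard 2D rotation of each dimension pair $(x_i, y_i)$ by the angle $m\theta$:

$$
res = \begin{pmatrix} \cos m\theta & -\sin m\theta \\ \sin m\theta & \cos m\theta \end{pmatrix} \begin{pmatrix} x_i \\ y_i \end{pmatrix}
$$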
<h3>
In the actual code implementation, to simplify the calculation process, the above-mentioned calculation based on the rotation matrix (Step 4) is converted into a calculation in the complex number domain. The principle is as follows:
</h3>

1. The rectangular coordinates $(x, y)$ can be regarded as the coordinate representation of the complex number $\large x + yi$ on the complex plane.
2. The polar form of a complex number can be expressed as $\large re^{i\theta}$, where $r$ is the modulus and $\theta$ is the angle.
3. The multiplication of polar forms $\large r_1e^{i\theta_1} \times r_2e^{i\theta_2} = r_1r_2e^{i(\theta_1 + \theta_2)}$ can be regarded as scaling the length of coordinate_1 by $r_2$ and rotating its angle by $\theta_2$ degrees.
4. Therefore, if you want to rotate the coordinates by $m\theta$ degrees, you can define a rotation factor $\large e^{im\theta}$ with a modulus of 1 and an angle of $m\theta$. Multiplying it by the coordinates is equivalent to the rotation-matrix-based method.
5. In addition, according to Euler's formula, we have $\large re^{i\theta} = r\cos\theta + (r\sin\theta)i = x + yi$ and $\large e^{im\theta} = \cos{m\theta} + (\sin{m\theta})i$.
6. Therefore, rotating a two-dimensional coordinate $(x, y)$ by $m\theta$ degrees can be achieved through $\large re^{i\theta^\prime} \times e^{im\theta} = (x + yi) \times (\cos{m\theta} + (\sin{m\theta})i)$ (the product of two complex numbers); a quick numerical check follows below.
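Here is a small numerical sketch (not from the project) verifying the equivalence that step 6 claims, rotating the pair (1, 0) by mθ = 90 degrees along both routes:

```python
import math
import torch

m_theta = math.pi / 2                         # rotate by 90 degrees
xy = torch.tensor([1.0, 0.0])                 # the dimension pair (x, y)

# Route 1: the explicit 2D rotation matrix
rot = torch.tensor([[math.cos(m_theta), -math.sin(m_theta)],
                    [math.sin(m_theta),  math.cos(m_theta)]])
res_matrix = rot @ xy                         # -> approximately (0, 1)

# Route 2: complex multiplication (x + yi) * (cosmθ + sinmθi),
# the same trick the code below implements with torch.view_as_complex and torch.polar
res_complex = torch.view_as_complex(xy) * torch.polar(torch.tensor(1.0), torch.tensor(m_theta))

print(res_matrix)                             # tensor([-4.3711e-08,  1.0000e+00])
print(torch.view_as_real(res_complex))        # the same point, up to floating-point error
```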
#### Add positional information to the query vectors

In the following steps, we will first split the query vectors into pairs, and then perform the angle rotation on each pair, as shown in the steps above.
<br><br>
Now we have a vector with a shape of [17x64x2]. This is obtained by splitting the 128-dimensional query vectors corresponding to each token in the prompt into 64 pairs, and each pair will be rotated by $m\theta$ degrees.


```python
# Split the query vectors in pairs along the dimension direction.
# .float() switches back to full precision to ensure the precision and numerical stability of the subsequent trigonometric function calculations.
# [17x128] -> [17x64x2]
q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)
q_per_token_split_into_pairs.shape
```




    torch.Size([17, 64, 2])



<h3>
Start to obtain the complex-domain representation of the rotation matrix.
</h3>

<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_17d4fd3664b5.png" width="600"/>
</div>


```python
# Calculate θ. Step 1: Obtain i/D.
# [64]
zero_to_one_split_into_64_parts = torch.tensor(range(64))/64  # Each feature has 64 dimension pairs after segmentation, so 64 θ values are required
zero_to_one_split_into_64_parts
```




    tensor([0.0000, 0.0156, 0.0312, 0.0469, 0.0625, 0.0781, 0.0938, 0.1094, 0.1250,
            0.1406, 0.1562, 0.1719, 0.1875, 0.2031, 0.2188, 0.2344, 0.2500, 0.2656,
            0.2812, 0.2969, 0.3125, 0.3281, 0.3438, 0.3594, 0.3750, 0.3906, 0.4062,
            0.4219, 0.4375, 0.4531, 0.4688, 0.4844, 0.5000, 0.5156, 0.5312, 0.5469,
            0.5625, 0.5781, 0.5938, 0.6094, 0.6250, 0.6406, 0.6562, 0.6719, 0.6875,
            0.7031, 0.7188, 0.7344, 0.7500, 0.7656, 0.7812, 0.7969, 0.8125, 0.8281,
            0.8438, 0.8594, 0.8750, 0.8906, 0.9062, 0.9219, 0.9375, 0.9531, 0.9688,
            0.9844])




```python
# Calculate θ. Step 2: Obtain θ.
# rope_theta is used to control information such as the periodicity of the position encoding.
# For details, please refer to the configuration information section.
freqs = 1.0 / (rope_theta ** zero_to_one_split_into_64_parts)  # [64]
freqs
```




    tensor([1.0000e+00, 8.1462e-01, 6.6360e-01, 5.4058e-01, 4.4037e-01, 3.5873e-01,
            2.9223e-01, 2.3805e-01, 1.9392e-01, 1.5797e-01, 1.2869e-01, 1.0483e-01,
            8.5397e-02, 6.9566e-02, 5.6670e-02, 4.6164e-02, 3.7606e-02, 3.0635e-02,
            2.4955e-02, 2.0329e-02, 1.6560e-02, 1.3490e-02, 1.0990e-02, 8.9523e-03,
            7.2927e-03, 5.9407e-03, 4.8394e-03, 3.9423e-03, 3.2114e-03, 2.6161e-03,
            2.1311e-03, 1.7360e-03, 1.4142e-03, 1.1520e-03, 9.3847e-04, 7.6450e-04,
            6.2277e-04, 5.0732e-04, 4.1327e-04, 3.3666e-04, 2.7425e-04, 2.2341e-04,
            1.8199e-04, 1.4825e-04, 1.2077e-04, 9.8381e-05, 8.0143e-05, 6.5286e-05,
            5.3183e-05, 4.3324e-05, 3.5292e-05, 2.8750e-05, 2.3420e-05, 1.9078e-05,
            1.5542e-05, 1.2660e-05, 1.0313e-05, 8.4015e-06, 6.8440e-06, 5.5752e-06,
            4.5417e-06, 3.6997e-06, 3.0139e-06, 2.4551e-06])




```python
# Calculate mθ
# 'outer' computes the outer product, and 'arange(17)' represents the m corresponding to each vector (since the input has 17 tokens, 17 m values are needed).
# The result has a shape of [17x64], meaning that the vector corresponding to each token has 64 mθ values, which are used to calculate the rotation of each of the 64 dimension pairs.
freqs_for_each_token = torch.outer(torch.arange(17), freqs)  # [17] & [64] -> [17x64]
```


```python
# Obtain (cosmθ + sinmθi), that is, convert mθ to the complex-number form
# Regard the rotation angle mθ as a polar-coordinate form with a modulus of 1, and then convert it to a complex-number representation
# The two inputs of 'polar' represent the modulus (set to 1, meaning only the angle is changed without affecting the length) and the angle (i.e., mθ) respectively
freqs_cis = torch.polar(torch.ones_like(freqs_for_each_token), freqs_for_each_token)  # [17x64] -> [17x64]
print(freqs_cis.shape)

# View freqs_cis at some positions, just for display
token_to_show = [1, 3, 5]  # View the 2nd, 4th, and 6th rows
fig, axs = plt.subplots(1, len(token_to_show), figsize=(5 * len(token_to_show), 4))  # Generate a figure window with 3 sub-plots in 1 row and 3 columns
for i, index in enumerate(token_to_show):
    value = freqs_cis[index]
    for j, element in enumerate(value):
        # Plot a blue line from the origin to the coordinate point, with the real part as the x-coordinate and the imaginary part as the y-coordinate.
        axs[i].plot([0, element.real], [0, element.imag], color='blue', linewidth=1, label=f"Index: {j}")
        # Draw red numerical annotations to represent the j-th pair of dimensions.
        axs[i].annotate(f"{j}", xy=(element.real, element.imag), color='red')
    axs[i].set_xlabel('Real')
    axs[i].set_ylabel('Imaginary')
    axs[i].set_title(f'Plot of {index + 1}th of freqs_cis')
plt.show()

"""
Note: As shown in the figures, tokens in later positions have larger rotation angles, but within a single token, earlier vector dimensions have larger rotation angles.
      You can explore further to see if there are any mathematical reasons behind this if you are interested. X_X
"""
```
    torch.Size([17, 64])



    
![png](https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_95b7152370ba.png)
    




    '\nNote: As shown in the figures, tokens in later positions have larger rotation angles, but within a single token, earlier vector dimensions have larger rotation angles.\n      You can explore further to see if there are any mathematical reasons behind this if you are interested. X_X\n'



<h3>
Now we have provided a complex number (an angle-changing vector) for each dimension pair of the query vector corresponding to each token.
</h3>
<br>
Now we can convert our query (the one divided into pairs) into complex numbers and then rotate these queries through dot-product calculation. :)


```python
# Obtain (x + yi)
# That is, convert the dimension pairs into complex numbers. After the conversion, the shape changes from [17x64x2] to [17x64].
q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)  # [17x64x2] -> [17x64]
q_per_token_as_complex_numbers.shape
```




    torch.Size([17, 64])




```python
# Calculate (x + yi) * (cosmθ + sinmθi)
# That is, perform the rotation operation to obtain the final result.
q_per_token_as_complex_numbers_rotated = q_per_token_as_complex_numbers * freqs_cis  # [17x64] * [17x64] = [17x64]
q_per_token_as_complex_numbers_rotated.shape
```




    torch.Size([17, 64])



<h3>
Obtain the rotated vectors (restore the shape).
</h3>
<br>
We can represent the complex numbers as real numbers again to obtain the query results in the form of dimension pairs.


```python
# Convert the complex-number results back to the real-number dimension-pair form.
q_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers_rotated)  # [17x64] -> [17x64x2]
q_per_token_split_into_pairs_rotated.shape
```




    torch.Size([17, 64, 2])



Merge the rotated dimension pairs. In this way, we obtain a new query vector (the rotated query vector) with a shape of [17x128], where 17 represents the number of tokens and 128 represents the dimension of the query vector.


```python
# Restore the dimension-pair results to the original form of the query vectors, and obtain the final query vector.
q_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)  # [17x64x2] -> [17x128]
q_per_token_rotated.shape
```




    torch.Size([17, 128])



#### Add positional information to the key vectors (same as the query)


```python
# Split the key vectors into pairs along the dimension direction to form dimension pairs (modify the shape).
k_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0], -1, 2)  # [17x128] -> [17x64x2]
k_per_token_split_into_pairs.shape
```




    torch.Size([17, 64, 2])
After the conversion, the shape of the dimensions will change from [17x64x2] to [17x64].\nk_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs)  # [17x64x2] -> [17x64]\nk_per_token_as_complex_numbers.shape\n```\n\n\n\n\n    torch.Size([17, 64])\n\n\n\n\n```python\n# Calculate (x + yi) * (cosmθ + sinmθi)\n# That is, perform the rotation operation to obtain the final result.\n# And convert the result back to the real-number form.\nk_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers * freqs_cis)  # [17x64] * [17x64] = [17x64] -> [17x64x2]\nk_per_token_split_into_pairs_rotated.shape\n```\n\n\n\n\n    torch.Size([17, 64, 2])\n\n\n\n\n```python\n# Restore the dimension-pair results to the original form of the key vectors, and obtain the final key vector.\nk_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)  # [17x64x2] -> [17x128]\nk_per_token_rotated.shape\n```\n\n\n\n\n    torch.Size([17, 128])\n\n\n\n\u003Ch3>\nAt this stage, we have the rotated query vectors and key vectors corresponding to each token.\n\u003C\u002Fh3>\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_e044c5abbf69.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\nThe shape of each query and key vector remains [17x128].\n\n### Everything's ready. Let's start calculating the attention weights between tokens.\n\nThis will involve a three-step process:\n1. Calculate the attention scores: score = Q x K\n2. Mask the future tokens: score = mask(score)\n3. Calculate the attention weights: res = softmax(score)\n\nLet's get started! :)\n\n#### Multiply the query and key vectors to obtain the attention scores.\n\nIn this way, we will get the score values between each token and all other tokens.\n\u003Cbr>\nThese scores represent how strongly each token's query relates to every other token's key.\n\u003Cbr>\nThis is the self-attention!\n\u003Cbr>\nThe shape of this attention score matrix (qk_per_token) is [17x17], where 17 is the number of tokens in the input prompt.\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_8804d1a9ce96.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\n# Calculate the attention score\n# At the same time, perform normalization to prevent the subsequent softmax calculation results from being overly skewed towards 0 or 1,\n# (the dot-product values may be too large when the dimensions are large),\n# which could lead to vanishing gradients or exploding gradients, so as to maintain numerical stability.\nqk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T)\u002F(head_dim)**0.5  # [17x128] x [128x17] = [17x17]\nqk_per_token.shape\n```\n\n\n\n\n    torch.Size([17, 17])\n\n\n\n#### Now we must mask the future query-key scores.\n\nDuring the training process of Llama 3, the QK scores of future tokens will be masked.\n\u003Cbr>\nWhy? Because during training, we only learn how to use past tokens to predict the current token. 
If we don't mask future tokens, it will lead to the leakage of prediction information.\n\u003Cbr>\nTherefore, during the inference process, we also need to set the future tokens to 0 (to ensure the logical consistency between the training and inference processes).\n\u003Cbr>\n\nOf course, if you're as curious as I am about what would happen without masking, you can check the results of the additional experiment I conducted in the last section after you've finished learning. (^_\u003C) \n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_85e31c7ffd44.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\n# First, take a look at the score matrix before masking\ndef display_qk_heatmap(qk_per_token):\n    _, ax = plt.subplots()  # Create a figure window\n\n    # `imshow` is commonly used to display data in the form of a two-dimensional array or matrix,\n    # it maps the matrix elements to grayscale or color values, so it can be used to draw heatmaps.\n    # Convert the tensor back to full precision, then detach it from the computational graph to avoid potential gradient calculation and storage issues.\n    # Specify to use the 'viridis' color mapping scheme to display the image (blue -> green -> yellow).\n    im = ax.imshow(qk_per_token.to(float).detach(), cmap='viridis')\n\n    # Set the number and labels of the x and y axis ticks to ensure correct one-to-one correspondence.\n    ax.set_xticks(range(len(prompt_split_as_tokens)))\n    ax.set_yticks(range(len(prompt_split_as_tokens)))\n    ax.set_xticklabels(prompt_split_as_tokens)\n    ax.set_yticklabels(prompt_split_as_tokens)\n\n    # Add a color bar on the side.\n    # Specify `im` to identify the correct color mapping and value range.\n    # Specify the sub-plot it belongs to as `ax` (if there are multiple sub-plots, it would be `ax = ax[i]`).\n    ax.figure.colorbar(im, ax=ax)\n    \ndisplay_qk_heatmap(qk_per_token)\n```\n\n\n    \n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_d4d8169ef08f.png)\n    \n\n\n\n```python\n# Generate the masking matrix\n# Set the positions of the elements to be masked to negative infinity, and set the positions that do not need to be masked to 0.\n# Then, add it to the score matrix to achieve the masking effect (negative infinity will tend to 0 when calculating the softmax).\n\n# `torch.full` is used to generate a tensor with a specified shape and filling value.\n# Here, a [17x17] matrix filled with negative infinity is first generated.\n# Specify that the device of this matrix is the same as that of the previous tokens to ensure that there are no errors in subsequent calculations,\n# for example, if the previous tokens are on the GPU and the device is not specified here, the `mask` will be newly created on the CPU, and an error will occur when adding the two.\nmask = torch.full((len(tokens), len(tokens)), float(\"-inf\"), device=tokens.device)  # [17x17]\n\n# `torch.triu` is used to return the upper-triangular part of the matrix, and set the rest to 0 (use `torch.tril` to get the lower-triangular part).\n# `diagonal` is the offset of the diagonal. 
When it's 1, it means taking the upper-triangular part starting from 1 position above the main diagonal to avoid masking the token itself.\nmask = torch.triu(mask, diagonal=1)  # [17x17]\n\nmask, mask.shape\n```\n\n\n\n\n    (tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]),\n     torch.Size([17, 17]))\n\n\n\n\n```python\n# Mask the scores of future tokens\nqk_per_token_after_masking = qk_per_token + mask  # [17x17] + [17x17] = [17x17]\ndisplay_qk_heatmap(qk_per_token_after_masking)  # Display the attention scores after masking\n```\n\n\n    \n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_43a2d88c2dbe.png)\n    \n\n\n#### Calculate the final attention weights, that is, softmax(score).\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_8574b10f57e4.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\n# Calculate the attention weights\n# That is, calculate the softmax values of the scores.\n# `dim = 1` indicates that the softmax calculation is performed row-by-row, and the result is converted to half-precision to be consistent with the subsequent value vectors.\nqk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)  # [17x17] -> [17x17]\ndisplay_qk_heatmap(qk_per_token_after_masking_after_softmax)\n```\n\n\n    \n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_7d690f953c6c.png)\n    \n\n\n### Finally! 
Calculate the final result of the single-head attention mechanism!
<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_1dac0fb02c00.png" width="600px"/>
</div>

Principle: The previous attention weights (ranging from 0 to 1) determine what proportion of each value vector should be used for each token (i.e., they weight the value vectors).

Example: If the input consists of 3 tokens, the attention result of the first token might be: res = 0.6 * value_1 + 0.3 * value_2 + 0.1 * value_3

The shape of the attention result after multiplying the weight matrix by the value matrix is [17x128].


```python
# Calculate the final result of the single-head attention
qkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)  # [17x17] x [17x128] = [17x128]
qkv_attention.shape
```

    torch.Size([17, 128])



## Calculate the multi-head attention mechanism (a simple loop to repeat the above process)
<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_c5ab2100efca.png" width="600px"/>
</div>

We now have the attention values for the first head of the first layer.
<br>

Now we need to run a loop that performs exactly the same mathematical process as the previous cell, but for each head in the first layer.
<br><br>

<h3>
It's worth noting that in the <a href="https://github.com/meta-llama/llama3/blob/main/llama/model.py#L90">official Llama3 code implementation</a>, the multi-head attention calculation uses one-shot batched matrix multiplications instead of time-consuming for-loop calculations. The general process is as follows:
</h3>

1. Based on matrix parallelism, calculate the QKV vectors: [17x4096] x [4096x4096] or [4096x1024] = [17x4096] or [17x1024], and then reshape them to [32x17x128] or [8x17x128].
2. After obtaining the QKV vectors, duplicate the internal parts of the K and V vectors to make their shapes consistent with the Q vectors. At this point, all of them have the shape [32x17x128].
3. When calculating the scores, transpose the last two dimensions of the key tensor to complete the matrix multiplication. For example, `torch.matmul(q, k.transpose(1,2)) / head_dim ** 0.5`, which is [32x17x128] x [32x128x17] = [32x17x17].
4. The same principle applies to the other matrix calculations.

Note: The matrix shape changes in each step above are simplified versions, only for illustration, and differ from the official Llama3 implementation (which involves many more shape changes). A minimal sketch of this batched formulation follows below.
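Here is a hedged sketch of that batched formulation, reusing the tensors already defined in this walkthrough (token_embeddings, q_layer0, k_layer0, v_layer0, freqs_cis, tokens, dim, n_heads, n_kv_heads, head_dim). The helper name `rope_rotate` and the `repeat_interleave`-based KV duplication are my own simplifications, not the official code:

```python
# Loop-free multi-head attention for layer 0, as a sketch (not the official implementation).
def rope_rotate(x, freqs_cis):  # x: [n_heads x 17 x 128]
    pairs = x.float().reshape(*x.shape[:-1], -1, 2)     # split dims into pairs, [n_heads x 17 x 64 x 2]
    rotated = torch.view_as_complex(pairs) * freqs_cis  # broadcast the [17x64] angles over heads
    return torch.view_as_real(rotated).reshape(x.shape) # back to [n_heads x 17 x 128]

# Step 1: all heads' QKV in one matmul each, then reshape to [heads x tokens x head_dim]
q = torch.matmul(token_embeddings, q_layer0.view(-1, dim).T).view(len(tokens), n_heads, head_dim).transpose(0, 1)     # [32x17x128]
k = torch.matmul(token_embeddings, k_layer0.view(-1, dim).T).view(len(tokens), n_kv_heads, head_dim).transpose(0, 1)  # [8x17x128]
v = torch.matmul(token_embeddings, v_layer0.view(-1, dim).T).view(len(tokens), n_kv_heads, head_dim).transpose(0, 1)  # [8x17x128]

# Step 2: duplicate each KV head 4 times so KV shapes match Q; index i -> i//4, same as k_layer0[head//4] in the loop
k = k.repeat_interleave(n_heads // n_kv_heads, dim=0)  # [32x17x128]
v = v.repeat_interleave(n_heads // n_kv_heads, dim=0)  # [32x17x128]

# RoPE on Q and K (the same rotation as the per-head version, just batched)
q, k = rope_rotate(q, freqs_cis), rope_rotate(k, freqs_cis)

# Step 3: scores for all heads at once via a transpose of the last two dims, plus the causal mask per head
scores = torch.matmul(q, k.transpose(1, 2)) / head_dim ** 0.5                      # [32x17x128] x [32x128x17] = [32x17x17]
scores = scores + torch.triu(torch.full_like(scores, float("-inf")), diagonal=1)  # mask future tokens
weights = torch.nn.functional.softmax(scores, dim=-1).to(torch.bfloat16)

# Step 4: weighted sum of values, then flatten the heads back to [17x4096]
out = torch.matmul(weights, v).transpose(0, 1).reshape(len(tokens), -1)  # [32x17x128] -> [17x4096]
```

Up to floating-point differences, `out` should match the `stacked_qkv_attention` produced by the per-head loop below; the official code instead keeps a [batch x seq x heads x head_dim] layout and its own KV-duplication helper.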
\n\n### Calculate the result for each head\n\n\n```python\n# Calculate the multi-head attention results\n# That is, a loop of the previous single-head attention calculation process\nqkv_attention_store = []\n\nfor head in range(n_heads):\n    # Extract the QKV weight matrices corresponding to the current head\n    q_layer0_head = q_layer0[head]  # [32x128x4096] -> [128x4096]\n    k_layer0_head = k_layer0[head\u002F\u002F4]  # Every 4 heads share one key weight, [8x128x4096] -> [128x4096]\n    v_layer0_head = v_layer0[head\u002F\u002F4]  # Every 4 heads share one value weight, [8x128x4096] -> [128x4096]\n    \n    # Calculate XW to obtain the QKV vectors\n    # [17x4096] x [4096x128] = [17x128]\n    q_per_token = torch.matmul(token_embeddings, q_layer0_head.T)\n    k_per_token = torch.matmul(token_embeddings, k_layer0_head.T)\n    v_per_token = torch.matmul(token_embeddings, v_layer0_head.T)\n    \n    # Add position information to the query vector (RoPE)\n    q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)  # Divide vector into pairs along the dimensions direction to form dimension pairs. [17x128] -> [17x64x2]\n    q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)  # Convert to complex number representation, (x,y) -> (x+yi). [17x64x2] -> [17x64]\n    q_per_token_as_complex_numbers_rotated = q_per_token_as_complex_numbers * freqs_cis[:len(tokens)]  # Calculate (x+yi)*(cosmθ+sinmθi) to complete the rotation operation. [17x64] * [17x64] = [17x64]\n    q_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers_rotated)  # Convert the result back to real number representation, (x+yi) -> (x,y). [17x64] -> [17x64x2]\n    q_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)  # Convert the result back to the original vector shape to obtain the final query vector. [17x64x2] -> [17x128]\n\n    # Add position information to the key vector (RoPE)\n    k_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0], -1, 2)  # Divide vector into pairs along the dimensions direction to form dimension pairs. [17x128] -> [17x64x2]\n    k_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs)  # Convert to complex number representation, (x,y) -> (x+yi). [17x64x2] -> [17x64]\n    k_per_token_as_complex_numbers_rotated = k_per_token_as_complex_numbers * freqs_cis[:len(tokens)]  # Calculate (x+yi)*(cosmθ+sinmθi) to complete the rotation operation. [17x64] * [17x64] = [17x64]\n    k_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers_rotated)  # Convert the result back to real number representation, (x+yi) -> (x,y). [17x64] -> [17x64x2]\n    k_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)  # Convert the result back to the original vector shape to obtain the final key vector. [17x64x2] -> [17x128]\n\n    # Calculate the attention scores and normalize the scores simultaneously (i.e., Q×K\u002Fsqrt(dim))\n    qk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T)\u002F(head_dim)**0.5  # [17x128] x [128x17] = [17x17]\n    \n    # Mask the scores of future tokens\n    mask = torch.full(qk_per_token.shape, float(\"-inf\"), device=tokens.device)  # Create a matrix with the same shape as the attention scores, filled with negative infinity, and stored in the same device as other vectors to prevent errors in subsequent calculations. 
[17x17]\n    mask = torch.triu(mask, diagonal=1)  # Keep the negative infinity in the upper-triangular part and set others to 0 (i.e., the upper-triangular area represents future tokens that need to be masked). The diagonal offset is 1 to avoid masking the token itself. [17x17]\n    qk_per_token_after_masking = qk_per_token + mask  # Add the attention scores with the masking matrix, making the upper-triangular part of the score matrix become negative infinity, which will tend to 0 after the subsequent softmax operation. [17x17]\n    \n    # Calculate the attention weights (i.e., softmax(score))\n    # Meanwhile, convert it back to half-precision (because it will be multiplied with the value vector v_per_token later, so the data types need to be the same).\n    qk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)  # Calculate the softmax row-by-row. [17x17]\n    \n    # Calculate the final result of the attention mechanism (i.e., softmax(score) × V)\n    qkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)  # [17x17] × [17x128] = [17x128]\n    \n    # Record the result of this head\n    qkv_attention_store.append(qkv_attention)\n\nlen(qkv_attention_store)\n```\n\n\n\n\n    32\n\n\n\n### Merge the results of each head into a large matrix\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_ea75afe08f8d.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\nNow we have the results of the attention mechanism for all 32 heads in the first layer. Next, we'll merge all the attention values into a large matrix with a shape of [17x4096].\n\u003Cbr>\nWe're almost done with the calculation of the attention layer :)\n\n\n```python\n# Merge the multi-head attention matrices\nstacked_qkv_attention = torch.cat(qkv_attention_store, dim=-1)  # Concatenate along the second dimension, 32x[17x128] -> [17x4096]\nstacked_qkv_attention.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n### Head-to-head information interaction (linear mapping), the final step of the self-attention layer!\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_996c739db7aa.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\nThe last step of the attention calculation for layer0 is to perform the final linear mapping, that is, multiply the combined attention matrix by the output weight matrix.\n\n\n```python\n# Load the output weight matrix of layers.0\nw_layer0 = model[\"layers.0.attention.wo.weight\"]  # [4096x4096]\nw_layer0.shape\n```\n\n\n\n\n    torch.Size([4096, 4096])\n\n\n\nThis is just a simple linear layer, so we only need matrix multiplication.\n\n\n```python\n# Perform the linear mapping of the attention matrix\n# This is the final output of the attention layer\nembedding_delta = torch.matmul(stacked_qkv_attention, w_layer0.T)  # [17x4096] x [4096x4096] = [17x4096]\nembedding_delta.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n## Perform the residual operation (add)\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_0399128c7400.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\nNow we have the value of the input vector after the attention mechanism is applied. 
At this time, we need to add the original input vector back to it (i.e., the residual operation, which keeps information from being lost and alleviates the vanishing-gradient problem).


```python
# Add the output of the attention layer to the original input to complete the residual operation
embedding_after_edit = token_embeddings_unnormalized + embedding_delta  # [17x4096] + [17x4096] = [17x4096]
embedding_after_edit.shape
```

    torch.Size([17, 4096])



## Perform the second normalization operation
<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_872485ee5e22.png" width="600px"/>
</div>


```python
# Normalize the result of the residual operation
embedding_after_edit_normalized = rms_norm(embedding_after_edit, model["layers.0.ffn_norm.weight"])  # [17x4096] & [4096] -> [17x4096]
embedding_after_edit_normalized.shape
```

    torch.Size([17, 4096])



## Perform the calculation of the FFN (Feed-Forward Neural Network) layer
<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_73bafe1bed53.png" width="600px"/>
</div>
<br>
In Llama3, the SwiGLU feed-forward network is used. This architecture can effectively add nonlinearity wherever the model needs it.
<br>
Nowadays, this kind of feed-forward architecture is very common in large language models.
<br><br>

<h3>Why Introduce Nonlinear Layers:</h3>

- Nonlinearity is at the core of why neural networks can be considered "universal function approximators". In traditional neural networks, we use nonlinear activation functions (such as sigmoid, ReLU, etc.) to increase the model's expressive power, enabling it to fit the complex patterns hidden in the training data.
- However, in the Transformer, the attention mechanism is essentially a linear weighted sum of the value vectors (even though the weights come from the nonlinear softmax function, they still only weight the values linearly). Therefore, although attention can capture global dependencies, its output is still just a linear combination of the input, and without anything more the Transformer would lack nonlinear capabilities.
- So it is necessary to add an FFN after the self-attention layer to introduce nonlinear transformation capabilities, improving the model's ability to model complex semantic relationships.

<h3>Generally, introducing nonlinear layers plays the following roles:</h3>

1. It adds nonlinear capabilities to the model, facilitating learning and training.
2. It enhances the model's information abstraction ability, enabling it to represent data features and patterns at different levels during layer-by-layer processing. For example, lower layers can identify basic language structures (such as part-of-speech), while higher layers can understand more complex semantic information (such as sentiment or intention).
3. In addition, a current view holds that the attention layer is mainly used for interaction across the input context, while the FFN layer is where an LLM mainly stores and memorizes general knowledge during training (owing to its nonlinear representation ability), so that it can draw answers to input questions from that general knowledge.

<h3>SwiGLU Network Structure:</h3>

1. Perform a linear transformation on the input: $X^\prime = XW_3$
2. Gating unit: $GATE = Activation\_Function(XW_1)$, which selectively passes information: assuming the components of $X^\prime$ differ in importance, they should be weighted by the gating scores before being passed on, which improves the expressive ability of the model.
3. The activation function used is the Swish activation function (hence the name SwiGLU, a combination of the Swish activation function and the Gated Linear Unit (GLU)). The formula is $Swish = X \cdot \sigma(\beta X)$, where $\sigma$ is the sigmoid function. In SwiGLU, $\beta$ is set to 1 (in the original formula, it is a learnable parameter).
4. Therefore, the specific calculation of the gating unit is $GATE = XW_1 \cdot \sigma(XW_1)$. In PyTorch, this activation function is called silu, that is, $GATE = silu(XW_1)$.
5. Application of the gating mechanism: $X^\prime = X^\prime \cdot GATE$
6. Perform a linear transformation again: $Y = X^\prime W_2$

<h3>Calculation of the Dimension Size of the Hidden Layer in the Feed-Forward Layer (Based on the Official Implementation Process of Llama3):</h3>

1. The input dimension is dim = 4096
2. hidden_dim = 4 * dim = 16384  # First, scale it up by a factor of four. When initializing the feed-forward layer in the Transformer block, the input hidden_dim is multiplied by four.
3. hidden_dim = int(2 * hidden_dim / 3) = 10922  # Then, scale it by 2/3. This scaling is the first thing done inside the feed-forward layer.
4. hidden_dim = int(ffn_dim_multiplier * hidden_dim) = int(1.3 * 10922) = 14198  # Then, scale it by ffn_dim_multiplier, which is defined as 1.3 in the model configuration file.
5. hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of) = 1024 * ((14198 + 1024 - 1) // 1024) = 14336  # Round it up to an integer multiple of multiple_of, defined as 1024 in the model configuration file, so that all hidden-layer dimensions in the model are multiples of 1024, improving computational efficiency.
6. Finally, the dimension size of the hidden layer is 14336.
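To make the two lists above concrete, here is a minimal, hedged sketch (the helper names `swiglu_hidden_dim` and `swiglu` are my own, not the official module) that reproduces the hidden-dimension arithmetic and the SwiGLU forward pass with plain tensors:

```python
import torch

# Reproduce the hidden-dimension arithmetic above (config values: dim=4096, ffn_dim_multiplier=1.3, multiple_of=1024).
def swiglu_hidden_dim(dim=4096, ffn_dim_multiplier=1.3, multiple_of=1024):
    hidden_dim = 4 * dim                                # step 2: 16384
    hidden_dim = int(2 * hidden_dim / 3)                # step 3: 10922
    hidden_dim = int(ffn_dim_multiplier * hidden_dim)   # step 4: 14198
    return multiple_of * ((hidden_dim + multiple_of - 1) // multiple_of)  # step 5: round up -> 14336

# SwiGLU forward pass: Y = (silu(X W1) * (X W3)) W2, matching the weight shapes used in this document.
def swiglu(x, w1, w2, w3):
    gate = torch.nn.functional.silu(x @ w1.T)  # GATE = silu(XW1), [17xdim] -> [17xhidden]
    up = x @ w3.T                              # X' = XW3, [17xdim] -> [17xhidden]
    return (gate * up) @ w2.T                  # Y = (GATE * X')W2, [17xhidden] -> [17xdim]

hidden = swiglu_hidden_dim()                   # 14336
x = torch.randn(17, 4096)
w1, w3 = torch.randn(hidden, 4096) * 0.01, torch.randn(hidden, 4096) * 0.01
w2 = torch.randn(4096, hidden) * 0.01
print(hidden, swiglu(x, w1, w2, w3).shape)     # 14336 torch.Size([17, 4096])
```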


```python
# Calculate the feed-forward network layer
# The dimension size of the hidden layer is 14336
w1 = model["layers.0.feed_forward.w1.weight"]  # [14336x4096]
w3 = model["layers.0.feed_forward.w3.weight"]  # [14336x4096]
w2 = model["layers.0.feed_forward.w2.weight"]  # [4096x14336]
print(w1.shape, w3.shape, w2.shape)

# output = (silu(XW1) * XW3)W2
# [17x4096] x [4096x14336] x [14336x4096] = [17x4096]
output_after_feedforward = torch.matmul(torch.functional.F.silu(torch.matmul(embedding_after_edit_normalized, w1.T)) * torch.matmul(embedding_after_edit_normalized, w3.T), w2.T)
output_after_feedforward.shape
```

    torch.Size([14336, 4096]) torch.Size([14336, 4096]) torch.Size([4096, 14336]) torch.Size([17, 4096])



## Perform the residual operation again (Finally, we get the final output of the Transformer block!)


```python
# Add the output of the feed-forward layer to the original input to complete the residual operation
# This is the final result of a Transformer block
layer_0_embedding = embedding_after_edit+output_after_feedforward  # [17x4096] + [17x4096] = [17x4096]
layer_0_embedding.shape
```

    torch.Size([17, 4096])



<h3>
Finally, we have the new embeddings of each token after passing through the first layer.
</h3>
<br>
There are only 31 layers left to complete (just one for loop away).
<br>
You can think of this processed embedding as containing all the information the first layer extracted from the tokens.
<br>
Each subsequent layer will encode ever more complex queries about the question being asked, until we end up with an embedding that carries everything needed to predict the next token.

# Everything is here. Let's complete the calculation of all 32 Transformer blocks. Happy reading :)
<div>
    <img src="https://oss.gittoolsai.com/images/therealoliver_Deepdive-llama3-from-scratch_readme_578c125085b4.png" width="600px"/>
</div>

Yes, that's it.
All the work we've done before will be presented here at once to complete the calculation of each layer.\n\u003Cbr>\n\n\n```python\n# Now, let's start to complete the calculation of all 32 Transformer blocks!\n\n# Use the embeddings of the input tokens as the initial input.\nfinal_embedding = token_embeddings_unnormalized  # [17x4096]\n\n# Perform layer-by-layer calculation for the 32-layer Transformer blocks\nfor layer in range(n_layers):\n    #########################################################################################################################\n    ################### Round 1: Normalization - Feature Transformation - Residual Operation ###############################\n    \n    ########################### The first normalization ###################################################\n    \n    # The first normalization\n    layer_embedding_norm = rms_norm(final_embedding, model[f\"layers.{layer}.attention_norm.weight\"])  # [17x4096] & [4096] -> [17x4096]\n    \n    ################ The first feature transformation - Multi-Head Self-Attention ########################\n    \n    # Obtain the qkv weight matrix of the attention mechanism for the current layer\n    q_layer = model[f\"layers.{layer}.attention.wq.weight\"]  # [4096x4096]\n    q_layer = q_layer.view(n_heads, q_layer.shape[0] \u002F\u002F n_heads, dim)  # [32x128x4096]\n    k_layer = model[f\"layers.{layer}.attention.wk.weight\"]  # [1024x4096]\n    k_layer = k_layer.view(n_kv_heads, k_layer.shape[0] \u002F\u002F n_kv_heads, dim)  # [8x128x4096]\n    v_layer = model[f\"layers.{layer}.attention.wv.weight\"]  # [1024x4096]\n    v_layer = v_layer.view(n_kv_heads, v_layer.shape[0] \u002F\u002F n_kv_heads, dim)  # [8x128x4096]\n    \n    # Used to store the calculation results of the attention mechanism for each head\n    qkv_attention_store = []\n    \n    # Calculate the attention mechanism results for each head\n    for head in range(n_heads):\n        # Extract the QKV weight matrices corresponding to the current head\n        q_layer_head = q_layer[head]  # [32x128x4096] -> [128x4096]\n        k_layer_head = k_layer[head\u002F\u002F4]  # Every 4 heads share one key weight, [8x128x4096] -> [128x4096]\n        v_layer_head = v_layer[head\u002F\u002F4]  # Every 4 heads share one value weight, [8x128x4096] -> [128x4096]\n        \n        # Calculate XW to obtain the QKV vectors\n        # [17x4096] x [4096x128] = [17x128]\n        q_per_token = torch.matmul(layer_embedding_norm, q_layer_head.T)\n        k_per_token = torch.matmul(layer_embedding_norm, k_layer_head.T)\n        v_per_token = torch.matmul(layer_embedding_norm, v_layer_head.T)\n        \n        # Add position information to the query vector (RoPE)\n        q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)  # Divide vector into pairs along the dimensions direction to form dimension pairs. [17x128] -> [17x64x2]\n        q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)  # Convert to complex number representation, (x,y) -> (x+yi). [17x64x2] -> [17x64]\n        q_per_token_as_complex_numbers_rotated = q_per_token_as_complex_numbers * freqs_cis  # Calculate (x+yi)*(cosmθ+sinmθi) to complete the rotation operation. [17x64] * [17x64] = [17x64]\n        q_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers_rotated)  # Convert the result back to real number representation, (x+yi) -> (x,y). 
[17x64] -> [17x64x2]\n        q_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)  # Convert the result back to the original vector shape to obtain the final query vector. [17x64x2] -> [17x128]\n        \n        # Add position information to the key vector (RoPE)\n        k_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0], -1, 2)  # Divide vector into pairs along the dimensions direction to form dimension pairs. [17x128] -> [17x64x2]\n        k_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs)  # Convert to complex number representation, (x,y) -> (x+yi). [17x64x2] -> [17x64]\n        k_per_token_as_complex_numbers_rotated = k_per_token_as_complex_numbers * freqs_cis  # Calculate (x+yi)*(cosmθ+sinmθi) to complete the rotation operation. [17x64] * [17x64] = [17x64]\n        k_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers_rotated)  # Convert the result back to real number representation, (x+yi) -> (x,y). [17x64] -> [17x64x2]\n        k_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)  # Convert the result back to the original vector shape to obtain the final key vector. [17x64x2] -> [17x128]\n        \n        # Calculate the attention scores and normalize the scores simultaneously (i.e., Q×K\u002Fsqrt(dim))\n        qk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T)\u002F(128)**0.5  # [17x128] x [128x17] = [17x17]\n        \n        # Mask the scores of future tokens\n        mask = torch.full(qk_per_token.shape, float(\"-inf\"), device=qk_per_token.device)  # Create a matrix with the same shape as the attention scores, filled with negative infinity, and stored in the same device as other vectors to prevent errors in subsequent calculations. [17x17]\n        mask = torch.triu(mask, diagonal=1)  # Keep the negative infinity in the upper-triangular part and set others to 0 (i.e., the upper-triangular area represents future tokens that need to be masked). The diagonal offset is 1 to avoid masking the token itself. [17x17]\n        qk_per_token_after_masking = qk_per_token + mask  # Add the attention scores with the masking matrix, making the upper-triangular part of the score matrix become negative infinity, which will tend to 0 after the subsequent softmax operation. [17x17]\n        \n        # Calculate the attention weights (i.e., softmax(score))\n        # Meanwhile, convert it back to half-precision (because it will be multiplied with the value vector v_per_token later, so the data types need to be the same).\n        qk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)  # Calculate the softmax row-by-row. 
[17x17]\n        \n        # Calculate the final result of the attention mechanism (i.e., softmax(score) × V)\n        qkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)  # [17x17] x [17x128] = [17x128]\n        \n        # Record the result of this head\n        qkv_attention_store.append(qkv_attention)\n    \n    # Merge the multi-head attention results\n    stacked_qkv_attention = torch.cat(qkv_attention_store, dim=-1)  # Merge the second dimension, that is, 32x[17x128] -> [17x4096]\n    \n    # Perform a linear mapping on the results to generate the final multi-head self-attention mechanism results\n    o_layer = model[f\"layers.{layer}.attention.wo.weight\"]\n    embedding_delta = torch.matmul(stacked_qkv_attention, o_layer.T)  # [17x4096] x [4096x4096] = [17x4096]\n\n    ########################### The first residual operation ##############################################\n    \n    # The first Residual Operation\n    # Add the output of the attention layer to the original input to complete the residual operation\n    embedding_after_edit = final_embedding + embedding_delta  # [17x4096] + [17x4096] = [17x4096]\n    \n    \n    #########################################################################################################################\n    #################### Round 2: Normalization - Feature Transformation - Residual Operation ##############################\n    \n    ########################### The second normalization ##################################################\n    \n    # The second normalization\n    embedding_after_edit_normalized = rms_norm(embedding_after_edit, model[f\"layers.{layer}.ffn_norm.weight\"])  # [17x4096] & [4096] -> [17x4096]\n    \n    ################## The second feature transformation - Feed-Forward Network ##########################\n    \n    # Load the parameter matrix of the feed-forward network (SwiGLU)\n    w1 = model[f\"layers.{layer}.feed_forward.w1.weight\"]  # [14336x4096]\n    w3 = model[f\"layers.{layer}.feed_forward.w3.weight\"]  # [14336x4096]\n    w2 = model[f\"layers.{layer}.feed_forward.w2.weight\"]  # [4096x14336]\n    \n    # Calculate the results of the feed-forward network (output = (silu(XW1) * XW3)W2)\n    # [17x4096] x [4096x14336] x [14336x4096] = [17x4096]\n    output_after_feedforward = torch.matmul(torch.functional.F.silu(torch.matmul(embedding_after_edit_normalized, w1.T)) * torch.matmul(embedding_after_edit_normalized, w3.T), w2.T)\n    \n    ########################### The second residual operation ##############################################\n    \n    # The second residual operation, obtain the final output result of the current Transformer block\n    # Add the output of the feed-forward layer to the original input to complete the residual operation\n    final_embedding = embedding_after_edit+output_after_feedforward  # [17x4096] + [17x4096] = [17x4096]\n```\n\n# Let's complete the last step and predict the next token\n\nNow we have obtained the final embeddings, which contains all the information we needed to predict the next token.\n\u003Cbr>\nThe shape of this embedding is the same as that of the input token embedding, both being [17x4096], where 17 is the number of tokens and 4096 is the dimension of the embedding.\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_93e18e3d2e08.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n## First, perform one last normalization on the output 
of the last Transformer layer\n\n\n```python\n# Perform the last normalization in the entire model\nfinal_embedding = rms_norm(final_embedding, model[\"norm.weight\"])  # [17x4096] & [4096] -> [17x4096]\nfinal_embedding.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n## Then, make the prediction based on the embedding corresponding to the last token (perform a linear mapping to the vocabulary dimension)\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_69eadbc654a8.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n\u003Cbr>\nWe will use the output decoder (a linear mapping layer) to convert the embedding vector of the last token into a prediction result for the next token (the dimension is the size of the vocabulary. If we apply a softmax function to the result, the value of each dimension represents the probability that the next token belongs to that word).\n\u003Cbr>\u003Cbr>\n\nWhy do we only use the output vector of the last token to predict the next token?\n\u003Cbr>\nBecause during training, the model's objective is to predict the next token based on the current token and all previous tokens. Therefore, the output vector corresponding to each token is used to predict the next token relative to itself, rather than the next token for the entire input.\n\u003Cbr>\u003Cbr>\n\nWe hope the answer is 42 in our example :)\n\u003Cbr>\nNote: 42 is the answer to \"the answer to the ultimate question of life, the universe, and everything is \" according to the book *The Hitchhiker's Guide to the Galaxy*. Most modern large language models will answer 42, which will verify the correctness of our entire code! Good luck to us :)\n\n\n```python\n# Perform the last linear mapping to map the embeddings to the size of the vocabulary dimension as a prediction for the next token\nlogits = torch.matmul(final_embedding[-1], model[\"output.weight\"].T)  # [17x4096] -> [4096] -> [4096] x [4096x128256] = [128256]\nlogits.shape\n```\n\n\n\n\n    torch.Size([128256])\n\n\n\n## Here's the prediction result!\n\n\n```python\n# Extract the id corresponding to the dimension with the highest probability,\n# is gonna be the predicted next token's id\nnext_token = torch.argmax(logits, dim=-1)  # Get the index corresponding to the maximum value, which is the predicted next token id. [128256] -> [1]\nnext_token\n```\n\n\n\n\n    tensor(2983)\n\n\n\n\n```python\n# Based on the predicted id, restore it to the specific predicted value\ntokenizer.decode([next_token.item()])\n```\n\n\n\n\n    '42'\n\n\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_b7e00f296458.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n# Let's dive deeper and see how different embeddings or token masking strategies might affect the prediction results :)\n\nNow we've got the final prediction results. If you're still interested, let's explore some of the issues that might have been mentioned before~\n\u003Cbr>\n\nWe'll briefly explore three scenarios:\n1. Apart from the top-1 result, what else is predicted in the current prediction, that is, the top-k results?\n2. What can be predicted if we use the output embedding of other tokens for prediction?\n3. 
If the future tokens were not masked during the attention calculation before, how would the prediction results differ?\n\n\n```python\n# Let's first take a look at the top-k prediction results\nlogits_sort, logits_idx = torch.sort(logits, dim=-1, descending=True)  # Put the token with the highest probability prediction at the front, [128256]\n[tokenizer.decode([i]) for i in logits_idx[:10]]  # View the top 10 high-probability results\n```\n\n\n\n\n    ['42', '6', '43', '41', '4', '1', '45', '3', '2', '46']\n\n\n\n\n```python\n# Next, let's to see what can we get by using the embeddings of other tokens for prediction\nlogits_all_token = torch.matmul(final_embedding, model[\"output.weight\"].T)  # Map the embeddings to the same size as the vocabulary, [17x4096] x [4096x128256] = [17x128256]\nlogits_all_token_sort, logits_all_token_idx = torch.sort(logits_all_token, dim=-1, descending=True)  # Put the token with the highest probability prediction at the front, [17x128256]\n\nprint('Input tokens:', prompt_split_as_tokens)  # Display the input tokens, [17]\n\n# Display the results of the next-token prediction based on the output embedding of each token\nfor i in range(len(final_embedding)):\n    print(f'Predict results based on {i+1}th token:', [tokenizer.decode([j]) for j in logits_all_token_idx[i][:10]])  # Output the top 10 high-probability results\n    \n_=\"\"\"\nIt can be seen that when making predictions based on each token, the prediction result is the possible result of the next token after the \"current token\",\nrather than the prediction result of the entire complete input.\nTherefore, in actual prediction, only the embedding of the last token will be used for prediction.\n\"\"\"\n```\n\n    Input tokens: ['\u003C|begin_of_text|>', 'the', ' answer', ' to', ' the', ' ultimate', ' question', ' of', ' life', ',', ' the', ' universe', ',', ' and', ' everything', ' is', ' ']\n    Predict results based on 1th token: ['Question', 'def', '#', 'The', 'import', 'Tags', 'A', 'package', 'Home', 'I']\n    Predict results based on 2th token: [' ', ' best', ' first', ' most', ' new', ' world', ' last', ' same', ' way', ' number']\n    Predict results based on 3th token: [' to', ' is', ' was', ' of', ' lies', ',', ' for', ' you', ' key', ' will']\n    Predict results based on 4th token: [' the', ' this', ' your', ' all', ' that', ' a', ' my', ' life', ' \"', ' everything']\n    Predict results based on 5th token: [' question', ' problem', ' above', ' ultimate', ' first', ' r', ' following', ' questions', ' most', ' previous']\n    Predict results based on 6th token: [' question', ' questions', ' mystery', '\\xa0', ' quest', '\\n', ' life', ' philosophical', ' qu', ' problem']\n    Predict results based on 7th token: [' of', '\\n', ' to', ' is', '?\\n', ',', '.\\n', ':', '...\\n', ' about']\n    Predict results based on 8th token: [' life', ' Life', ' the', '\\xa0', ' everything', ' existence', '\\n', ' LIFE', ' all', ' human']\n    Predict results based on 9th token: [',', ' the', '\\n', ' and', ' is', ',\\n', '.\\n', '?\\n', '...\\n', '...']\n    Predict results based on 10th token: [' the', ' universe', ' and', ' etc', '\\xa0', ' is', ' death', ' of', ' or', ' everything']\n    Predict results based on 11th token: [' universe', ' Universe', '\\n', '\\xa0', ' un', ' univers', ' uni', ' cosmos', ' universal', ' u']\n    Predict results based on 12th token: [',', ' and', ' &', '\\n', ',\\n', ' ,', '...\\n', ',and', '...', '\\xa0']\n    Predict results based on 13th token: [' and', ' everything', ' 
&', ' the', ' etc', '\\xa0', ' is', ' or', ' ...\\n', ' an']\n    Predict results based on 14th token: [' everything', '\\xa0', ' the', ' every', '\\n', ' ever', ' all', ' Everything', ' EVERY', '...']\n    Predict results based on 15th token: ['\\n', ' is', '.\\n', '.', '?\\n', ',', ' (', '\\n\\n', '...\\n', ' in']\n    Predict results based on 16th token: [' ', '\\n', '...', '...\\n', ':', ' forty', ' not', ' \"', '…', ' a']\n    Predict results based on 17th token: ['42', '6', '43', '41', '4', '1', '45', '3', '2', '46']\n\n\n\n```python\n# Finally, let's take a look at what the prediction results will be if we don't mask future tokens when calculating attention\n# At this time, the prediction results based on each token will be as follows\n# It can be seen that due to the visibility of future tokens, the embeddings of each token will more accurately predict \"the next token for it\" (it's a bit like \"cheating\") \n\n_=\"\"\"\nInput tokens: ['\u003C|begin_of_text|>', 'the', ' answer', ' to', ' the', ' ultimate', ' question', ' of', ' life', ',', ' the', ' universe', ',', ' and', ' everything', ' is', ' ']\nPredict results based on 1th token: [':\u002F\u002F', '.Forms', '_REF', ' Angeles', '.swing', '�', 'php', 'во', 'ysics', '�']\nPredict results based on 2th token: [' answer', ' Hitch', ' universe', ' question', ' ultimate', ' meaning', ' hitch', ' Universe', ' Answer', ' reason']\nPredict results based on 3th token: [' to', ' is', ',', ':', ' was', '\\n', ' ', ' (', '\\n\\n', ' of']\nPredict results based on 4th token: [' the', ' life', ' this', ' which', ' everything', ' that', ' how', ' why', ' ', ' all']\nPredict results based on 5th token: [' ultimate', ' question', ' great', ' meaning', ' universe', ' Ultimate', ' everything', ' life', ' holy', ' greatest']\nPredict results based on 6th token: [' question', ' answer', ' is', ' was', '\\n', ' questions', ' mystery', '\\n\\n', ' what', ' Question']\nPredict results based on 7th token: [' of', ' is', '\\n', ',', ' about', ':', ' to', ' in', ' (', '\u003C|end_of_text|>']\nPredict results based on 8th token: [' life', ' existence', ' everything', ' Life', ' the', ' death', ' time', ' all', ' why', ' which']\nPredict results based on 9th token: [',', ' is', ' the', '\\n', ':', ' (', '...', ' and', ' ,', ' -']\nPredict results based on 10th token: [' the', ' and', ' is', ' death', ' The', ' which', ' or', '\\xa0', ' existence', ' don']\nPredict results based on 11th token: [' universe', ' answer', ' cosmos', ' world', ' existence', ' Universe', ' everything', ' un', ' meaning', ' question']\nPredict results based on 12th token: [',', ' and', ' is', ' &', '\\n', ' ,', '.', '...', ' (', ' ']\nPredict results based on 13th token: [' and', ' &', ' don', ' the', ' is', ' a', ' or', ' Douglas', '\\xa0', '\u003C|end_of_text|>']\nPredict results based on 14th token: [' everything', ' dough', ' don', ' ever', ' deep', ' Douglas', ' the', ' every', ' all', ' death']\nPredict results based on 15th token: ['\\n', ' is', ',', '.', ' ', ' (', ':', '\u003C|end_of_text|>', '\\n\\n', '.\\n']\nPredict results based on 16th token: [' ', '\\n', ' forty', '...', ' \"', '42', ' the', ':', '\\xa0', ' to']\nPredict results based on 17th token: ['42', '6', '4', '41', '1', '2', '3', '7', '5', '43']\n\"\"\"\n```\n\n# Need to predict multiple tokens? Just using KV-Cache! (It really took me a lot of effort to sort this out. 
Orz)

<h3>How to Continuously Predict Multiple Tokens</h3>

Now, we've completed the prediction of the next word for the input text. But what if our expected output requires multiple tokens?
<br>
For example, in practical LLM applications, models usually don't output just one word. Instead, they often output a passage of text, or even a very long one. How is this ability achieved?
<br>
Actually, it's quite simple. We just need to call the LLM's prediction process repeatedly to gradually generate a complete sentence or paragraph.
<br>
This process is like "snowballing". Each time we predict a word, we append it to the current input sequence and then call the model again for a new round of prediction. The prediction stops when we encounter a stop symbol (the special token "<|end_of_text|>" in Llama3) or reach the maximum length limit (the hyperparameter max_seq_len).
<br><br>
Does this sound inefficient? Yes!
<br>
That's why there are well-known caching mechanisms like KV-Cache. By caching the KV vectors of historical tokens, we can reduce the input and computational load, thus improving the inference efficiency.
<br>
Thanks to the caching mechanism, when you use a large model for inference, you may notice that waiting for the first token is often the most time-consuming part. But once the first token is out, subsequent tokens arrive much faster.
<br><br>

<h3>Advantages and Disadvantages of KV-Cache</h3>

**Advantage**: When predicting continuously, we only need to feed in the new tokens each time instead of the entire text sequence. This greatly improves the calculation speed during inference.
<br>
**Disadvantage**: Due to the caching mechanism, inference consumes more memory.
<br><br>

<h3>Principle Derivation of KV-Cache</h3>

KV-Cache comes from observing and analyzing the above matrix calculation process. Looking at the calculation of each input token, we find that in most steps the calculation of each token is actually independent and rarely involves interaction with other tokens. Only the attention calculation involves token-to-token interaction, and that is what requires caching the historical KV vectors.
<br>

<h3>Here is the specific derivation logic of KV-Cache:</h3>

1. **Premise**: To predict the next token, we only need the output result of the last token (just as we did in the prediction chapter).
2. **Non-attention parts only need to calculate the new tokens**: Except for the attention calculation, the calculations of all other parts are independent among tokens. So we only need to calculate the new tokens and don't need to feed in the historical tokens (the analysis is expanded below).
3. **Attention parts also only need to calculate the new tokens**: In the attention layer, due to the masking mechanism, the output results of historical tokens are not affected by future new tokens. Their inputs and outputs at each layer are therefore fixed; that is, the QKV vectors of historical tokens will not change because new tokens are added. Thus, we only need to calculate the attention of the new tokens.
4. **Calculate the new tokens' attention results**: The attention layer lets a token absorb the context information of the tokens before it. So, for each new token, we need to compute a weighted sum over the value vectors of all tokens. Therefore, we need to store the values of historical tokens.
5. **Calculate the new tokens' attention weights**: As seen in point 4, we first need the importance information, i.e., the weights, between the new tokens and the historical tokens. So we need the product of the query vectors of the new tokens with the key vectors of all tokens. Therefore, we need to store the keys of historical tokens.
6. **Acquisition of KV-Cache**: From points 4 and 5, we need to store the KV vectors of historical tokens. Since their query vectors are never used again, we don't need to store those. This is how the KV-cache came about.
7. **Efficiency of KV-Cache**: From point 3, the historical KV vectors never change, so the cache can be updated incrementally during continuous prediction without touching the historical content. Each prediction step then only needs to feed in and compute the newly added tokens instead of the complete sequence, greatly improving inference efficiency.

<h3>Additional: Analysis of the Independence of Token Calculation in KV-Cache</h3>

**All components except the attention layer (no interaction among tokens)**:
1. **The two normalizations**: Each token vector is normalized along its own feature dimension without using other tokens.
2. **The two residual connections (add)**: Each token vector adds its own output result to itself without using other tokens.
3. **Feed-forward network (FFN)**: Each token vector is multiplied by the same weight matrices W1, W2, W3, and no other tokens are used in the process. If the number of input tokens is 17, the FFN calculation can be written as: [17x4096] x [4096x14336] x [14336x4096] = [17x4096]. This is equivalent to feeding in one token at a time and concatenating the 17 results into a matrix, that is: 17 times ([1x4096] x [4096x14336] x [14336x4096] = [1x4096]) = 17x[1x4096] => [17x4096]. So each token passes through the feed-forward layer without interacting with the others (a small numerical check of this equivalence follows after these lists).

**Attention layer (only one-way interaction from historical tokens to new tokens)**:
1. **Calculate QKV vectors**: Each token vector is multiplied by the same QKV weight matrices without using other tokens.
2. **Add positional information to QK vectors**: Each token vector performs an independent rotation based on its own position, without using the content of other tokens.
3. **Calculate attention weights**: The attention weights represent the correlation between each token and the historical tokens preceding it, and are independent of future tokens. Therefore, the results of historical tokens are independent of the new tokens, while new tokens need the cached key vectors of the historical tokens.
4. **Calculate the result of the attention mechanism**: The attention mechanism computes a weighted sum of value vectors based on the attention weights. So, as in the previous point, the results of historical tokens are also independent of the new tokens, and the new tokens need the cached value vectors of the historical tokens.
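As promised above, here is a tiny, hedged sanity check of the per-token independence claim: the FFN output for 17 tokens at once matches running each token through it alone. The toy dimensions and random weights are mine, purely for illustration; `ffn` mirrors the (silu(XW1) * XW3)W2 computation used in this document:

```python
import torch

torch.manual_seed(0)
x = torch.randn(17, 64)                      # 17 "tokens" with a small toy dimension
w1, w3 = torch.randn(256, 64), torch.randn(256, 64)
w2 = torch.randn(64, 256)

def ffn(t):  # the same SwiGLU-style computation, applied to any number of tokens
    return torch.matmul(torch.nn.functional.silu(t @ w1.T) * (t @ w3.T), w2.T)

batched = ffn(x)                                             # all 17 tokens at once
one_by_one = torch.cat([ffn(x[i:i+1]) for i in range(17)])   # one token at a time
print(torch.allclose(batched, one_by_one, atol=1e-5))        # True: no cross-token interaction
```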
<br><br>

<h3>Attention Calculation Process Based on KV-Cache</h3>

To clearly show the calculation process, we only derive the single-head scenario (extending it to the multi-head scenario follows exactly the same principle and process as the previous multi-head attention implementation):
1. Assume the historical input tokens are $S_1$ with length N. Based on KV-Cache, we store the KV result matrices of each head; the shape for a single head is [Nxhead_dim] = [Nx128].
2. Assume the newly added input tokens are $S_2$ with length M (they can be newly predicted tokens, the input of a new round of user dialogue, or any other scenario).
3. Calculate the QKV vectors of the new tokens: $Q,K,V = S_2W_{Q,K,V}$ => [Mx4096] x [4096x128] = [Mx128].
4. Add positional information to the QK vectors: the positions of the new tokens continue from the historical length, i.e., m = N, ..., N+M-1 (with 0-indexed positions), rather than restarting from 0. [Mx128] -> [Mx128].
5. Append the new KV values to the KV cache to get the updated KV matrices, that is, [Nx128] -> [(N + M)x128].
6. Calculate the attention weights of the new tokens: Attention_weight = softmax(QK^T / sqrt(d) + mask) => [Mx128] x [128x(N + M)] = [Mx(N + M)].
7. Calculate the final attention result for the new tokens: Attention_weight x V => [Mx(N + M)] x [(N + M)x128] = [Mx128].
8. Concatenate the results of each head and perform a linear mapping to get the final output of the attention layer, with a shape of 32x[Mx128] -> [Mx4096].
<br><br>

Since our previous learning process has been quite comprehensive, we won't build a full implementation of this optimization scheme (if you're interested, the official Llama3 code is relatively easy to follow). Still, a minimal sketch of the cache bookkeeping is given below; just like the parallel calculation of multi-head attention mentioned before, knowing that the process can be optimized this way is enough~
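The sketch below is my own minimal, single-head illustration of steps 1-7 above, with toy dimensions, random weights, and RoPE omitted; the names `attend_with_cache`, `k_cache`, and `v_cache` are hypothetical, not the official implementation. It only shows how the cache grows and how each step attends over N + M positions:

```python
import torch

torch.manual_seed(0)
head_dim, dim = 8, 32                   # toy sizes standing in for 128 and 4096
wq = torch.randn(head_dim, dim) * 0.1   # per-head projection weights
wk = torch.randn(head_dim, dim) * 0.1
wv = torch.randn(head_dim, dim) * 0.1

k_cache = torch.empty(0, head_dim)      # grows to [(N+M) x head_dim] over time (step 1)
v_cache = torch.empty(0, head_dim)

def attend_with_cache(new_tokens):      # new_tokens: [M x dim] (step 2)
    """Single-head attention over the cached history plus M new tokens (steps 3-7)."""
    global k_cache, v_cache
    n_hist = k_cache.shape[0]                           # N
    q = new_tokens @ wq.T                               # step 3: [M x head_dim]
    k_cache = torch.cat([k_cache, new_tokens @ wk.T])   # step 5: [(N+M) x head_dim]
    v_cache = torch.cat([v_cache, new_tokens @ wv.T])
    scores = q @ k_cache.T / head_dim ** 0.5            # step 6: [M x (N+M)]
    # Causal mask: new token i sits at absolute position N+i and may see positions 0..N+i.
    pos = torch.arange(scores.shape[1])
    mask = pos[None, :] > (n_hist + torch.arange(q.shape[0]))[:, None]
    scores = scores.masked_fill(mask, float("-inf"))
    return torch.nn.functional.softmax(scores, dim=1) @ v_cache  # step 7: [M x head_dim]

prompt = torch.randn(17, dim)                   # "prefill": process the whole prompt at once
out = attend_with_cache(prompt)                 # [17 x 8]; the cache now holds N = 17 entries
step = attend_with_cache(torch.randn(1, dim))   # "decode": only the 1 new token is computed
print(out.shape, step.shape, k_cache.shape)     # [17, 8], [1, 8], [18, 8]
```

During prefill the whole prompt is processed at once (the most expensive step); afterwards each decode step only computes the new token against the cache, which is why the first token takes longest, as noted above. In a real model this cache is kept per head and per layer, and each new token would also receive its RoPE rotation at absolute position N + i before being cached (step 4); the per-layer, per-head caches are exactly where the extra memory cost comes from.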
or, buy me a coffee [https:\u002F\u002Fwww.buymeacoffee.com\u002Fnaklecha](https:\u002F\u002Fwww.buymeacoffee.com\u002Fnaklecha)\n\nHonestly, if you made it this far you already made my day :)\n\nwhat motivates me?\n\nMy friends and I are on a mission - to make research more accessible!\nWe created a research lab called A10 - [AAAAAAAAAA.org](http:\u002F\u002Faaaaaaaaaa.org\u002F)\n\nA10 twitter - https:\u002F\u002Ftwitter.com\u002Faaaaaaaaaaorg\n\nour thesis:\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_3b5392087d3b.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\u003Cbr>\u003Cbr>\nThanks again to the original author for the base code and illustrations, which also taught me a lot\n\u003Cbr>\u003Cbr>\n\n\n# LICENSE\n\nCopyright (c) 2025 Jinlong Zhang (https:\u002F\u002Fgithub.com\u002Ftherealoliver)\n\nCopyright (c) 2024 Nishant Aklecha\n\nMIT\n","\u003Cp align=\"center\">\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_c0e3fb6c4846.png\" width=\"600px\"\u002F>\n\u003C\u002Fp>\n\n\u003Ch1 align=\"center\">从零开始深度解析 Llama3\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Ftherealoliver\u002FDeepdive-llama3-from-scratch\u002Fblob\u002Fmain\u002FLICENSE\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Ftherealoliver\u002FDeepdive-llama3-from-scratch\" alt=\"许可证\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Ftherealoliver\u002FDeepdive-llama3-from-scratch\u002Fstargazers\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Ftherealoliver\u002FDeepdive-llama3-from-scratch\" alt=\"GitHub 星标\">\u003C\u002Fa>\n    \u003Ca href=\"#from_me\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F☕%20请我喝杯咖啡-ff69b4\" alt=\"请我喝杯咖啡\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Ch3 align=\"center\">\n    \u003Cp>\n        \u003Cb>[ 查看英文 | \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Ftherealoliver\u002FDeepdive-llama3-from-scratch\u002Fblob\u002Fmain\u002FREADME_zh.md\">中文版文档点这里\u003C\u002Fa> ]\u003C\u002Fb>\n    \u003C\u002Fp>\n\u003C\u002Fh3>\n\n---\n\n本项目是在 [naklecha\u002Fllama3-from-scratch](https:\u002F\u002Fgithub.com\u002Fnaklecha\u002Fllama3-from-scratch) 的基础上进行的增强版本。在原项目的基础上，我们对其进行了全面的改进和优化，旨在帮助大家更轻松地理解并掌握 Llama3 模型的实现原理及其详细的推导过程。感谢原作者的贡献 :)\n\u003Cbr>\u003Cbr>\n\u003Ch3>\n以下是本项目的几项核心改进：\n\u003C\u002Fh3>\n\n\n1. **结构优化**  \n   重新梳理了内容的呈现顺序，并调整了目录结构，使学习流程更加清晰合理，便于大家逐步理解代码。\n\n2. **代码注释**  \n   添加了大量的详细注释，手把手教你理解每一行代码的作用，即使是初学者也能轻松上手。\n\n3. **维度追踪**  \n   对每一步计算中矩阵维度的变化都进行了完整标注，让你能够更直观地把握整个计算流程。\n\n4. **原理讲解**  \n   增加了丰富的原理性说明和大量细致的推导过程，不仅告诉你“怎么做”，还深入解释“为什么这么做”，帮助你从根本上掌握模型的设计思想。\n\n5. **KV 缓存详解**  \n   新增了 KV 缓存的推导章节，涵盖了其核心概念、原理推导以及在注意力机制中的应用过程，让你能够从根源上理解 KV 缓存的每一个细节与设计哲学。\n\n6. 
**双语文档**  \n   提供了中英文双语版本的代码文件，其中中文翻译为本地化编写，避免了机器翻译可能导致的表达不准确问题。\n\u003Cbr>\u003Cbr>   \n\n---\n\n\u003Ch2 align=\"center\">目录\u003C\u002Fh2>\n\n- [加载模型](#loading-the-model)\n  - [加载分词器](#loading-the-tokenizer)\n  - [读取模型文件和配置文件](#reading-model-files-and-configuration-files)\n    - [通过配置文件推断模型细节](#inferring-model-details-using-the-configuration-file)\n- [将输入文本转换为嵌入向量](#convert-the-input-text-into-embeddings)\n  - [将文本转换为标记ID序列](#convert-the-text-into-a-sequence-of-token-ids)\n  - [将标记ID序列转换为嵌入向量](#convert-the-sequence-of-token-ids-into-embeddings)\n- [构建第一个Transformer块](#build-the-first-transformer-block)\n  - [归一化](#normalization)\n    - [对嵌入向量使用RMS归一化](#using-rms-normalization-for-embeddings)\n  - [从头实现单头注意力机制](#implementing-the-single-head-attention-mechanism-from-scratch)\n    - [获取输入标记对应的QKV向量](#obtain-the-qkv-vectors-corresponding-to-the-input-tokens)\n      - [获取查询向量](#obtain-the-query-vector)\n        - [展开查询权重矩阵](#unfold-the-query-weight-matrix)\n        - [获取第一个头](#obtain-the-first-head)\n        - [将标记嵌入与查询权重相乘，得到各标记对应的查询向量](#multiply-the-token-embeddings-by-the-query-weights-to-obtain-the-query-vectors-corresponding-to-the-tokens)\n      - [获取键向量（几乎与查询向量相同）](#obtain-the-key-vector-almost-the-same-as-the-query-vector)\n      - [获取值向量（几乎与键向量相同）](#obtain-the-value-vector-almost-the-same-as-the-key-vector)\n    - [为查询和键向量添加位置信息](#add-positional-information-to-the-query-and-key-vectors)\n      - [旋转位置编码（RoPE）](#rotary-position-encoding-rope)\n      - [为查询向量添加位置信息](#add-positional-information-to-the-query-vectors)\n      - [为键向量添加位置信息（与查询向量相同）](#add-positional-information-to-the-key-vectors-same-as-the-query)\n    - [一切准备就绪，开始计算标记之间的注意力权重吧。](#everythings-ready-lets-start-calculating-the-attention-weights-between-tokens)\n      - [将查询和键向量相乘，得到注意力分数。](#multiply-the-query-and-key-vectors-to-obtain-the-attention-scores)\n      - [现在必须屏蔽未来的查询-键分数。](#now-we-must-mask-the-future-query-key-scores)\n      - [计算最终的注意力权重，即对分数进行Softmax操作。](#calculate-the-final-attention-weights-that-is-softmaxscore)\n    - [终于！计算单头注意力机制的最终结果！](#finally-calculate-the-final-result-of-the-single-head-attention-mechanism)\n  - [计算多头注意力机制（简单循环重复上述过程）](#calculate-the-multi-head-attention-mechanism-a-simple-loop-to-repeat-the-above-process)\n    - [计算每个头的结果](#calculate-the-result-for-each-head)\n    - [将各头的结果合并成一个大矩阵](#merge-the-results-of-each-head-into-a-large-matrix)\n    - [头与头之间的信息交互（线性映射），自注意力层的最后一道工序！](#head-to-head-information-interaction-linear-mapping-the-final-step-of-the-self-attention-layer)\n  - [执行残差运算（加法）](#perform-the-residual-operation-add)\n  - [执行第二次归一化操作](#perform-the-second-normalization-operation)\n  - [执行FFN（前馈神经网络）层的计算](#perform-the-calculation-of-the-ffn-feed-forward-neural-network-layer)\n  - [再次执行残差运算（终于得到了Transformer块的最终输出！）](#perform-the-residual-operation-again-finally-we-get-the-final-output-of-the-transformer-block)\n- [一切都准备好了，让我们完成所有32个Transformer块的计算吧。祝阅读愉快：)](#everything-is-here-lets-complete-the-calculation-of-all-32-transformer-blocks-happy-reading-)\n- [让我们完成最后一步，预测下一个标记](#lets-complete-the-last-step-and-predict-the-next-token)\n  - [首先，对最后一个Transformer层的输出进行最后一次归一化](#first-perform-one-last-normalization-on-the-output-of-the-last-transformer-layer)\n  - [然后，基于最后一个标记对应的嵌入向量进行预测（将其线性映射到词汇表维度）](#then-make-the-prediction-based-on-the-embedding-corresponding-to-the-last-token-perform-a-linear-mapping-to-the-vocabulary-dimension)\n  - [这就是预测结果！](#heres-the-prediction-result)\n- 
[让我们深入探讨一下，不同的嵌入方式或标记掩码策略可能会如何影响预测结果：)](#lets-dive-deeper-and-see-how-different-embeddings-or-token-masking-strategies-might-affect-the-prediction-results-)\n- [需要预测多个标记吗？只需使用KV缓存即可！（这真的让我费了很大的劲才弄清楚。Orz）](#need-to-predict-multiple-tokens-just-using-kv-cache-it-really-took-me-a-lot-of-effort-to-sort-this-out-orz)\n- [感谢大家。感谢你们持续的学习。爱你们：)](#thank-you-all-thanks-for-your-continuous-learning-love-you-all-)\n  - [来自我](#from-me)\n  - [来自前代项目的作者](#from-the-author-of-predecessor-project)\n- [LICENSE](#license)\n\n---\n\n\u003Ch3>\n现在，让我们正式开始学习吧！\n\u003C\u002Fh3>\n\u003Cbr>\n\n在这个文件中，我从零开始逐次实现Llama3，每次只进行一次张量和矩阵的乘法。\n\u003Cbr>\n此外，我将直接从Meta为Llama3提供的模型文件中加载张量（Meta-Llama-3-8B），在运行此文件之前，你需要先下载权重。以下是下载权重的官方链接：https:\u002F\u002Fllama.meta.com\u002Fllama-downloads\u002F\n\u003Cbr>\u003Cbr>\n注1：本项目采用了基于Huggingface的模型文件下载方法。你将在下面的加载模型部分看到相关内容。同样，你也可以直接从官方网站、ModelScope或其他模型下载平台下载模型，而无需运行下方的模型下载代码。\n\u003Cbr>\n注2：本项目使用的是原始模型文件，即下载的模型文件中“original”文件夹内的模型。\n\u003Cbr>\u003Cbr>\n\n\u003Ch3>\n    请注意！图中有一个小错误：\u003Cbr>\n    \u003Ch4>\n        在每个Transformer块中，第二个“add”操作的输入应该是前馈层的输出和第一个“add”操作的输出，而不是归一化后的结果。\n        \u003Cbr>\n        如果我们将多头自注意力和前馈层视为同一类型的操作（都用于特征变换），那么这两个“归一化 - 特征变换 - 残差连接（add）”的形式和流程是完全相同的。\n    \u003C\u002Fh4>\n\u003C\u002Fh3>\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_a048f83904eb.png\"\u002F>\n\u003C\u002Fdiv>\n\n\n\n# 加载模型\n## 加载分词器\n\n分词器用于将输入的文本字符串分割成一系列子词，从而更容易输入到模型中。\n\u003Cbr>\n我不会自己实现一个BPE分词器（不过Andrej Karpathy有一个非常简洁的实现），\n他的实现链接：https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fminbpe\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_c52ffe6786ea.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\u003Cbr>\u003Cbr>\n\n\u003Ch3>BPE分词器加载步骤总结：\u003C\u002Fh3>\n\n1. 加载常规词汇：加载本地分词器模型字典（仅包含常规子词，不包含特殊标记）。\n2. 定义特殊标记：手动定义特殊标记（可以使用现成的，也可以在现成的基础上进行修改）。\n3. 定义文本粗切规则：定义用于文本粗切的正则表达式（直接使用现成的）。输入文本将经过两步处理：先根据正则表达式进行粗切，再根据BPE算法进行细切，最终得到完整的分词结果。\n4. 
创建分词器：基于OpenAI开源的tiktoken库创建文本编码解码对象（该库可以根据BPE算法对粗切结果进一步细分）。\n\n\n```python\n# 加载基于BPE的分词器\n\n# 导入相关库\nfrom pathlib import Path  # 用于从文件路径中获取文件名\u002F模型名\nimport tiktoken  # OpenAI开发的开源文本编码解码库（文本与标记ID之间的相互转换）\nfrom tiktoken.load import load_tiktoken_bpe  # 加载BPE模型\nimport torch  # 用于构建模型和矩阵计算\nimport json  # 用于加载配置文件\nimport matplotlib.pyplot as plt  # 用于绘制图表\n\n\ntokenizer_path = \"Meta-Llama-3-8B\u002Foriginal\u002Ftokenizer.model\"  # 分词器模型路径\n\n# 常规字典之外的特殊标记。\n# 这些特殊标记存在于“Meta-Llama-3-8B\u002F”路径下的‘tokenizer.json’和‘tokenizer_config.json’文件的‘added_tokens’字段中\nspecial_tokens = [\n            \"\u003C|begin_of_text|>\",\n            \"\u003C|end_of_text|>\",\n            \"\u003C|reserved_special_token_0|>\",  # 保留的特殊标记，编号从0到250\n            \"\u003C|reserved_special_token_1|>\",\n            \"\u003C|reserved_special_token_2|>\",\n            \"\u003C|reserved_special_token_3|>\",\n            \"\u003C|start_header_id|>\",  # 标头信息开始，用于标记包裹结构化数据的标头信息，如元数据\n            \"\u003C|end_header_id|>\",  # 标头信息结束\n            \"\u003C|reserved_special_token_4|>\",\n            \"\u003C|eot_id|>\",  # 轮次结束，用于标记多轮对话中当前轮次的结束\n        ] + [f\"\u003C|reserved_special_token_{i}|>\" for i in range(5, 256 - 5)]\n\n\n# 加载BPE模型（实际上是一个字典）\n# 字典由子词（字节类型，用utf-8解码）和排名（id）组成，共有128000个词，不包括上述256个特殊标记，\n# 因此模型字典的总大小在后续操作中将达到128256个条目（但此处未计算）。\n# 排名值是从0开始的递增序列，用于确定子词单元合并的优先级顺序，\n# 优先级越高，合并越早。因此这里的变量名为“mergeable_ranks”，而不是类似BPE或词汇表之类的名称。\n# 特殊标记未被加入字典，可能是为了灵活性考虑，\n# 这样在面对不同模型架构或需要不同特殊标记的任务时，可以方便地添加特定标记，同时保持字典大小不变。\nmergeable_ranks = load_tiktoken_bpe(tokenizer_path)\n\n\n# 创建文本编码解码对象\n# pat_str大致可分为三类：带有缩写的单词及普通单词、中文片段、1-3位数字及其他特殊字符\ntokenizer = tiktoken.Encoding(\n    name=Path(tokenizer_path).name,  # 编码器名称，在调试和日志记录时使用不同编码器会更方便\n    pat_str=r\"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+\",  # 用于初步将文本粗略分割为标记序列的正则表达式\n    mergeable_ranks=mergeable_ranks,  # 传入已加载的BPE模型\n    special_tokens={token: len(mergeable_ranks) + i for i, token in enumerate(special_tokens)},  # 用于添加特殊标记-ID对的字典\n)\n\n\n# 测试创建是否成功，即编码解码器是否能正常运行\nprint(tokenizer.decode(tokenizer.encode(\"create tokenizer successed!\")))\n\n\n# 下面是一个案例测试，用于测试pat_str的粗切与分词器的细切之间的效果和差异。\n# pat_str的正则表达式只提供初步的分割，\n# 一些长句子或中文文本可能无法被分割，这些部分会在分词器中根据BPE算法进一步细化。\nimport regex  # 由于pat_str中使用了\\p{L}等Unicode语法，因此不能使用re库\n\n## 创建正则表达式\npat_str=r\"(?i:'s|'t|'re|'ve|'m|'ll|'d)|[^\\r\\n\\p{L}\\p{N}]?\\p{L}+|\\p{N}{1,3}| ?[^\\s\\p{L}\\p{N}]+[\\r\\n]*|\\s*[\\r\\n]+|\\s+(?!\\S)|\\s+\"\npattern = regex.compile(pat_str)\n\n## 文本分割\ntext = \"Hello world! It's a test. 这是一个测试. alongwords. a long words. 123 456 789.\"  # 测试字符串\nre_tokens = pattern.findall(text)  # 使用正则表达式分割字符串\nmerge_tokens_id = tokenizer.encode(text)  # 使用分词器分割字符串\nmerge_tokens = [tokenizer.decode([i]) for i in merge_tokens_id]  # 将分词器分割结果的ID序列转换为实际的子词序列\n\n## 输出结果\nprint(\"原始字符串: \", text)\nprint(\"正则表达式分割结果: \", re_tokens)\nprint(\"分词器分割结果: \", merge_tokens)\nprint(\"分词器分割结果ID: \", list(zip(merge_tokens, merge_tokens_id)))\n\n## 从结果可以看出，所有单词前面的空格都被保留了下来，而不是被合并为单个空格标记或直接删除。\n\n## 这有利于模型正确理解单词之间的边界信息，例如示例中的‘alongwords’。\n```\n\n    创建分词器成功！\n    原始字符串： 你好世界！这是一个测试。这是一个测试. alongwords. a long words. 
123 456 789.\n    正则表达式分割结果：  ['你好', ' 世界', '!', ' 这', \"'s\", ' a', ' test', '.', ' 这是一个测试', '.', ' alongwords', '.', ' a', ' long', ' words', '.', ' ', '123', ' ', '456', ' ', '789', '.']\n    分词器分割结果：  ['你好', ' 世界', '!', ' 这', \"'s\", ' a', ' test', '.', ' 这', '是一个', '测试', '.', ' along', 'words', '.', ' a', ' long', ' words', '.', ' ', '123', ' ', '456', ' ', '789', '.']\n    分词器分割结果对应的ID：  [('你好', 9906), (' 世界', 1917), ('!', 0), (' 这', 1102), (\"'s\", 596), (' a', 264), (' test', 1296), ('.', 13), (' 这', 122255), ('是一个', 122503), ('测试', 82805), ('.', 13), (' along', 3235), ('words', 5880), ('.', 13), (' a', 264), (' long', 1317), (' words', 4339), ('.', 13), (' ', 220), ('123', 4513), (' ', 220), ('456', 10961), (' ', 220), ('789', 16474), ('.', 13)]\n\n\n## 读取模型文件和配置文件\n\n通常，读取模型文件取决于其模型类的编写方式以及其中的变量名。\n\u003Cbr>\n然而，由于我们是从零开始实现Llama3，我们将逐个读取张量文件。\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_76842a6c1f45.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\n# 加载模型，即一个字典，如{\"网络层名称\": 张量类型参数}\nmodel = torch.load(\"Meta-Llama-3-8B\u002Foriginal\u002Fconsolidated.00.pth\")\n\n# 打印前20个网络层的名字，以验证模型是否正确加载。\nprint(json.dumps(list(model.keys())[:20], indent=4))\n```\n\n    [\n        \"tok_embeddings.weight\",\n        \"layers.0.attention.wq.weight\",\n        \"layers.0.attention.wk.weight\",\n        \"layers.0.attention.wv.weight\",\n        \"layers.0.attention.wo.weight\",\n        \"layers.0.feed_forward.w1.weight\",\n        \"layers.0.feed_forward.w3.weight\",\n        \"layers.0.feed_forward.w2.weight\",\n        \"layers.0.attention_norm.weight\",\n        \"layers.0.ffn_norm.weight\",\n        \"layers.1.attention.wq.weight\",\n        \"layers.1.attention.wk.weight\",\n        \"layers.1.attention.wv.weight\",\n        \"layers.1.attention.wo.weight\",\n        \"layers.1.feed_forward.w1.weight\",\n        \"layers.1.feed_forward.w3.weight\",\n        \"layers.1.feed_forward.w2.weight\",\n        \"layers.1.attention_norm.weight\",\n        \"layers.1.ffn_norm.weight\",\n        \"layers.2.attention.wq.weight\"\n    ]\n\n\n\n```python\n# 加载配置文件。\n# 每个配置的具体含义将在下一节中说明。\nwith open(\"Meta-Llama-3-8B\u002Foriginal\u002Fparams.json\", \"r\") as f:\n    config = json.load(f)\nconfig\n```\n\n\n\n\n    {'dim': 4096,\n     'n_layers': 32,\n     'n_heads': 32,\n     'n_kv_heads': 8,\n     'vocab_size': 128256,\n     'multiple_of': 1024,\n     'ffn_dim_multiplier': 1.3,\n     'norm_eps': 1e-05,\n     'rope_theta': 500000.0}\n\n\n\n### 利用配置文件推断模型细节\n\n| 配置项 | 配置值 | 含义 |\n| ---- | ---- | ---- |\n| dim | 4096 | 隐藏层的维度，即每个标记的向量表示具有4096维。 |\n| n_layers | 32 | 模型层数，即该模型包含32个Transformer层或Transformer块。 |\n| n_heads | 32 | 多头注意力机制中的头数，即每个多头注意力块有32个头。所谓多头，是指同时使用多个独立的注意力机制来捕捉输入数据的不同特征或信息。 |\n| n_kv_heads | 8 | 键值注意力中的头数，用于分组查询注意力（GQA）。也就是说，键值注意力有8个头，而查询有n_heads=32个头。每4个查询头会共享一组键值对。 |\n| vocab_size | 128256 | 词汇表大小，包括128000个普通标记和256个特殊标记。 |\n| multiple_of | 1024 | 隐藏层维度的倍数约束。也就是说，为了优化计算效率，模型的隐藏层维度应为1024的倍数。 |\n| ffn_dim_multiplier | 1.3 | 前馈网络层隐藏层维度的乘数因子，用于计算FFN的隐藏层维度。具体计算过程见相应部分。 |\n| norm_eps | 1e-05 | 层归一化计算中分母上添加的常数，用于防止除以零并确保数值稳定性。 |\n| rope_theta | 500000.0 | 旋转位置编码（RoPE）中的基本频率缩放因子，控制位置编码的周期性和分辨率，从而影响模型捕捉不同长度序列及位置关系的能力。 |\n\n\u003Cbr>\n\n\u003Ch3>\n根据配置详情，可以推断出给定输入时注意力的内部计算过程如下：\n\u003C\u002Fh3>\n\n\u003Cpre>\n输入(L, 4096) -> query_proj(L, 128, 32)\n               -> key_proj(L, 128, 8)\n               -> value_proj(L, 128, 8)\n                                           -> group_query_attention(L, 128, 
32)\n                                           -> output_proj(L, 4096)\n                                                                                   -> 输出(L, 4096)\n\u003C\u002Fpre>\n\n\n```python\n# 记录这些配置，后续将逐步使用。\ndim = config[\"dim\"]\nn_layers = config[\"n_layers\"]\nn_heads = config[\"n_heads\"]\nn_kv_heads = config[\"n_kv_heads\"]\nvocab_size = config[\"vocab_size\"]\nmultiple_of = config[\"multiple_of\"]\nffn_dim_multiplier = config[\"ffn_dim_multiplier\"]\nnorm_eps = config[\"norm_eps\"]\nrope_theta = torch.tensor(config[\"rope_theta\"])\n```\n\n# 将输入文本转换为嵌入\n\n在将字符串形式的文本输入到网络层之前，需要将其转换为向量形式以便进行数学计算。\n\u003Cbr>\n所需步骤是：使用分词器将输入文本拆分为子词序列 -> 将子词转换为向量表示。\n\n## 将文本转换为标记ID序列\n这里我们使用tiktoken（OpenAI提供的库）作为分词器。\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_c4ace96c20b0.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\n# 将输入提示转换为标记ID序列\nprompt = \"the answer to the ultimate question of life, the universe, and everything is \"  # 输入文本。注意：需保留这句英文原文（“生命、宇宙以及一切问题的终极答案是 ”），才能与下方输出及全文固定的 17 个标记保持一致\ntokens = [128000] + tokenizer.encode(prompt)  # 进行子词分割，并在文本开头添加一个表示文本开始的特殊标记\u003C|begin_of_text|>。维度：[17]\nprint(tokens)  # 检查分割结果\ntokens = torch.tensor(tokens)  # 转换为张量类型，以便后续进行矩阵计算。[17]\n\n# 将标记 ID 转换为特定的标记子词序列，这仅用于显示目的，并非实际所需\nprompt_split_as_tokens = [tokenizer.decode([token]) for token in tokens]\nprint(prompt_split_as_tokens)\n```\n\n    [128000, 1820, 4320, 311, 279, 17139, 3488, 315, 2324, 11, 279, 15861, 11, 323, 4395, 374, 220]\n    ['\u003C|begin_of_text|>', 'the', ' answer', ' to', ' the', ' ultimate', ' question', ' of', ' life', ',', ' the', ' universe', ',', ' and', ' everything', ' is', ' ']\n\n\n## 将标记 ID 序列转换为嵌入\n\n抱歉，这是该代码库中唯一使用内置神经网络模块的部分。\n\u003Cbr>\n简而言之，我们原来的 [17×1] 标记序列现在变成了 [17×4096]，即 17 个长度为 4096 的嵌入（每个标记对应一个）。\n\u003Cbr>\u003Cbr>\n注意：请留意这个张量形状的变化，这将有助于你更好地理解整个过程（我也会在所有步骤中标注形状的变化）。\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_533946884ac0.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\n# 创建一个嵌入层，用于将离散的标记 ID 映射到连续的向量空间\nembedding_layer = torch.nn.Embedding(vocab_size, dim)\n\n# 使用 Llama3 中的预训练参数更新嵌入层的参数\nembedding_layer.weight.data.copy_(model[\"tok_embeddings.weight\"])\n\n# 使用嵌入层将输入的标记 ID 序列转换为向量表示\n# 嵌入层只是根据 ID 在字典中查找对应的向量，不涉及标记之间的交互。\n# [17] -> [17×4096]\ntoken_embeddings_unnormalized = embedding_layer(tokens).to(torch.bfloat16)  # 默认是全精度的 float32，这里改为半精度格式以减少内存占用。\n\ntoken_embeddings_unnormalized.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n# 构建第一个 Transformer 块\n\n从下面所示的第一个 Transformer 块所涉及的预训练参数来看，它包括：\n1. 两个归一化层（attention_norm 和 ffn_norm）\n2. 注意力机制的实现（4 个 attention.w）\n3. 前馈网络层的实现（3 个 feed_forward.w）\n4. （当然，还包括两个不需要预训练参数的残差连接操作）\n\n一般来说，Transformer 块中的操作流程如下：\n\u003Cbr>\n归一化 -> 多头自注意力 -> 残差连接 -> 归一化 -> 前馈神经网络 -> 残差连接\n\n\n```python\n# 展示第一个 Transformer 块的所有权重参数及其形状\nfor k, v in model.items():\n    if not k.startswith('layers'):\n        continue\n    if k.startswith('layers.1'):\n        break\n    print(k, v.shape)\n```\n\n    layers.0.attention.wq.weight torch.Size([4096, 4096])\n    layers.0.attention.wk.weight torch.Size([1024, 4096])\n    layers.0.attention.wv.weight torch.Size([1024, 4096])\n    layers.0.attention.wo.weight torch.Size([4096, 4096])\n    layers.0.feed_forward.w1.weight torch.Size([14336, 4096])\n    layers.0.feed_forward.w3.weight torch.Size([14336, 4096])\n    layers.0.feed_forward.w2.weight torch.Size([4096, 14336])\n    layers.0.attention_norm.weight torch.Size([4096])\n    layers.0.ffn_norm.weight torch.Size([4096])\n\n\n这里需要注意两点：\n1. 
神经网络权重矩阵的形状是 (输出维度, 输入维度)。在计算时，参数矩阵 W 会先转置为 (输入维度, 输出维度)，然后与输入 X 相乘，即输出 Y = XW.T。这一点你会在后续计算中看到。\n2. 由于 Llama3 使用分组注意力机制，每 4 个查询头会共享一组键值向量（详情请参阅上面关于配置文件细节的部分）。因此，键值权重矩阵的维度是 [1024, 4096], 是查询权重矩阵 [4096, 4096] 的 1\u002F4。\n\n## 归一化\n\n归一化操作旨在约束数据中的尺度差异，避免因向量数值差异过大而导致训练过程不稳定等问题。\n\u003Cbr>\u003Cbr>\n归一化后，张量的形状仍保持为 [17×4096]，与嵌入的形状相同。\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_b6286bc7e59d.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\n### 对嵌入使用 RMS 归一化\n\nLlama3 使用均方根（RMS）归一化方法，其计算公式如图所示。\n\u003Cbr>\n需要注意的是，我们需要一个 norm_eps 参数（来自配置），以防止 RMS 被意外地设为 0，从而导致除以零的错误。\n\u003Cbr>\n公式如下：\n\u003Cdiv>\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_2f575a0b6e6f.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\u003Cbr>\n\n此外，你可能已经注意到公式中的 gi 参数。这是一个在模型训练过程中学习到的缩放因子，用于再次缩放每个维度的归一化结果，以增强模型的表达能力。它的维度与嵌入的特征维度相同，即 [4096]。\n\n\n```python\n# 定义 RMS 归一化的计算函数\n# 每个标记将被独立归一化\n# norm_weights 是预训练的缩放因子（即公式中的 gi），用于增强模型的表征能力。可以从模型文件中加载，具有 4096 个维度\n# torch.rsqrt 用于计算张量平方根的倒数，即 1\u002FRMS(a)\ndef rms_norm(tensor, norm_weights):\n    return (tensor * torch.rsqrt(tensor.pow(2).mean(-1, keepdim=True) + norm_eps)) * norm_weights\n```\n\n\n```python\n# 对输入进行归一化\ntoken_embeddings = rms_norm(token_embeddings_unnormalized, model[\"layers.0.attention_norm.weight\"])  # [17×4096] & [4096] -> [17×4096]\nmodel[\"layers.0.attention_norm.weight\"].shape, token_embeddings.shape\n```\n\n\n\n\n    (torch.Size([4096]), torch.Size([17, 4096]))\n\n## 从零开始实现单头注意力机制\n\n在每一层的多头注意力计算中，涉及32个头。然而，这些头的计算过程是完全相同且相互独立的。因此，在本节中，我们将首先实现单头注意力的计算过程，并在下一节将其扩展到多头计算。\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_08965382b82c.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\u003Cbr>\u003Cbr>\n\n\u003Ch3>\n注意力机制的核心计算公式如图所示。\n\u003C\u002Fh3>\n\n1. 我们需要通过对输入嵌入进行线性映射，得到查询、键和值向量。\n2. 随后，基于查询和键向量，我们计算出各个标记之间的注意力权重，即对于每个标记，其他标记对其的重要性或相关性的得分。\n3. 
最后，根据注意力权重对值向量进行加权，得到每个标记对应的注意力结果。\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_8574b10f57e4.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n回到正题。让我们先加载第一层Transformer的注意力头。\n\u003Cbr>\n&gt; 当我们从模型中加载查询、键、值以及输出权重矩阵时（输出权重用于融合多个头的信息，以生成最终的注意力输出），我们会发现它们的形状分别是：[4096×4096]、[1024×4096]、[1024×4096]、[4096×4096]。\n\u003Cbr>\n&gt; 初看起来这似乎有些奇怪，因为理想情况下，我们希望每个头的q、k、v彼此独立（在这种情况下，它们的形状应该是：32×[128×4096]、8×[128×4096]、8×[128×4096]）。\n\u003Cbr>\n&gt; 代码作者将它们捆绑在一起，是因为这样有助于并行化注意力头的乘法计算。\n\u003Cbr>\n&gt; 但我们将会把这一切展开...\n\n\n```python\n# 显示当前q、k、v和o的注意力权重矩阵的形状。\nprint(\n    model[\"layers.0.attention.wq.weight\"].shape,  # [4096×4096]\n    model[\"layers.0.attention.wk.weight\"].shape,  # [1024×4096]\n    model[\"layers.0.attention.wv.weight\"].shape,  # [1024×4096]\n    model[\"layers.0.attention.wo.weight\"].shape   # [4096×4096]\n)\n```\n\n    torch.Size([4096, 4096]) torch.Size([1024, 4096]) torch.Size([1024, 4096]) torch.Size([4096, 4096])\n\n\n### 获取输入标记对应的QKV向量\n\n在本节中，我们将把输入的标记嵌入转换为查询、键和值向量，以便进行注意力机制的计算。\n\n#### 获取查询向量\n\n##### 展开查询权重矩阵\n\n我们首先将来自多个注意力头的查询展开，最终的形状将是[32×128×4096]。\n\u003Cbr>\n这里，32是Llama3中的注意力头数量，128是查询头的向量维度，而4096则是标记嵌入的维度（嵌入维度位于最后一个维度的原因是，在进行输入与权重相乘时，通常是X*W.T，即与权重的转置相乘）。\n\n\n```python\n# 加载并修改第0层的查询权重矩阵的形状，以将其展开为多头形式\nq_layer0 = model[\"layers.0.attention.wq.weight\"]  # 默认形状为[4096×4096]\nhead_dim = q_layer0.shape[0] \u002F\u002F n_heads  # 注意力头的维度，4096\u002F32 = 128\nq_layer0 = q_layer0.view(n_heads, head_dim, dim)  # 展开后的维度，[32×128×4096]\nq_layer0.shape\n```\n\n\n\n\n    torch.Size([32, 128, 4096])\n\n\n\n##### 获取第一个头\n在这里，我访问了第一层查询权重矩阵的第一个头。该查询权重矩阵的形状是[128×4096]。\n\n\n```python\n# 提取第一个头的权重\nq_layer0_head0 = q_layer0[0]  # [32×128×4096] -> [128×4096]\nq_layer0_head0.shape\n```\n\n\n\n\n    torch.Size([128, 4096])\n\n\n\n##### 将标记嵌入与查询权重相乘，得到标记对应的查询向量\n\n在这里，你可以看到结果的形状是[17×128]。这是因为我们有17个标记，而对于每个标记，都有一个长度为128的查询向量。\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_832fd0cbc155.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\n# 计算第一个查询头上的输入查询值\n# Q0_head0 = XW0_Q_head0.T\nq_per_token = torch.matmul(token_embeddings, q_layer0_head0.T)  # [17×4096] x [4096×128] = [17×128]\nq_per_token.shape\n```\n\n\n\n\n    torch.Size([17, 128])\n\n\n\n#### 获取键向量（几乎与查询向量相同）\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_fb63a357c607.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n我想偷个懒，所以不再详细说明键向量的计算过程了。Orz。你只需要记住一点：\n\u003Cbr>\n&gt; 键同样会生成一个128维的向量。\n\u003Cbr>\n&gt; 键的权重矩阵参数数量仅为查询的四分之一，这是因为每个键的权重由4个头同时共享，从而减少了所需的计算量。\n\n\n```python\n# 加载并修改第0层的键权重矩阵的形状，使其以多头形式展开\n# 与查询权重矩阵不同，键有8个注意力头，因此其参数数量是查询矩阵的四分之一\nk_layer0 = model[\"layers.0.attention.wk.weight\"]  # [1024×4096]\nk_layer0 = k_layer0.view(n_kv_heads, k_layer0.shape[0] \u002F\u002F n_kv_heads, dim) # [8×128×4096]\nk_layer0.shape\n```\n\n\n\n\n    torch.Size([8, 128, 4096])\n\n\n\n\n```python\n# 提取第一个头的权重\nk_layer0_head0 = k_layer0[0]  # [8×128×4096] -> [128×4096]\nk_layer0_head0.shape\n```\n\n\n\n\n    torch.Size([128, 4096])\n\n\n\n\n```python\n# 计算第一个头对应的输入键向量\n# K0_head0 = XW0_K_head0.T\nk_per_token = torch.matmul(token_embeddings, k_layer0_head0.T)  # [17×4096] x [4096×128] = [17×128]\nk_per_token.shape\n```\n\n\n\n\n    torch.Size([17, 128])\n\n\n\n#### 获取值向量（几乎与键向量相同）\n\n\u003Cdiv>\n    \u003Cimg 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_8fc2f0557634.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n&gt; 类似于键的权重，值的权重也是由每4个注意力头共享的（以节省计算量）。\n\u003Cbr>\n&gt; 因此，值权重矩阵的形状是[8×128×4096]。\n\n\n```python\n# 加载并修改第0层的值权重矩阵的形状，使其以多头形式展开\n# 与键权重矩阵类似，值也有8个注意力头，因此其参数数量同样是查询矩阵的四分之一\nv_layer0 = model[\"layers.0.attention.wv.weight\"]  # [1024×4096]\nv_layer0 = v_layer0.view(n_kv_heads, v_layer0.shape[0] \u002F\u002F n_kv_heads, dim)  # [1024×4096] -> [8×128×4096]\nv_layer0.shape\n```\n\n\n\n\n    torch.Size([8, 128, 4096])\n\n\n\n\n```python\n# 提取第一个头的权重\nv_layer0_head0 = v_layer0[0]  # [8×128×4096] -> [128×4096]\nv_layer0_head0.shape\n```\n\n\n\n\n    torch.Size([128, 4096])\n\n\n\n\n```python\n# 计算第一个头对应的输入值向量\n\n# V0_head0 = XW0_V_head0.T\nv_per_token = torch.matmul(token_embeddings, v_layer0_head0.T)  # [17x4096] x [4096x128] = [17x128]\nv_per_token.shape\n```\n\n\n\n\n    torch.Size([17, 128])\n\n\n\n### 将位置信息添加到查询和键向量中\n\n- 对于自然语言来说，词语之间的顺序关系和相对位置极其重要。例如，“The dog bites the man”和“The man bites the dog”具有完全不同的语义信息。此外，我们的直觉也告诉我们，距离较近的词语之间的相关性通常大于距离较远的词语。\n- 因此，在注意力计算过程中，我们需要为每个标记提供位置信息，以便模型能够更好地捕捉序列中的依赖关系。\n- 为什么要把位置信息加到查询和键向量上？因为查询和键向量用于计算注意力权重，即每个标记对其他标记的重要性。这就要求它们在计算相似度时，能够同时知道任意两个标记的位置及其相对位置关系。\n- 为什么不把位置信息加到值向量上呢？因为值向量只用于加权求和。位置信息已经在查询和键的交互中被考虑到了，因此值向量只需要提供内容信息即可。\n\n我们将使用RoPE（旋转位置编码）来为这些向量添加位置信息。\n\n#### 旋转位置编码（RoPE）\n\n你可以观看这个视频来详细了解它的数学原理（这也是我观看过的视频）：\nhttps:\u002F\u002Fwww.youtube.com\u002Fwatch?v=o29P0Kpobz0&t=530s\n\u003Cbr>\u003Cbr>\nRoPE 的基本思想是将向量视为处于复数空间中，然后根据位置生成特定的旋转矩阵。通过将向量与旋转矩阵相乘，可以在复数空间中实现旋转，从而将相对位置信息添加到向量中。（也就是说，将输入向量之间的位置关系看作是在复数空间中以不同角度进行的旋转。）\n\u003Cbr>\n（类似于机器人运动学中通过基于三角函数的矩阵乘法来实现平面位置坐标绕轴的旋转。）\n\u003Cbr>\u003Cbr>\nRoPE 通常应用于自注意力机制中的查询和键向量。在计算注意力分数时，首先会根据 RoPE 的相应旋转矩阵对查询和键向量进行旋转，然后再进行点积计算和 softmax 归一化等操作。这样，Transformer 在计算注意力时就能考虑到位置信息，从而更好地捕捉文本中的依赖关系。\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_fde9bd425df6.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\u003Cbr>\n\n\u003Ch3>\nRoPE 的具体计算过程如下：\n\u003C\u002Fh3>\n\n1. 将每个向量的维度分成若干对（因为高维旋转矩阵的推导较为复杂，且维度过高会显著增加计算复杂度，而二维旋转的公式相对成熟且简单，易于计算）。\n2. 对每一对，计算 $\\Large \\theta=\\frac{1}{rope\\\\_theta^{i\u002FD}}$，其中 $i$ 是第 $i$ 对，$D$ 是总对数。这表示当前维度对在向量中的位置信息。\n3. 对于每个向量，计算 $\\Large m$，表示该向量对应于第 $m$ 个标记。即当前向量在整个输入向量中的位置信息。\n4. 对于每一对，![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_a12f47598060.png) ，其中 $res$ 是向量对在复数空间中旋转 $m\\theta$ 度后的结果。\n5. 对所有向量的所有维度对重复上述计算，得到最终的 RoPE 结果。\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_8c5c472c7eb2.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\u003Cbr>\n\n\u003Ch3>\n在实际代码实现中，为了简化计算过程，上述基于旋转矩阵的计算（步骤 4）会被转换为复数域内的计算。其原理如下：\n\u003C\u002Fh3>\n\n1. 直角坐标 $(x, y)$ 可以被视为复数 $\\large x + yi$ 在复平面上的表示。\n2. 复数的极坐标形式可以表示为 $\\large re^{i\\theta}$，其中 $r$ 是模长，$\\theta$ 是角度。\n3. 极坐标下的乘法运算 $\\large r_1e^{i\\theta_1} \\times r_2e^{i\\theta_2} = r_1r_2e^{i(\\theta_1 + \\theta_2)}$ 可以理解为将坐标_1 的长度放大 $r_2$ 倍，并将其旋转 $\\theta_2$ 度。\n4. 因此，如果想要将坐标旋转 $m\\theta$ 度，可以定义一个模长为 1、角度为 $m\\theta$ 的旋转因子 $\\large e^{im\\theta}$。将其与坐标相乘，就相当于基于旋转矩阵的旋转方法。\n5. 此外，根据欧拉公式，我们有 $\\large re^{i\\theta} = r\\cos\\theta + r\\sin{\\theta i} = x + yi$，以及 $\\large e^{im\\theta} = \\cos{m\\theta} + \\sin{m\\theta i}$。\n6. 
因此，将二维坐标 $(x, y)$ 旋转 $m\\theta$ 度可以通过 $\\large re^{i\\theta^\\prime} \\times e^{im\\theta} = (x + yi) \\times (\\cos{m\\theta} + sin{m\\theta i})$ 来实现（两个复数的乘积）。\n\n#### 向查询向量中添加位置信息\n\n在接下来的步骤中，我们将首先把查询向量按维度方向分成若干对，然后按照上述步骤对每一对进行角度旋转。\n\u003Cbr>\u003Cbr>\n现在我们有一个形状为 [17x64x2] 的向量。这是通过将提示中每个标记对应的 128 维查询向量分成 64 对，并对每一对旋转 $m\\theta$ 度得到的。\n\n\n```python\n# 按维度方向将查询向量分成若干对。\n# .float() 是为了切换回双精度，以确保后续三角函数计算的精度和数值稳定性。\n# [17x128] -> [17x64x2]\nq_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)\nq_per_token_split_into_pairs.shape\n```\n\n\n\n\n    torch.Size([17, 64, 2])\n\n\n\n\u003Ch3>\n开始获取旋转矩阵的复数域表示。\n\u003C\u002Fh3>\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_17d4fd3664b5.png\" width=\"600\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\n# 计算 θ。第一步：计算 i\u002FD。\n\n# [64]\nzero_to_one_split_into_64_parts = torch.tensor(range(64))\u002F64  # 每个特征分割后有64对维度，因此需要64个θ值\nzero_to_one_split_into_64_parts\n```\n\n\n\n\n    tensor([0.0000, 0.0156, 0.0312, 0.0469, 0.0625, 0.0781, 0.0938, 0.1094, 0.1250,\n            0.1406, 0.1562, 0.1719, 0.1875, 0.2031, 0.2188, 0.2344, 0.2500, 0.2656,\n            0.2812, 0.2969, 0.3125, 0.3281, 0.3438, 0.3594, 0.3750, 0.3906, 0.4062,\n            0.4219, 0.4375, 0.4531, 0.4688, 0.4844, 0.5000, 0.5156, 0.5312, 0.5469,\n            0.5625, 0.5781, 0.5938, 0.6094, 0.6250, 0.6406, 0.6562, 0.6719, 0.6875,\n            0.7031, 0.7188, 0.7344, 0.7500, 0.7656, 0.7812, 0.7969, 0.8125, 0.8281,\n            0.8438, 0.8594, 0.8750, 0.8906, 0.9062, 0.9219, 0.9375, 0.9531, 0.9688,\n            0.9844])\n\n\n\n\n```python\n# 计算θ。步骤2：获取θ。\n# rope_theta用于控制位置编码的周期性等信息。\n# 详情请参阅配置信息部分。\nfreqs = 1.0 \u002F (rope_theta ** zero_to_one_split_into_64_parts)  # [64]\nfreqs\n```\n\n\n\n\n    tensor([1.0000e+00, 8.1462e-01, 6.6360e-01, 5.4058e-01, 4.4037e-01, 3.5873e-01,\n            2.9223e-01, 2.3805e-01, 1.9392e-01, 1.5797e-01, 1.2869e-01, 1.0483e-01,\n            8.5397e-02, 6.9566e-02, 5.6670e-02, 4.6164e-02, 3.7606e-02, 3.0635e-02,\n            2.4955e-02, 2.0329e-02, 1.6560e-02, 1.3490e-02, 1.0990e-02, 8.9523e-03,\n            7.2927e-03, 5.9407e-03, 4.8394e-03, 3.9423e-03, 3.2114e-03, 2.6161e-03,\n            2.1311e-03, 1.7360e-03, 1.4142e-03, 1.1520e-03, 9.3847e-04, 7.6450e-04,\n            6.2277e-04, 5.0732e-04, 4.1327e-04, 3.3666e-04, 2.7425e-04, 2.2341e-04,\n            1.8199e-04, 1.4825e-04, 1.2077e-04, 9.8381e-05, 8.0143e-05, 6.5286e-05,\n            5.3183e-05, 4.3324e-05, 3.5292e-05, 2.8750e-05, 2.3420e-05, 1.9078e-05,\n            1.5542e-05, 1.2660e-05, 1.0313e-05, 8.4015e-06, 6.8440e-06, 5.5752e-06,\n            4.5417e-06, 3.6997e-06, 3.0139e-06, 2.4551e-06])\n\n\n\n\n```python\n# 计算mθ\n# 'outer'用于计算外积，'arange(17)'表示每个向量对应的m值（由于输入有17个token，因此需要17个m值）。\n# 结果的形状为[17x64]，这意味着每个token对应的向量都有64个mθ值，这些值用于计算64对维度的旋转。\nfreqs_for_each_token = torch.outer(torch.arange(17), freqs)  # [17] & [64] -> [17x64]\n```\n\n\n```python\n# 获取(cos mθ + sin mθ i)，即把mθ转换为复数形式\n# 将旋转角度mθ视为模为1的极坐标形式，然后将其转换为复数表示\n# 'polar'的两个输入分别表示模（设置为1，意味着只改变角度而不影响长度）和角度（即mθ）\nfreqs_cis = torch.polar(torch.ones_like(freqs_for_each_token), freqs_for_each_token)  # [17x64] -> [17x64]\nprint(freqs_cis.shape)\n\n# 查看freqs_cis在某些位置的值，仅用于展示\ntoken_to_show = [1, 3, 5]  # 查看第2、4、6行\nfig, axs = plt.subplots(1, len(token_to_show), figsize=(5 * len(token_to_show), 4))  # 生成一个包含3个子图的单行图窗\nfor i, index in enumerate(token_to_show):\n    value = freqs_cis[index]\n    for j, element in enumerate(value):\n  
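      # freqs_cis[index] 中是当前 token 对应的 64 个旋转因子（模长为 1 的复数），逐一画出便可直观看到各维度对的旋转角度\n  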
      # 从原点到坐标点画一条蓝色线，实部作为x坐标，虚部作为y坐标。\n        axs[i].plot([0, element.real], [0, element.imag], color='blue', linewidth=1, label=f\"Index: {j}\")\n        # 用红色数字标注表示第i对维度。\n        axs[i].annotate(f\"{j}\", xy=(element.real, element.imag), color='red')\n    axs[i].set_xlabel('Real')\n    axs[i].set_ylabel('Imaginary')\n    axs[i].set_title(f'Plot of {index + 1}th of freqs_cis')\nplt.show()\n\n\"\"\"\n注意：如图所示，位置靠后的token具有更大的旋转角度，但在单个token内，较早的向量维度具有更大的旋转角度。\n      如果感兴趣，可以进一步探索这背后是否存在数学原因。X_X\n\"\"\"\n```\n\n    torch.Size([17, 64])\n\n\n\n    \n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_95b7152370ba.png)\n    \n\n\n\n\n\n    '\\n注意：如图所示，位置靠后的token具有更大的旋转角度，但在单个token内，较早的向量维度具有更大的旋转角度。\\n      如果感兴趣，可以进一步探索这背后是否存在数学原因。X_X\\n'\n\n\n\n\u003Ch3>\n现在我们已经为每个token对应的查询向量的每一对维度提供了一个复数（一个改变角度的向量）。\n\u003C\u002Fh3>\n\u003Cbr>\n现在我们可以将我们的查询（已分成对的）转换为复数，然后通过点积计算来旋转这些查询。 :)\n\n\n```python\n# 获取(x + yi)\n# 即将维度对转换为复数。转换后，维度的形状将由[17x64x2]变为[17x64]。\nq_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)  # [17x64x2] -> [17x64]\nq_per_token_as_complex_numbers.shape\n```\n\n\n\n\n    torch.Size([17, 64])\n\n\n\n\n```python\n# 计算(x + yi) * (cos mθ + sin mθ i)\n# 即执行旋转操作以得到最终结果。\nq_per_token_as_complex_numbers_rotated = q_per_token_as_complex_numbers * freqs_cis  # [17x64] * [17x64] = [17x64]\nq_per_token_as_complex_numbers_rotated.shape\n```\n\n\n\n\n    torch.Size([17, 64])\n\n\n\n\u003Ch3>\n获取旋转后的向量（恢复形状）。\n\u003C\u002Fh3>\n\u003Cbr>\n我们可以将复数再次表示为实数，从而以维度对的形式获得查询结果。\n\n\n```python\n# 将复数结果转换回实数维度对的形式。\nq_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers_rotated)  # [17x64] -> [17x64x2]\nq_per_token_split_into_pairs_rotated.shape\n```\n\n\n\n\n    torch.Size([17, 64, 2])\n\n\n\n合并旋转后的维度。这样，我们就得到了一个新的查询向量（旋转后的查询向量），其形状为[17x128]，其中17表示token的数量，128表示查询向量的维度。\n\n\n```python\n# 将维度对的结果恢复为原始的查询向量形式，得到最终的查询向量。\nq_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)  # [17x64x2] -> [17x128]\nq_per_token_rotated.shape\n```\n\n\n\n\n    torch.Size([17, 128])\n\n\n\n#### 为键向量添加位置信息（与查询相同）\n\n\n```python\n# 沿维度方向将键向量分成对，形成维度对（修改形状）。\nk_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0], -1, 2)  # [17x128] -> [17x64x2]\nk_per_token_split_into_pairs.shape\n```\n\n\n\n\n    torch.Size([17, 64, 2])\n\n\n\n\n```python\n# 获取(x + yi)\n\n# 即将维度对转换为复数。转换后，维度的形状将从 [17x64x2] 变为 [17x64]。\nk_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs)  # [17x64x2] -> [17x64]\nk_per_token_as_complex_numbers.shape\n```\n\n\n\n\n    torch.Size([17, 64])\n\n\n\n\n```python\n# 计算 (x + yi) * (cosmθ + sinmθi)\n# 即执行旋转操作以得到最终结果。\n# 然后将结果转换回实数形式。\nk_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers * freqs_cis)  # [17x64] * [17x64] = [17x64] -> [17x64x2]\nk_per_token_split_into_pairs_rotated.shape\n```\n\n\n\n\n    torch.Size([17, 64, 2])\n\n\n\n\n```python\n# 将维度对的结果恢复为原始的键向量形式，从而得到最终的键向量。\nk_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)  # [17x64x2] -> [17x128]\nk_per_token_rotated.shape\n```\n\n\n\n\n    torch.Size([17, 128])\n\n\n\n\u003Ch3>\n至此，我们已经得到了每个 token 对应的旋转后的查询向量和键向量。\n\u003C\u002Fh3>\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_e044c5abbf69.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n每个查询和键向量的形状仍然是 [17x128]。\n\n### 一切准备就绪，现在开始计算 token 
之间的注意力权重。\n\n这将涉及三个步骤：\n1. 计算注意力分数：score = Q x K\n2. 掩码未来 token：score = mask(score)\n3. 计算注意力权重：res = softmax(score)\n\n让我们开始吧！ :)\n\n#### 将查询向量和键向量相乘，得到注意力分数。\n\n通过这种方式，我们将得到每个 token 与所有其他 token 之间的分数值。\n\u003Cbr>\n这些分数表示每个 token 的查询与所有其他 token 的键之间的相关性强度。\n\u003Cbr>\n这就是自注意力机制！\n\u003Cbr>\n这个注意力分数矩阵（qk_per_token）的形状是 [17x17]，其中 17 是输入 prompt 中的 token 数量。\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_8804d1a9ce96.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\n# 计算注意力分数\n# 同时进行归一化处理，以防止后续的 softmax 计算结果过于偏向 0 或 1，\n# （当维度较大时，点积值可能会过大），\n# 这可能导致梯度消失或梯度爆炸，从而保持数值稳定性。\nqk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T)\u002F(head_dim)**0.5  # [17x128] x [128x17] = [17x17]\nqk_per_token.shape\n```\n\n\n\n\n    torch.Size([17, 17])\n\n\n\n#### 现在我们需要对未来的查询-键分数进行掩码处理。\n\n在 Llama 3 的训练过程中，未来 token 的 QK 分数会被掩码掉。\n\u003Cbr>\n为什么呢？因为在训练时，我们只学习如何利用过去的 token 来预测当前的 token。如果不进行掩码处理，就会导致预测信息的泄露。\n\u003Cbr>\n因此，在推理过程中，我们也需要将未来 token 的分数置为 0（以确保训练和推理过程的一致性）。\n\u003Cbr>\n\n当然，如果你和我一样好奇不进行掩码会有什么后果，可以在学完本节内容后查看我在最后一节中进行的额外实验结果。（^_\u003C）\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_85e31c7ffd44.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\n# 首先看一下掩码前的分数矩阵\ndef display_qk_heatmap(qk_per_token):\n    _, ax = plt.subplots()  # 创建一个绘图窗口\n\n    # `imshow` 常用于以二维数组或矩阵的形式显示数据，\n    # 它会将矩阵元素映射为灰度或颜色值，因此可以用来绘制热图。\n    # 先将张量转换回全精度，然后从计算图中分离出来，以避免潜在的梯度计算和存储问题。\n    # 指定使用 'viridis' 颜色映射方案来显示图像（蓝色 -> 绿色 -> 黄色）。\n    im = ax.imshow(qk_per_token.to(float).detach(), cmap='viridis')\n\n    # 设置 x 轴和 y 轴刻度的数量和标签，以确保正确的一一对应关系。\n    ax.set_xticks(range(len(prompt_split_as_tokens)))\n    ax.set_yticks(range(len(prompt_split_as_tokens)))\n    ax.set_xticklabels(prompt_split_as_tokens)\n    ax.set_yticklabels(prompt_split_as_tokens)\n\n    # 在旁边添加一个颜色条。\n    # 指定 `im` 来识别正确的颜色映射和取值范围。\n    # 指定它所属的子图为 `ax`（如果有多个子图，则为 `ax = ax[i]`）。\n    ax.figure.colorbar(im, ax=ax)\n    \ndisplay_qk_heatmap(qk_per_token)\n```\n\n\n    \n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_d4d8169ef08f.png)\n    \n\n\n\n```python\n# 生成掩码矩阵\n# 将需要掩码的位置设置为负无穷，不需要掩码的位置设置为 0。\n# 然后将其加到分数矩阵上，以实现掩码效果（计算 softmax 时，负无穷会趋近于 0）。\n\n# `torch.full` 用于生成具有指定形状和填充值的张量。\n# 这里首先生成一个充满负无穷的 [17x17] 矩阵。\n# 指定该矩阵的设备与之前 token 的设备相同，以确保后续计算不会出错，\n# 例如，如果之前的 token 在 GPU 上，而这里没有指定设备，那么 `mask` 就会在 CPU 上重新创建，相加时就会出错。\nmask = torch.full((len(tokens), len(tokens)), float(\"-inf\"), device=tokens.device)  # [17x17]\n\n# `torch.triu` 用于返回矩阵的上三角部分，并将其余部分置零（使用 `torch.tril` 获取下三角部分）。\n\n# `diagonal` 是对角线的偏移量。当它为 1 时，表示取主对角线上方 1 个位置开始的上三角部分，以避免掩码掉当前 token 自身。\nmask = torch.triu(mask, diagonal=1)  # [17x17]\n\nmask, mask.shape\n```\n\n\n\n\n    (tensor([[0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, 
-inf],\n             [0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf, -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., -inf],\n             [0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.]]),\n     torch.Size([17, 17]))\n\n\n\n\n```python\n# 掩码未来 token 的得分\nqk_per_token_after_masking = qk_per_token + mask  # [17x17] + [17x17] = [17x17]\ndisplay_qk_heatmap(qk_per_token_after_masking)  # 显示掩码后的注意力得分\n```\n\n\n    \n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_43a2d88c2dbe.png)\n    \n\n\n#### 计算最终的注意力权重，即对得分进行 softmax 操作。\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_8574b10f57e4.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\n# 计算注意力权重\n# 即计算得分的 softmax 值。\n# `dim = 1` 表示按行进行 softmax 计算，结果转换为半精度以与后续的值向量保持一致。\nqk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)  # [17x17] -> [17x17]\ndisplay_qk_heatmap(qk_per_token_after_masking_after_softmax)\n```\n\n\n    \n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_7d690f953c6c.png)\n    \n\n\n### 终于！计算单头注意力机制的最终结果！\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_1dac0fb02c00.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n原理：利用之前的注意力权重（范围在 0 到 1 之间），确定每个 token 应该使用每个值向量的多少比例（即对值向量进行加权）。\n\n示例：如果输入包含 3 个 token，第一个 token 的注意力结果可能是：res = 0.6 * value_1 + 0.3 * value_2 + 0.1 * value_3\n\n权重矩阵与值矩阵相乘后，注意力结果的形状为 [17x128]。\n\n\n```python\n# 计算单头注意力的最终结果\nqkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)  # [17x17] x [17x128] = [17x128]\nqkv_attention.shape\n```\n\n\n\n\n    torch.Size([17, 128])\n\n\n\n## 计算多头注意力机制（通过简单循环重复上述过程）\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_c5ab2100efca.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n我们现在得到了第一层第一个头的注意力值。\n\u003Cbr>\n\n接下来需要通过一个循环，对第一层中的每一个头执行与上一单元格完全相同的数学运算。\n\u003Cbr>\u003Cbr>\n\n\u003Ch3>\n值得注意的是，在 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmeta-llama\u002Fllama3\u002Fblob\u002Fmain\u002Fllama\u002Fmodel.py#L90\">官方 Llama3 代码实现\u003C\u002Fa>中，多头注意力的计算采用了一次性矩阵乘法的方式，而不是耗时的 for 循环计算。其一般流程如下：\n\u003C\u002Fh3>\n\n1. 基于矩阵并行性，计算 QKV 向量：[17x4096] × [4096x4096] 或 [4096x1024] = [17x4096] 或 [17x1024]，然后将其重塑为 [32x17x128] 或 [8x17x128]。\n2. 获得 QKV 向量后，将 K 和 V 向量的内部部分复制，使其形状与 Q 向量一致。此时三者的形状均为 [32x17x128]。\n3. 
在计算得分时，使用转置方法交换张量最后两个维度的位置，再通过批量矩阵乘法一次性完成所有头的计算。例如，`torch.matmul(q, k.transpose(1,2)) \u002F head_dim ** 0.5`。此时为 [32x17x128] × [32x128x17] = [32x17x17]。\n4. 其他矩阵计算也遵循同样的原理。\n\n注：上述过程中每一步的矩阵形状变化都是简化版本，仅用于说明以便理解，与官方 Llama3 实现中的变化过程有所不同（官方实现涉及大量的形状变换操作）。\n\n
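在进入逐头循环之前，先给出一个与上述思路对应的向量化示意草图（仅为帮助理解的演示：张量均为随机初始化，`repeat_interleave` 的展开写法、变量名等都是本文为演示所做的假设，并非官方 Llama3 源码，且省略了 RoPE）：\n\n```python\n# 官方向量化思路的极简示意（演示用，非官方实现；权重与输入均为随机张量，省略RoPE）\nimport torch\n\nn_heads, n_kv_heads, head_dim, dim, seq_len = 32, 8, 128, 4096, 17\nx = torch.randn(seq_len, dim)  # 相当于归一化后的输入，[17x4096]\nwq = torch.randn(n_heads * head_dim, dim)  # [4096x4096]\nwk = torch.randn(n_kv_heads * head_dim, dim)  # [1024x4096]\nwv = torch.randn(n_kv_heads * head_dim, dim)  # [1024x4096]\n\n# 1. 一次性算出所有头的QKV，再重塑为多头形式\nq = (x @ wq.T).view(seq_len, n_heads, head_dim).transpose(0, 1)  # [17x4096] -> [32x17x128]\nk = (x @ wk.T).view(seq_len, n_kv_heads, head_dim).transpose(0, 1)  # [17x1024] -> [8x17x128]\nv = (x @ wv.T).view(seq_len, n_kv_heads, head_dim).transpose(0, 1)  # [8x17x128]\n\n# 2. 将KV头各复制4份，与32个查询头对齐（对应GQA中每4个查询头共享一组KV）\nk = k.repeat_interleave(n_heads \u002F\u002F n_kv_heads, dim=0)  # [8x17x128] -> [32x17x128]\nv = v.repeat_interleave(n_heads \u002F\u002F n_kv_heads, dim=0)  # [8x17x128] -> [32x17x128]\n\n# 3. 批量矩阵乘法一次算出所有头的得分，并完成掩码与softmax\nscores = torch.matmul(q, k.transpose(1, 2)) \u002F head_dim ** 0.5  # [32x17x128] x [32x128x17] = [32x17x17]\nmask = torch.triu(torch.full((seq_len, seq_len), float(\"-inf\")), diagonal=1)  # [17x17]，可广播到所有头\nweights = torch.softmax(scores + mask, dim=-1)  # [32x17x17]\n\n# 4. 加权求和后，把32个头的结果拼回一个大矩阵\nout = torch.matmul(weights, v)  # [32x17x128]\nout = out.transpose(0, 1).reshape(seq_len, dim)  # [32x17x128] -> [17x32x128] -> [17x4096]\nprint(out.shape)  # torch.Size([17, 4096])\n```\n\n可以看到，循环中的 32 次小矩阵乘法在这里被替换成了少量批量大矩阵乘法，这正是并行化提速的来源。下面我们仍按循环方式逐头计算，以便于理解。\n\n### 计算每个头的结果\n\n\n```python\n# 计算多头注意力结果\n\n# 即，上一次单头注意力计算过程的循环\nqkv_attention_store = []\n\nfor head in range(n_heads):\n    # 提取当前头对应的QKV权重矩阵\n    q_layer0_head = q_layer0[head]  # [32x128x4096] -> [128x4096]\n    k_layer0_head = k_layer0[head\u002F\u002F4]  # 每4个头共享一个键权重，[8x128x4096] -> [128x4096]\n    v_layer0_head = v_layer0[head\u002F\u002F4]  # 每4个头共享一个值权重，[8x128x4096] -> [128x4096]\n    \n    # 计算XW以得到QKV向量\n    # [17x4096] x [4096x128] = [17x128]\n    q_per_token = torch.matmul(token_embeddings, q_layer0_head.T)\n    k_per_token = torch.matmul(token_embeddings, k_layer0_head.T)\n    v_per_token = torch.matmul(token_embeddings, v_layer0_head.T)\n    \n    # 将位置信息添加到查询向量（RoPE）\n    q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)  # 沿维度方向将向量分成成对的形式，形成维度对。[17x128] -> [17x64x2]\n    q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)  # 转换为复数表示，(x,y) -> (x+yi)。[17x64x2] -> [17x64]\n    q_per_token_as_complex_numbers_rotated = q_per_token_as_complex_numbers * freqs_cis[:len(tokens)]  # 计算(x+yi)*(cosmθ+sinmθi)，完成旋转操作。[17x64] * [17x64] = [17x64]\n    q_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers_rotated)  # 将结果转换回实数表示，(x+yi) -> (x,y)。[17x64] -> [17x64x2]\n    q_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)  # 将结果恢复为原始向量形状，得到最终的查询向量。[17x64x2] -> [17x128]\n\n    # 将位置信息添加到键向量（RoPE）\n    k_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0], -1, 2)  # 沿维度方向将向量分成成对的形式，形成维度对。[17x128] -> [17x64x2]\n    k_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs)  # 转换为复数表示，(x,y) -> (x+yi)。[17x64x2] -> [17x64]\n    k_per_token_as_complex_numbers_rotated = k_per_token_as_complex_numbers * freqs_cis[:len(tokens)]  # 计算(x+yi)*(cosmθ+sinmθi)，完成旋转操作。[17x64] * [17x64] = [17x64]\n    k_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers_rotated)  # 将结果转换回实数表示，(x+yi) -> (x,y)。[17x64] -> [17x64x2]\n    k_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)  # 将结果恢复为原始向量形状，得到最终的键向量。[17x64x2] -> [17x128]\n\n    # 同时计算注意力分数并对其进行归一化（即Q×K\u002Fsqrt(dim)）\n    qk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T)\u002F(head_dim)**0.5  # [17x128] x [128x17] = [17x17]\n    \n    # 对未来 tokens 的分数进行掩码处理\n    mask = torch.full(qk_per_token.shape, float(\"-inf\"), device=tokens.device)  # 创建与注意力分数相同形状的矩阵，填充负无穷，并存储在与其他向量相同的设备上，以防止后续计算出错。[17x17]\n    mask = torch.triu(mask, diagonal=1)  # 保留上三角部分的负无穷，其余部分置为零（即上三角区域代表需要被掩码的未来 tokens）。对角线偏移为1，以避免掩码当前 token 自身。[17x17]\n    qk_per_token_after_masking = qk_per_token + mask  # 将注意力分数与掩码矩阵相加，使分数矩阵的上三角部分变为负无穷，这将在后续的 softmax 操作后趋近于零。[17x17]\n    \n    # 计算注意力权重（即 softmax(score)）\n    # 同时将其转换回半精度（因为稍后会与值向量 v_per_token 相乘，因此数据类型需要一致）。\n    qk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)  # 按行计算 softmax。[17x17]\n    \n    # 计算注意力机制的最终结果（即 softmax(score) × V）\n    qkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)  # [17x17] × [17x128] = [17x128]\n    \n    # 记录该头的结果\n    qkv_attention_store.append(qkv_attention)\n\nlen(qkv_attention_store)\n```\n\n\n\n\n    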
32\n\n\n\n### 将各头的结果合并为一个大矩阵\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_ea75afe08f8d.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n现在我们已经得到了第一层中所有32个头的注意力机制结果。接下来，我们将所有的注意力值合并为一个形状为[17x4096]的大矩阵。\n\u003Cbr>\n我们几乎完成了注意力层的计算 :)\n\n\n```python\n# 合并多头注意力矩阵\nstacked_qkv_attention = torch.cat(qkv_attention_store, dim=-1)  # 沿第二个维度拼接，32x[17x128] -> [17x4096]\nstacked_qkv_attention.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n### 头与头之间的信息交互（线性映射），自注意力层的最后一道工序！\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_996c739db7aa.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\nlayer0 注意力计算的最后一步是进行最终的线性映射，即将组合后的注意力矩阵与输出权重矩阵相乘。\n\n\n```python\n# 加载 layer0 的输出权重矩阵\nw_layer0 = model[\"layers.0.attention.wo.weight\"]  # [4096x4096]\nw_layer0.shape\n```\n\n\n\n\n    torch.Size([4096, 4096])\n\n\n\n这只是一个简单的线性层，所以我们只需要进行矩阵乘法。\n\n\n```python\n# 对注意力矩阵进行线性映射\n# 这就是注意力层的最终输出\nembedding_delta = torch.matmul(stacked_qkv_attention, w_layer0.T)  # [17x4096] x [4096x4096] = [17x4096]\nembedding_delta.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n## 进行残差运算（相加）\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_0399128c7400.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n现在我们已经得到了应用了注意力机制后的输入向量值。此时，我们需要将其与原始输入向量相加（即残差运算，以确保信息不易丢失并缓解梯度消失问题）。\n\n\n```python\n# 将注意力层的输出与原始输入相加，完成残差运算\nembedding_after_edit = token_embeddings_unnormalized + embedding_delta  # [17x4096] + [17x4096] = [17x4096]\nembedding_after_edit.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n## 进行第二次归一化操作\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_872485ee5e22.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n\n```python\n\n# 对残差操作的结果进行归一化\nembedding_after_edit_normalized = rms_norm(embedding_after_edit, model[\"layers.0.ffn_norm.weight\"])  # [17x4096] & [4096] -> [17x4096]\nembedding_after_edit_normalized.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n## 执行前馈神经网络（FFN）层的计算\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_73bafe1bed53.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\u003Cbr>\n在Llama3中，他们使用了SwiGLU前馈网络。这种网络架构可以在模型需要时有效增强非线性特性。\n\u003Cbr>\n如今，这类前馈网络架构在大型语言模型中非常常见。\n\u003Cbr>\u003Cbr>\n\n\u003Ch3>为什么引入非线性层：\u003C\u002Fh3>\n\n- 非线性是神经网络模型能够被视为“通用函数逼近器”的核心原因。在传统的神经网络模型中，我们通过使用非线性激活函数（如Sigmoid、ReLU等）来增加模型的表达能力，使其能够拟合训练数据中隐藏的复杂模式。\n- 然而，在Transformer中，注意力机制本质上是对值向量的线性加权求和（尽管权重是通过Softmax函数的非线性计算得到的，但它仍然是对值的线性加权）。因此，虽然它可以捕捉全局依赖关系，但其输出仍然只是输入的线性组合。此时，Transformer模型实际上缺乏非线性能力。\n- 因此，在自注意力层之后添加一个FFN网络，为模型引入非线性变换能力，从而提升模型对复杂语义关系的建模能力，是十分必要的。\n\n\u003Ch3>通常，引入非线性层可以起到以下作用：\u003C\u002Fh3>\n\n1. 为模型增加非线性能力，以促进模型的学习和训练。\n2. 增强模型的信息抽象能力，使模型能够在逐层学习过程中表示不同层次的数据特征和模式。例如，较低层的网络可以识别基本的语言结构（如词性），而较高层的网络则可以理解更复杂的语义信息（如情感、意图）。\n3. 此外，目前有一种观点认为，注意力层主要用于处理输入上下文交互，而FFN层则是大型语言模型在训练过程中主要存储和记忆通用知识的地方（由于其非线性表示能力），以便在回答输入问题时能够从这些通用知识中找到答案。\n\n\u003Ch3>SwiGLU网络结构：\u003C\u002Fh3>\n\n1. 对输入进行线性变换：$X^\\prime = XW_3$\n2. 门控单元：$GATE = Activation\\\\_Function(XW_1)$，用于有选择地传递信息。也就是说，假设$X^\\prime$中的信息具有不同的重要性，那么应根据门控单元的得分对信息进行加权并传递，从而提高模型的表达能力。\n3. 
使用的激活函数是Swish激活函数（因此该网络被称为SwiGLU，它是Swish激活函数与门控线性单元（GLU）的结合）。公式为：$Swish = X \\cdot \\sigma(\\beta X)$，其中$\\sigma$为Sigmoid激活函数。在SwiGLU中，$\\beta$被设置为1（在原始公式中，它是一个可学习的参数）。\n4. 因此，门控单元的具体计算为：$GATE = XW_1 \\cdot \\sigma(XW_1)$。在PyTorch中，这个激活函数被称为silu，即$GATE = silu(XW_1)$。\n5. 应用门控机制：$X^\\prime = X^\\prime \\cdot GATE$\n6. 再次进行线性变换：$Y = X^\\prime W_2$\n\n\u003Ch3>前馈层隐藏层维度大小的计算（基于Llama3的官方实现过程）：\u003C\u002Fh3>\n\n1. 输入维度为dim = 4096\n2. hidden_dim = 4 * dim = 16384  # 首先将其放大四倍。在初始化Transformer块中的前馈层时，输入的hidden_dim会被乘以四。\n3. hidden_dim = int(2 * hidden_dim \u002F 3) = 10922 # 然后将其放大2\u002F3倍。这种缩放首先在前馈层内部进行。\n4. hidden_dim = int(ffn_dim_multiplier * hidden_dim) = int(1.3 * 10922) = 14198  # 接着再按ffn_dim_multiplier倍数放大。模型配置文件中将ffn_dim_multiplier定义为1.3。\n5. hidden_dim = multiple_of * ((hidden_dim + multiple_of - 1) \u002F\u002F multiple_of) = 1024 * ((14198 + 1024 - 1) \u002F\u002F 1024) = 14336  # 最后调整为multiple_of的整数倍。模型配置文件中将multiple_of定义为1024，以确保模型中所有隐藏层的维度都是1024的倍数，从而提高计算效率。\n6. 最终，我们得到隐藏层的维度大小为14336。\n\n\n```python\n# 计算前馈网络层\n# 隐藏层的维度大小为14336\nw1 = model[\"layers.0.feed_forward.w1.weight\"]  # [14336x4096]\nw3 = model[\"layers.0.feed_forward.w3.weight\"]  # [14336x4096]\nw2 = model[\"layers.0.feed_forward.w2.weight\"]  # [4096x14336]\nprint(w1.shape, w3.shape, w2.shape)\n\n# output = (silu(XW1) * XW3)W2\n# [17x4096] x [4096x14336] x [14336x4096] = [17x4096]\noutput_after_feedforward = torch.matmul(torch.functional.F.silu(torch.matmul(embedding_after_edit_normalized, w1.T)) * torch.matmul(embedding_after_edit_normalized, w3.T), w2.T)\noutput_after_feedforward.shape\n```\n\n    torch.Size([14336, 4096]) torch.Size([14336, 4096]) torch.Size([4096, 14336]) torch.Size([17, 4096])\n\n\n\n## 再次执行残差操作（最终我们得到了Transformer块的最终输出！）\n\n\n```python\n# 将前馈层的输出加到原始输入上，完成残差操作\n# 这就是Transformer块的最终结果\nlayer_0_embedding = embedding_after_edit+output_after_feedforward  # [17x4096] + [17x4096] = [17x4096]\nlayer_0_embedding.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n\u003Ch3>\n终于，我们得到了经过第一层处理后的每个token的新嵌入。\n\u003C\u002Fh3>\n\u003Cbr>\n接下来只需再完成31层即可（只需要一个循环）。\n\u003Cbr>\n你可以想象，这个经过处理的嵌入包含了第一层中提出的token的所有信息。\n\u003Cbr>\n现在，每一层都会对问题中提出的查询进行更复杂的编码。直到最后，我们将得到一个能够掌握下一个所需token所有信息的嵌入。\n\n# 一切都在这里。让我们完成所有32个Transformer块的计算吧。祝阅读愉快 :)\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_578c125085b4.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n是的，就是这样。我们之前所做的所有工作都将在这里一次性呈现出来，以完成每一层的计算。\n\u003Cbr>\n\n\n```python\n# 现在，让我们开始完成所有32个Transformer块的计算吧！\n\n# 使用输入标记的嵌入作为初始输入。\nfinal_embedding = token_embeddings_unnormalized  # [17×4096]\n\n# 对32层Transformer块进行逐层计算\nfor layer in range(n_layers):\n    #########################################################################################################################\n    ################### 第一轮：归一化 - 特征变换 - 残差操作 ###############################\n    \n    ########################### 第一次归一化 ###################################################\n    \n    # 第一次归一化\n    layer_embedding_norm = rms_norm(final_embedding, model[f\"layers.{layer}.attention_norm.weight\"])  # [17×4096] & [4096] -> [17×4096]\n    \n    ################ 第一次特征变换 - 多头自注意力 ########################\n    \n    # 获取当前层注意力机制的qkv权重矩阵\n    q_layer = model[f\"layers.{layer}.attention.wq.weight\"]  # [4096×4096]\n    q_layer = q_layer.view(n_heads, q_layer.shape[0] \u002F\u002F n_heads, dim)  # [32×128×4096]\n    k_layer = model[f\"layers.{layer}.attention.wk.weight\"]  # [1024×4096]\n    k_layer = k_layer.view(n_kv_heads, 
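\n                           # 其中 k_layer.shape[0] \u002F\u002F n_kv_heads = 1024 \u002F\u002F 8 = 128，即每个 KV 头的向量维度\n                           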
k_layer.shape[0] \u002F\u002F n_kv_heads, dim)  # [8×128×4096]\n    v_layer = model[f\"layers.{layer}.attention.wv.weight\"]  # [1024×4096]\n    v_layer = v_layer.view(n_kv_heads, v_layer.shape[0] \u002F\u002F n_kv_heads, dim)  # [8×128×4096]\n    \n    # 用于存储每个头的注意力机制计算结果\n    qkv_attention_store = []\n    \n    # 计算每个头的注意力机制结果\n    for head in range(n_heads):\n        # 提取当前头对应的QKV权重矩阵\n        q_layer_head = q_layer[head]  # [32×128×4096] -> [128×4096]\n        k_layer_head = k_layer[head\u002F\u002F4]  # 每4个头共享一个键权重，[8×128×4096] -> [128×4096]\n        v_layer_head = v_layer[head\u002F\u002F4]  # 每4个头共享一个值权重，[8×128×4096] -> [128×4096]\n        \n        # 计算XW以得到QKV向量\n        # [17×4096] × [4096×128] = [17×128]\n        q_per_token = torch.matmul(layer_embedding_norm, q_layer_head.T)\n        k_per_token = torch.matmul(layer_embedding_norm, k_layer_head.T)\n        v_per_token = torch.matmul(layer_embedding_norm, v_layer_head.T)\n        \n        # 为查询向量添加位置信息（RoPE）\n        q_per_token_split_into_pairs = q_per_token.float().view(q_per_token.shape[0], -1, 2)  # 沿维度方向将向量分成对，形成维度对。[17×128] -> [17×64×2]\n        q_per_token_as_complex_numbers = torch.view_as_complex(q_per_token_split_into_pairs)  # 转换为复数表示，(x,y) -> (x+yi)。[17×64×2] -> [17×64]\n        q_per_token_as_complex_numbers_rotated = q_per_token_as_complex_numbers * freqs_cis  # 计算(x+yi)×(cosθ+sinθi)，完成旋转操作。[17×64] × [17×64] = [17×64]\n        q_per_token_split_into_pairs_rotated = torch.view_as_real(q_per_token_as_complex_numbers_rotated)  # 将结果转回实数表示，(x+yi) -> (x,y)。[17×64] -> [17×64×2]\n        q_per_token_rotated = q_per_token_split_into_pairs_rotated.view(q_per_token.shape)  # 将结果转回原始向量形状，得到最终的查询向量。[17×64×2] -> [17×128]\n        \n        # 为键向量添加位置信息（RoPE）\n        k_per_token_split_into_pairs = k_per_token.float().view(k_per_token.shape[0], -1, 2)  # 沿维度方向将向量分成对，形成维度对。[17×128] -> [17×64×2]\n        k_per_token_as_complex_numbers = torch.view_as_complex(k_per_token_split_into_pairs)  # 转换为复数表示，(x,y) -> (x+yi)。[17×64×2] -> [17×64]\n        k_per_token_as_complex_numbers_rotated = k_per_token_as_complex_numbers * freqs_cis  # 计算(x+yi)×(cosθ+sinθi)，完成旋转操作。[17×64] × [17×64] = [17×64]\n        k_per_token_split_into_pairs_rotated = torch.view_as_real(k_per_token_as_complex_numbers_rotated)  # 将结果转回实数表示，(x+yi) -> (x,y)。[17×64] -> [17×64×2]\n        k_per_token_rotated = k_per_token_split_into_pairs_rotated.view(k_per_token.shape)  # 将结果转回原始向量形状，得到最终的键向量。[17×64×2] -> [17×128]\n        \n        # 计算注意力分数并同时对分数进行归一化（即Q×K\u002F√dim）\n        qk_per_token = torch.matmul(q_per_token_rotated, k_per_token_rotated.T)\u002F(128)**0.5  # [17×128] × [128×17] = [17×17]\n        \n        # 掩码未来token的分数\n        mask = torch.full(qk_per_token.shape, float(\"-inf\"), device=qk_per_token.device)  # 创建与注意力分数相同形状的矩阵，填充负无穷，并存储在与其他向量相同的设备上，以避免后续计算出错。[17×17]\n        mask = torch.triu(mask, diagonal=1)  # 保留上三角部分的负无穷，其余设为0（即上三角区域代表需要掩码的未来token）。对角线偏移1，以避免掩码当前token本身。[17×17]\n        qk_per_token_after_masking = qk_per_token + mask  # 将注意力分数与掩码矩阵相加，使分数矩阵的上三角部分变为负无穷，在后续的softmax操作后会趋近于0。[17×17]\n        \n        # 计算注意力权重（即softmax(分数)）\n        # 同时将其转换为半精度（因为后续会与值向量v_per_token相乘，数据类型需一致）。\n        qk_per_token_after_masking_after_softmax = torch.nn.functional.softmax(qk_per_token_after_masking, dim=1).to(torch.bfloat16)  # 按行计算softmax。[17×17]\n        \n        # 计算注意力机制的最终结果（即softmax(分数) × V）\n        qkv_attention = torch.matmul(qk_per_token_after_masking_after_softmax, v_per_token)  # [17×17] × [17×128] = [17×128]\n        \n        # 记录该头的结果\n    
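    # 到这里，当前层的 32 个头各自得到一个 [17×128] 的注意力结果\n    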
    qkv_attention_store.append(qkv_attention)\n    \n    # 合并多头注意力结果\n    stacked_qkv_attention = torch.cat(qkv_attention_store, dim=-1)  # 沿第二维度合并，即32×[17×128] -> [17×4096]\n    \n    # 对结果进行线性映射，生成最终的多头自注意力机制结果\n    o_layer = model[f\"layers.{layer}.attention.wo.weight\"]\n    embedding_delta = torch.matmul(stacked_qkv_attention, o_layer.T)  # [17×4096] × [4096×4096] = [17×4096]\n\n########################### 第一个残差操作 ##############################################\n\n    # 第一个残差操作\n    # 将注意力层的输出与原始输入相加，完成残差连接\n    embedding_after_edit = final_embedding + embedding_delta  # [17x4096] + [17x4096] = [17x4096]\n    \n    \n    #########################################################################################################################\n    #################### 第二轮：归一化 - 特征变换 - 残差操作 ##############################\n    \n    ########################### 第二次归一化 ##################################################\n    \n    # 第二次归一化\n    embedding_after_edit_normalized = rms_norm(embedding_after_edit, model[f\"layers.{layer}.ffn_norm.weight\"])  # [17x4096] & [4096] -> [17x4096]\n    \n    ################## 第二次特征变换 - 前馈网络 ##########################\n    \n    # 加载前馈网络（SwiGLU）的参数矩阵\n    w1 = model[f\"layers.{layer}.feed_forward.w1.weight\"]  # [14336x4096]\n    w3 = model[f\"layers.{layer}.feed_forward.w3.weight\"]  # [14336x4096]\n    w2 = model[f\"layers.{layer}.feed_forward.w2.weight\"]  # [4096x14336]\n    \n    # 计算前馈网络的结果（输出 = (silu(XW1) * XW3)W2）\n    # [17x4096] x [4096x14336] x [14336x4096] = [17x4096]\n    output_after_feedforward = torch.matmul(torch.functional.F.silu(torch.matmul(embedding_after_edit_normalized, w1.T)) * torch.matmul(embedding_after_edit_normalized, w3.T), w2.T)\n    \n    ########################### 第二次残差操作 ##############################################\n    \n    # 第二次残差操作，得到当前Transformer块的最终输出结果\n    # 将前馈层的输出与原始输入相加，完成残差连接\n    final_embedding = embedding_after_edit+output_after_feedforward  # [17x4096] + [17x4096] = [17x4096]\n```\n\n\n\n# 让我们完成最后一步，预测下一个标记\n\n现在我们已经得到了最终的嵌入表示，其中包含了预测下一个标记所需的所有信息。\n\u003Cbr>\n这个嵌入的形状与输入标记嵌入的形状相同，都是[17x4096]，其中17是标记的数量，4096是嵌入的维度。\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_93e18e3d2e08.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n## 首先，对最后一个Transformer层的输出进行最后一次归一化\n\n\n```python\n# 在整个模型中执行最后一次归一化\nfinal_embedding = rms_norm(final_embedding, model[\"norm.weight\"])  # [17x4096] & [4096] -> [17x4096]\nfinal_embedding.shape\n```\n\n\n\n\n    torch.Size([17, 4096])\n\n\n\n## 然后，基于最后一个标记对应的嵌入进行预测（通过线性映射到词汇表维度）\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_69eadbc654a8.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n\u003Cbr>\n我们将使用输出解码器（一个线性映射层）将最后一个标记的嵌入向量转换为下一个标记的预测结果（维度为词汇表大小。如果我们对结果应用softmax函数，每个维度的值就代表下一个标记属于该词的概率）。\n\u003Cbr>\u003Cbr>\n\n为什么我们只使用最后一个标记的输出向量来预测下一个标记呢？\n\u003Cbr>\n因为在训练过程中，模型的目标是根据当前标记及其之前的所有标记来预测下一个标记。因此，每个标记对应的输出向量用于预测它自己之后的下一个标记，而不是整个输入序列的下一个标记。\n\u003Cbr>\u003Cbr>\n\n在我们的示例中，我们希望答案是42 :)\n\u003Cbr>\n注：42是《银河系漫游指南》一书中“生命、宇宙以及任何事情的终极问题的答案”。大多数现代大型语言模型都会回答42，这将验证我们整个代码的正确性！祝我们好运 :)\n\n\n```python\n# 执行最后一次线性映射，将嵌入映射到词汇表维度大小，作为下一个标记的预测\nlogits = torch.matmul(final_embedding[-1], model[\"output.weight\"].T)  # [17x4096] -> [4096] -> [4096] x [4096x128256] = [128256]\nlogits.shape\n```\n\n\n\n\n    torch.Size([128256])\n\n\n\n## 这就是预测结果！\n\n\n```python\n# 提取概率最高的维度对应的id，\n# 就是预测的下一个标记的id\nnext_token = 
torch.argmax(logits, dim=-1)  # 获取最大值对应的索引，即预测的下一个标记id。[128256] -> [1]\nnext_token\n```\n\n\n\n\n    tensor(2983)\n\n\n\n\n```python\n# 根据预测的id，还原为具体的预测值\ntokenizer.decode([next_token.item()])\n```\n\n\n\n\n    '42'\n\n\n\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_b7e00f296458.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\n# 让我们深入探讨一下，不同的嵌入或标记掩码策略可能会如何影响预测结果 :)\n\n现在我们已经得到了最终的预测结果。如果你仍然感兴趣，不妨探索一下之前提到的一些问题~\n\u003Cbr>\n\n我们将简要探讨三种情况：\n1. 除了top-1结果之外，当前预测中还预测了哪些内容，即top-k结果？\n2. 如果我们使用其他标记的输出嵌入来进行预测，会得到什么结果？\n3. 如果在之前的注意力计算中没有对未来的标记进行掩码，预测结果会有什么不同？\n\n\n```python\n# 首先来看看top-k预测结果\nlogits_sort, logits_idx = torch.sort(logits, dim=-1, descending=True)  # 将预测概率最高的标记放在最前面，[128256]\n[tokenizer.decode([i]) for i in logits_idx[:10]]  # 查看概率最高的前10个结果\n```\n\n\n\n\n    ['42', '6', '43', '41', '4', '1', '45', '3', '2', '46']\n\n\n\n\n```python\n# 接下来，让我们看看使用其他标记的嵌入进行预测能得到什么\nlogits_all_token = torch.matmul(final_embedding, model[\"output.weight\"].T)  # 将嵌入映射到与词汇表相同大小，[17x4096] x [4096x128256] = [17x128256]\nlogits_all_token_sort, logits_all_token_idx = torch.sort(logits_all_token, dim=-1, descending=True)  # 将预测概率最高的标记放在最前面，[17x128256]\n\nprint('输入标记:', prompt_split_as_tokens)  # 显示输入标记，[17]\n\n# 根据每个标记的输出嵌入显示下一个标记预测的结果\nfor i in range(len(final_embedding)):\n    print(f'基于第{i+1}个标记的预测结果:', [tokenizer.decode([j]) for j in logits_all_token_idx[i][:10]])  # 输出概率最高的前10个结果\n    \n_=\"\"\"\n可以看出，当基于每个标记进行预测时，预测结果是“当前标记”之后下一个标记的可能结果，\n而不是对整个完整输入的预测结果。\n因此，在实际预测中，只会使用最后一个标记的嵌入来进行预测。\n\"\"\"\n```\n\n    输入标记: ['\u003C|begin_of_text|>', 'the', ' answer', ' to', ' the', ' ultimate', ' question', ' of', ' life', ',', ' the', ' universe', ',', ' and', ' everything', ' is', ' ']\n    基于第1个标记的预测结果: ['Question', 'def', '#', 'The', 'import', 'Tags', 'A', 'package', 'Home', 'I']\n    基于第2个标记的预测结果: [' ', ' best', ' first', ' most', ' new', ' world', ' last', ' same', ' way', ' number']\n    基于第3个标记的预测结果: [' to', ' is', ' was', ' of', ' lies', ',', ' for', ' you', ' key', ' will']\n    基于第4个标记的预测结果: [' the', ' this', ' your', ' all', ' that', ' a', ' my', ' life', ' \"', ' everything']\n    基于第5个标记的预测结果: [' question', ' problem', ' above', ' ultimate', ' first', ' r', ' following', ' questions', ' most', ' previous']\n    基于第6个标记的预测结果: [' question', ' questions', ' mystery', '\\xa0', ' quest', '\\n', ' life', ' philosophical', ' qu', ' problem']\n    基于第7个标记的预测结果: [' of', '\\n', ' to', ' is', '?\\n', ',', '.\\n', ':', '...']\n    基于第8个标记的预测结果: [' life', ' Life', ' the', '\\xa0', ' everything', ' existence', '\\n', ' LIFE', ' all', ' human']\n    基于第9个标记的预测结果: [',', ' the', '\\n', ' and', ' is', ',\\n', '.\\n', '?\\n', '...']\n    基于第10个标记的预测结果: [' the', ' universe', ' and', ' etc', '\\xa0', ' is', ' death', ' of', ' or', ' everything']\n    基于第11个标记的预测结果: [' universe', ' Universe', '\\n', '\\xa0', ' un', ' univers', ' uni', ' cosmos', ' universal', ' u']\n    基于第12个标记的预测结果: [',', ' and', ' &', '\\n', ',\\n', ' ,', '...', ',and', '...', '\\xa0']\n    基于第13个标记的预测结果: [' and', ' everything', ' &', ' the', ' etc', '\\xa0', ' is', ' or', ' ...\\n', ' an']\n    基于第14个标记的预测结果: [' everything', '\\xa0', ' the', ' every', '\\n', ' ever', ' all', ' Everything', ' EVERY', '...']\n    基于第15个标记的预测结果: ['\\n', ' is', '.\\n', '.', '?\\n', ',', ' (', '\\n\\n', '...', '\\n', ' in']\n    基于第16个标记的预测结果: [' ', '\\n', '...', '...', ':', ' forty', ' not', ' \"', '…', ' a']\n    基于第17个标记的预测结果: ['42', '6', '43', '41', '4', '1', '45', '3', '2', 
'46']\n\n\n\n```python\n# 最后，让我们看看在计算注意力时不屏蔽未来标记时，预测结果会是什么样子\n# 此时，基于每个标记的预测结果如下\n# 可以看出，由于可以看到未来的标记，每个标记的嵌入能够更准确地预测“它后面的下一个标记”（有点像“作弊”）\n\n_=\"\"\"\n输入标记: ['\u003C|begin_of_text|>', 'the', ' answer', ' to', ' the', ' ultimate', ' question', ' of', ' life', ',', ' the', ' universe', ',', ' and', ' everything', ' is', ' ']\n基于第1个标记的预测结果: [':\u002F\u002F', '.Forms', '_REF', ' Angeles', '.swing', '', 'php', 'во', 'ysics', '']\n基于第2个标记的预测结果: [' answer', ' Hitch', ' universe', ' question', ' ultimate', ' meaning', ' hitch', ' Universe', ' Answer', ' reason']\n基于第3个标记的预测结果: [' to', ' is', ',', ':', ' was', '\\n', ' ', ' (', '\\n\\n', ' of']\n基于第4个标记的预测结果: [' the', ' life', ' this', ' which', ' everything', ' that', ' how', ' why', ' ', ' all']\n基于第5个标记的预测结果: [' ultimate', ' question', ' great', ' meaning', ' universe', ' Ultimate', ' everything', ' life', ' holy', ' greatest']\n基于第6个标记的预测结果: [' question', ' answer', ' is', ' was', '\\n', ' questions', ' mystery', '\\n\\n', ' what', ' Question']\n基于第7个标记的预测结果: [' of', ' is', '\\n', ',', ' about', ':', ' to', ' in', ' (', '\u003C|end_of_text|>']\n基于第8个标记的预测结果: [' life', ' existence', ' everything', ' Life', ' the', ' death', ' time', ' all', ' why', ' which']\n基于第9个标记的预测结果: [',', ' is', ' the', '\\n', ':', ' (', '...', ' and', ' ,', ' -']\n基于第10个标记的预测结果: [' the', ' and', ' is', ' death', ' The', ' which', ' or', '\\xa0', ' existence', ' don']\n基于第11个标记的预测结果: [' universe', ' answer', ' cosmos', ' world', ' existence', ' Universe', ' everything', ' un', ' meaning', ' question']\n基于第12个标记的预测结果: [',', ' and', ' is', ' &', '\\n', ' ,', '.', '...', ' (', ' ']\n基于第13个标记的预测结果: [' and', ' &', ' don', ' the', ' is', ' a', ' or', ' Douglas', '\\xa0', '\u003C|end_of_text|>']\n基于第14个标记的预测结果: [' everything', ' dough', ' don', ' ever', ' deep', ' Douglas', ' the', ' every', ' all', ' death']\n基于第15个标记的预测结果: ['\\n', ' is', ',', '.', ' ', ' (', ':', '\u003C|end_of_text|>', '\\n\\n', '.\\n']\n基于第16个标记的预测结果: [' ', '\\n', ' forty', '...', ' \"', '42', ' the', ':', '\\xa0', ' to']\n基于第17个标记的预测结果: ['42', '6', '4', '41', '1', '2', '3', '7', '5', '43']\n\"\"\"\n```\n\n# 需要预测多个标记吗？只需使用KV缓存即可！（这真的让我费了很大的劲才弄清楚。Orz）\n\n\n\u003Ch3>如何连续预测多个标记\u003C\u002Fh3>\n\n现在，我们已经完成了对输入文本的下一个词的预测。但如果我们的预期输出需要多个标记呢？\n\u003Cbr>\n例如，在实际的大模型应用中，模型通常不会只输出一个词，而是常常会输出一段文字，甚至是非常长的一段文本。这种能力是如何实现的呢？\n\u003Cbr>\n其实很简单：我们只需要反复调用大模型的预测过程，逐步生成完整的句子或段落。\n\u003Cbr>\n这个过程就像“滚雪球”一样：每预测出一个词，我们就把这个词添加到当前的输入序列中，然后再次调用模型进行新一轮的预测。当遇到停止符号（在Llama3中是一个特殊的标记“\u003C|end_of_text|>”）或者达到最大长度限制（超参数max_seq_len）时，预测就会停止。\n\u003Cbr>\u003Cbr>\n这听起来效率不高吗？确实如此！\n\u003Cbr>\n这就是为什么会有像KV缓存这样广为人知的优化机制。通过缓存历史标记的KV向量，我们可以减少每次输入和计算的负担，从而大幅提升推理效率。\n\u003Cbr>\n得益于缓存机制，当我们使用大型模型进行推理时，你可能会注意到：等待第一个标记输出往往是最耗时的阶段。但一旦第一个标记被输出，后续标记的输出速度就会显著加快。\n\u003Cbr>\u003Cbr>\n\n\u003Ch3>KV缓存的优缺点\u003C\u002Fh3>\n\n**优点**：在连续预测时，我们每次只需输入新标记，而不需要输入整个文本序列。这大大提高了推理过程中的计算速度。\n\u003Cbr>\n**缺点**：由于引入了缓存机制，推理过程中会占用更多的内存资源。\n\u003Cbr>\u003Cbr>\n\n\u003Ch3>KV缓存的原理推导\u003C\u002Fh3>\n\nKV缓存源于对上述矩阵计算过程的观察与分析。通过分析每个输入标记的计算步骤，我们可以发现：在大多数计算环节中，各个标记的计算其实是相对独立的，很少涉及与其他标记的交互。只有在计算注意力机制时，才会出现标记之间的相互作用，因此需要缓存历史的KV向量。\n\u003Cbr>\n\n\u003Ch3>以下是KV缓存的具体推导逻辑：\u003C\u002Fh3>\n\n1. **前提**：要预测下一个标记，我们只需要获取最后一个标记的输出结果（就像我们在预测章节中所做的那样）。\n2. **非注意力部分只需计算新标记**：除了注意力计算之外，其他所有部分的计算都是独立于各个标记的。因此，我们只需要计算新标记，而无需输入历史标记（下文将进一步展开分析）。\n3. **注意力部分也只需计算新标记**：在注意力层中，由于掩码机制的作用，历史标记的输出结果不会受到未来新标记的影响。因此，它们在每一层的输入和输出都是固定的，也就是说，历史标记的QKV向量不会因为新标记的加入而改变。所以，我们只需要计算新标记的注意力即可。\n4. **计算新标记的注意力机制**：注意力层的作用是让标记获取历史标记的上下文信息。因此，对于每一个新标记，我们需要使用所有标记的值向量进行加权求和。这就要求我们必须存储历史标记的值向量。\n5. 
**计算新标记的注意力权重**：正如第4点所述，我们还需要先获得新标记与历史标记之间的重要性信息，即注意力权重。为此，我们需要将新标记的查询向量与所有标记的键向量相乘。因此，我们也需要存储历史标记的键向量。\n6. **KV缓存的形成**：由第4和第5点可知，我们需要存储历史标记的KV向量；而历史标记的查询向量在后续预测中不会再被用到，因此无需存储。这就是KV缓存的由来。\n7. **KV缓存的效率**：根据第3点，历史的KV向量不会发生变化。因此，在连续预测过程中，缓存可以增量更新，而无需修改历史内容。这样一来，每次预测时，我们只需输入并计算新添加标记的结果，而不必再以完整序列作为输入，从而极大地提升了推理效率。\n\n\u003Ch3>补充：KV缓存中标记计算独立性的分析\u003C\u002Fh3>\n\n**除注意力层外的所有组件（彼此之间无交互）**：\n1. **两次归一化**：每个标记向量都在其自身的特征维度上进行归一化，不涉及其他标记。\n2. **两次残差连接（加法）**：每个标记向量将其自身输出结果加回到自己身上，也不涉及其他标记。\n3. **前馈网络（FFN）**：每个标记向量都乘以相同的权重矩阵W1、W2、W3来得到结果，过程中并不使用其他标记。设想如果输入标记的数量为17个，那么FFN的计算可以简化为：[17×4096] × [4096×14336] × [14336×4096] = [17×4096]。这实际上等同于每次输入一个标记，然后将17个结果拼接成一个矩阵，即：17次（[1×4096] × [4096×14336] × [14336×4096] = [1×4096]）= 17×[1×4096] => [17×4096]。因此，在前馈层中，每个标记的计算实际上并没有与其他标记发生交互。\n\n**注意力层（仅存在新标记与历史标记之间的单向交互）**：\n1. **计算QKV向量**：每个标记向量都乘以相同的QKV权重矩阵来得到结果，不涉及其他标记。\n2. **向QK向量中添加位置信息**：每个标记向量都基于自己的位置独立进行旋转操作，不依赖于其他标记的具体内容。\n3. **计算注意力权重**：注意力权重表示每个标记与其之前所有历史标记之间的相关性，且与未来的标记无关。因此，历史标记的结果不受新标记的影响；而新标记则需要历史标记的键向量缓存。\n4. **计算注意力机制的结果**：注意力机制根据注意力权重对值向量进行加权求和。因此，与上一点的结论类似，历史标记的结果同样不受新标记的影响；而新标记则需要历史标记的值向量缓存。\n\u003Cbr>\u003Cbr>\n\n\u003Ch3>基于KV缓存的注意力计算流程\u003C\u002Fh3>\n\n为了清晰地展示计算过程，我们仅推导单头场景（将其扩展到多头场景的原理和过程与之前的多头注意力实现完全相同）：\n1. 假设历史输入标记为 $S_1$，长度为 N。基于 KV 缓存，我们将存储每个头的 KV 结果矩阵。单个头的形状为 [Nxhead_dim] = [Nx128]。\n2. 假设新添加的输入标记为 $S_2$，长度为 M（可以是新预测的标记、新一轮用户对话的输入，或其他任何场景）。\n3. 计算新标记的 QKV 向量：$Q,K,V = S_2W_{Q,K,V}$ => [Mx4096] x [4096x128] = [Mx128]。\n4. 为 QK 向量加入位置信息：新标记的位置应从 N + 1 开始，而非从 0 开始。[Mx128] -> [Mx128]。\n5. 将新的 KV 值添加到 KV 缓存中，得到更新后的 KV 矩阵，即 [Nx128] -> [(N + M)x128]。\n6. 计算新标记的注意力权重：Attention_weight = softmax(QK^T\u002Fsqrt(d) + mask) => [Mx128] x [128x(N + M)] = [Mx(N + M)]。\n7. 计算新标记的注意力机制最终结果：Attention_weight x V => [Mx(N + M)] x [(N + M)x128] = [Mx128]。\n8. 将每个头的结果拼接起来，并进行线性映射，得到注意力层的最终输出，其形状为 32x[Mx128] -> [Mx4096]。\n\u003Cbr>\u003Cbr>\n\n由于我们之前的学习过程已经相当全面，这里不再给出完整的优化实现（如果你感兴趣，可以参考 Llama 3 的官方代码，其实现相对简单），下文仅附上一个帮助理解的简化示意。就像之前提到的多头注意力并行计算一样，知道计算过程可以被优化就足够了~\n\n
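下面是一个极简的单头 KV 缓存示意代码。注意这只是帮助理解的草图，并非官方实现：为了突出缓存机制本身，这里使用随机初始化的玩具权重、省略了RoPE位置编码与多头拼接，其中 attention_with_cache 等名称均为演示而假设。它验证了“只输入新标记并增量更新缓存”与“整段重新计算”的结果完全一致：\n\n```python\nimport torch\n\ntorch.manual_seed(0)\ndim, head_dim = 16, 16  # 玩具维度（真实的Llama3中为4096和128）\nwq = torch.randn(head_dim, dim)  # 假设的单头QKV权重，对应前文权重矩阵的一个切片\nwk = torch.randn(head_dim, dim)\nwv = torch.randn(head_dim, dim)\n\ndef attention_with_cache(x_new, k_cache, v_cache):\n    # 只输入新标记x_new（[M×dim]），结合缓存完成注意力计算\n    q = x_new @ wq.T  # 仅计算新标记的查询向量，[M×head_dim]\n    k_cache = torch.cat([k_cache, x_new @ wk.T], dim=0)  # 追加新K：[N×head_dim] -> [(N+M)×head_dim]\n    v_cache = torch.cat([v_cache, x_new @ wv.T], dim=0)  # 追加新V\n    scores = q @ k_cache.T * head_dim ** -0.5  # [M×(N+M)]\n    N = k_cache.shape[0] - x_new.shape[0]\n    # 掩码：第i个新标记只能看到N个历史标记和不晚于自己的新标记\n    mask = torch.triu(torch.full_like(scores, float(\"-inf\")), diagonal=N + 1)\n    out = torch.softmax(scores + mask, dim=-1) @ v_cache  # [M×head_dim]\n    return out, k_cache, v_cache\n\n# 连续预测场景：先输入前2个标记，再增量输入第3个标记\nxs = torch.randn(3, dim)  # 3个标记（归一化后）的嵌入\nk_cache, v_cache = torch.empty(0, head_dim), torch.empty(0, head_dim)\nout1, k_cache, v_cache = attention_with_cache(xs[:2], k_cache, v_cache)\nout2, k_cache, v_cache = attention_with_cache(xs[2:], k_cache, v_cache)\n\n# 对照组：像前文那样一次性输入全部3个标记重新计算\nq, k, v = xs @ wq.T, xs @ wk.T, xs @ wv.T\nscores = q @ k.T * head_dim ** -0.5\nmask = torch.triu(torch.full_like(scores, float(\"-inf\")), diagonal=1)\nfull = torch.softmax(scores + mask, dim=-1) @ v\nprint(torch.allclose(full[2:], out2, atol=1e-5))  # True：增量计算与整段重算结果一致\n```\n\n\n\n# 感谢大家。感谢你们持续的学习。爱你们 :)\n\n我们的学习到这里就结束了。希望你也享受了这段阅读的过程！\n\n\u003Ca id=\"from_me\">\u003C\u002Fa>\n\n## 来自我\n如果你看到了这篇作品，感谢你的信任，也感谢你一直学到这里。我很高兴能对你有所帮助~\n\u003Cbr>\n\n如果你想支持我的工作\n1. 给它点个赞⭐~ :)\n2. 请我喝杯咖啡~ [https:\u002F\u002Fko-fi.com\u002Ftherealoliver](https:\u002F\u002Fko-fi.com\u002Ftherealoliver)\n\n\u003Cbr>\n\n## 来自前作作者\n\n如果你想支持我的工作\n\n1. 在 Twitter 上关注我 https:\u002F\u002Ftwitter.com\u002Fnaklecha \n2. 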
或者，请我喝杯咖啡 [https:\u002F\u002Fwww.buymeacoffee.com\u002Fnaklecha](https:\u002F\u002Fwww.buymeacoffee.com\u002Fnaklecha)\n\n说实话，如果你能看到这里，就已经让我很开心了 :)\n\n是什么激励着我呢？\n\n我和朋友们有一个使命——让科研更加普及！我们创建了一个名为 A10 的研究实验室——[AAAAAAAAAA.org](http:\u002F\u002Faaaaaaaaaa.org\u002F)\n\nA10 的 Twitter 账号——https:\u002F\u002Ftwitter.com\u002Faaaaaaaaaaorg\n\n我们的理念如下：\n\u003Cdiv>\n    \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_readme_3b5392087d3b.png\" width=\"600px\"\u002F>\n\u003C\u002Fdiv>\n\u003Cbr>\u003Cbr>\n再次感谢原作者提供的基础代码和插图，它们也让我学到了很多。\n\u003Cbr>\u003Cbr>\n\n\n# 许可证\n\n版权所有 (c) 2025 张金龙 (https:\u002F\u002Fgithub.com\u002Ftherealoliver)\n\n版权所有 (c) 2024 尼山特·阿克莱查\n\nMIT","# Deepdive-llama3-from-scratch 快速上手指南\n\n本项目是基于 `naklecha\u002Fllama3-from-scratch` 的增强版，旨在通过从零实现每一个张量运算和矩阵乘法，帮助开发者深入理解 Llama3 模型的底层原理、推理过程及 KV-Cache 机制。项目包含详尽的中文代码注释、维度追踪推导及原理讲解。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux, macOS 或 Windows (推荐 Linux)\n*   **Python 版本**: Python 3.8+\n*   **硬件要求**: 建议拥有 NVIDIA GPU 以加速推理（CPU 亦可运行但速度较慢），需至少 16GB 内存以加载 8B 模型权重。\n*   **前置依赖**:\n    *   `torch`: 用于构建模型和张量计算\n    *   `tiktoken`: OpenAI 开源的分词库\n    *   `matplotlib`: 用于可视化\n    *   `huggingface-hub`: 用于自动下载模型（可选）\n\n## 安装步骤\n\n### 1. 克隆项目\n首先将项目代码克隆到本地：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Ftherealoliver\u002FDeepdive-llama3-from-scratch.git\ncd Deepdive-llama3-from-scratch\n```\n\n### 2. 安装 Python 依赖\n使用 pip 安装所需库。国内用户推荐使用清华或阿里镜像源加速安装：\n\n```bash\npip install torch tiktoken matplotlib huggingface-hub -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 3. 获取模型权重\n本项目需要 Meta 官方的 Llama3 模型权重文件（`Meta-Llama-3-8B`）。\n\n**方式 A：自动下载（推荐）**\n如果您已申请并获得 Hugging Face 的访问权限，项目代码中集成了基于 Hugging Face 的下载逻辑，首次运行时会自动拉取。请确保已登录 HF：\n```bash\nhuggingface-cli login\n```\n\n**方式 B：手动下载**\n您也可以从官方渠道下载后放入指定目录：\n1. 访问 [Meta Llama 下载页](https:\u002F\u002Fllama.meta.com\u002Fllama-downloads\u002F) 或通过 ModelScope 等国内镜像源下载 `Meta-Llama-3-8B`（下方附有一个命令行下载示例）。\n2. 确保使用的是 `original` 文件夹下的原始模型文件。\n3. 将下载的文件放置在项目根目录或代码中指定的路径下。\n\n> **注意**：本项目直接使用原始模型文件（即下载包中的 `original` 文件夹内容），无需转换为 safetensors 或其他格式。\n\n
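以下是一个通过 huggingface-cli 手动拉取 original 权重的示例命令（仅供参考的示意：假设你已获得 meta-llama\u002FMeta-Llama-3-8B 仓库的访问授权并完成登录，本地目标目录名可自行调整）：\n\n```bash\n# 仅下载 original 文件夹中的原始权重到本地目录（该仓库为受限仓库，需先获批并登录）\nhuggingface-cli download meta-llama\u002FMeta-Llama-3-8B --include \"original\u002F*\" --local-dir Meta-Llama-3-8B\n```\n\n## 基本使用\n\n本项目核心代码位于主脚本中（通常为 `deepdive_llama3.py` 或 README 中指向的主文件），它以逐步执行的方式展示了从加载分词器到预测下一个 token 的全过程。\n\n### 运行示例\n\n直接运行主脚本即可启动完整的推导流程：\n\n```bash\npython deepdive_llama3.py\n```\n\n**执行流程说明：**\n脚本将按顺序执行以下步骤，并在终端输出详细的维度变化和中间结果：\n1.  **加载分词器 (Tokenizer)**: 初始化 BPE 分词器，处理特殊 token。\n2.  **文本嵌入 (Embeddings)**: 将输入文本转换为 token ID 序列，再映射为向量。\n3.  **构建 Transformer 块**:\n    *   执行 RMS 归一化。\n    *   从头实现单头注意力机制（计算 QKV、RoPE 旋转位置编码、Mask 掩码、Softmax）。\n    *   扩展至多头注意力机制并合并结果。\n    *   执行残差连接与前馈神经网络 (FFN) 计算。\n4.  **循环推理**: 完成全部 32 个 Transformer 块的计算。\n5.  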
**预测输出**: 对最后一个 token 进行归一化和线性映射，输出下一个 token 的预测概率。\n\n### 代码探索建议\n由于本项目主打“教学”与“原理剖析”，建议开发者直接打开核心代码文件阅读：\n*   关注代码中丰富的**中文注释**，解释了每一步“做什么”以及“为什么”。\n*   观察变量名中携带的**维度信息**，辅助理解矩阵形状的变化。\n*   尝试修改输入文本，观察不同策略对预测结果的影响。\n\n---\n*提示：项目中包含关于 KV-Cache 的独立推导章节，如需研究多 token 连续预测优化，请参阅代码中对应的 \"KV-Cache\" 部分。*","某高校 AI 实验室的研究生团队正试图从零复现 Llama3 推理过程，以深入理解大模型底层机制并完成课程项目。\n\n### 没有 Deepdive-llama3-from-scratch 时\n- 面对原始代码中复杂的矩阵运算，学生难以追踪维度变化，常因形状不匹配导致调试失败，耗费大量时间在排查基础错误上。\n- 缺乏对“为什么这样做”的原理解释，只能机械地复制代码，无法真正掌握注意力机制和 RMS 归一化等核心设计思想。\n- 关于 KV-Cache 的资料零散且晦涩，团队在推导缓存优化逻辑时陷入瓶颈，难以理解其在加速推理中的具体作用。\n- 英文文档配合机器翻译的代码注释存在表达歧义，初学者阅读障碍大，团队协作沟通成本极高。\n\n### 使用 Deepdive-llama3-from-scratch 后\n- 代码中详尽的维度追踪注释让每一步矩阵变换清晰可见，学生能迅速定位计算逻辑，调试效率提升数倍。\n- 丰富的原理推导章节不仅展示了代码实现，更深度剖析了设计初衷，帮助团队成员从根本上吃透了模型架构。\n- 新增的 KV-Cache 专属推导章将抽象概念具象化，团队顺利完成了从理论推导到代码落地的全过程，掌握了加速推理的关键。\n- 提供地道的中英文双语代码与文档，消除了语言隔阂，使得组内不同英语水平的成员都能无障碍协作，学习曲线显著平缓。\n\nDeepdive-llama3-from-scratch 通过“知其然更知其所以然”的深度教学，将黑盒般的模型推理转化为透明、可掌控的学习路径。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftherealoliver_Deepdive-llama3-from-scratch_cf0397cb.png","therealoliver","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Ftherealoliver_10573385.png","AI learner, developer in PhD\r\n\r\ntherealoliverzhang@gmail.com",null,"haidian, beijing","https:\u002F\u002Fgithub.com\u002Ftherealoliver",[82],{"name":83,"color":84,"percentage":85},"Jupyter Notebook","#DA5B0B",100,629,52,"2026-03-21T08:24:58","MIT","未说明","需要 NVIDIA GPU 以加载和运行 Meta-Llama-3-8B 模型（具体显存需求取决于模型大小，8B 模型通常建议 16GB+ 显存），CUDA 版本未说明","建议 16GB+（用于加载 8B 参数模型及中间计算）",{"notes":94,"python":90,"dependencies":95},"本项目需手动下载 Meta-Llama-3-8B 原始权重文件（约 5GB+），支持通过 HuggingFace 或官网下载。代码从零实现 Llama3 推理过程，包含详细的矩阵维度追踪和原理注释。注意：README 中提到的架构图存在一处关于残差连接输入的小错误，已在文中更正说明。",[96,97,98,99],"torch","tiktoken","matplotlib","json",[26,13,15],[102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121],"inference","kv-cache","llama","llms","attention","attention-mechanism","gpt","language-model","llm-configuration","mask","multi-head-attention","positional-encoding","residuals","rms","rms-norm","rope","swiglu","tokenizer","transformer","rotary-position-encoding","2026-03-27T02:49:30.150509","2026-04-06T07:10:08.154053",[],[]]