[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-gordicaleksa--pytorch-original-transformer":3,"tool-gordicaleksa--pytorch-original-transformer":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":10,"last_commit_at":50,"category_tags":51,"status":17},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth 
Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[14,15,13,52],"视频",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[14,35],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":77,"owner_email":78,"owner_twitter":79,"owner_website":80,"owner_url":81,"languages":82,"stars":91,"forks":92,"last_commit_at":93,"license":94,"difficulty_score":10,"env_os":95,"env_gpu":96,"env_ram":97,"env_deps":98,"category_tags":109,"github_topics":110,"view_count":32,"oss_zip_url":78,"oss_zip_packed_at":78,"status":17,"created_at":123,"updated_at":124,"faqs":125,"releases":126},4317,"gordicaleksa\u002Fpytorch-original-transformer","pytorch-original-transformer","My implementation of the original transformer model (Vaswani et al.). I've additionally included the playground.py file for visualizing otherwise seemingly hard concepts. 
Currently included IWSLT pretrained models.","pytorch-original-transformer 是经典 Transformer 模型（Vaswani 等人提出）的 PyTorch 复现版本，旨在降低学习门槛，帮助用户轻松上手并深入理解这一革命性架构。它解决了原始论文中数学公式抽象、核心概念难以直观把握的痛点，特别针对位置编码、自定义学习率调度及标签平滑等关键机制提供了可视化辅助。\n\n该项目不仅包含完整的模型代码和预训练的机器翻译模型，更独特的亮点在于内置了 `playground.py` 脚本。通过该脚本，用户可以将枯燥的公式转化为直观的图表，例如将复杂的位置编码公式映射为清晰的热力图，或将学习率变化曲线具象化，让“只可意会”的概念变得一目了然。\n\npytorch-original-transformer 非常适合希望从零掌握 Transformer 原理的开发者、学生及研究人员使用。对于想要探究注意力机制本质，而非直接调用黑盒 API 的技术人员来说，这是一个极佳的学习资源。配合代码中详尽的注释与可视化工具，它能帮助用户快速建立对现代大语言模型基石的深刻认知，是通往高阶 NLP 研究的理想起点。","## The Original Transformer (PyTorch) :computer: = :rainbow:\nThis repo contains PyTorch implementation of the original transformer paper (:link: [Vaswani et al.](https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.03762)). \u003Cbr\u002F>\nIt's aimed at making it **easy to start playing and learning** about transformers. \u003Cbr\u002F>\n\n## Table of Contents\n  * [What are transformers?](#what-are-transformers)\n  * [Understanding transformers](#understanding-transformers)\n  * [Machine translation](#machine-translation)\n  * [Setup](#setup)\n  * [Usage](#usage)\n  * [Hardware requirements](#hardware-requirements)\n\n## What are transformers\n\nTransformers were originally proposed by Vaswani et al. in a seminal paper called [Attention Is All You Need](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1706.03762.pdf).\n\nYou probably heard of transformers one way or another. **GPT-3 and BERT** to name a few well known ones :unicorn:. The main idea\nis that they showed that you don't have to use recurrent or convolutional layers and that simple architecture coupled with attention is super powerful. 
It\ngave the benefit of **much better long-range dependency modeling** and the architecture itself is highly **parallelizable** (:computer::computer::computer:) which leads to better compute efficiency!\n\nHere is what their beautifully simple architecture looks like:\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_1d40780ae5f0.png\" width=\"350\"\u002F>\n\u003C\u002Fp>\n\n## Understanding transformers\n\nThis repo is supposed to be a learning resource for understanding transformers, as the original transformer by itself is no longer SOTA.\n\nFor that purpose the code is (hopefully) well commented and I've included the `playground.py` where I've visualized a couple\nof concepts which are hard to explain using words but super simple once visualized. So here we go!\n\n### Positional Encodings\n\nCan you parse this one at a glance?\n\n\u003Cp align=\"left\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_8b873cd441e3.png\"\u002F>\n\u003C\u002Fp>\n\nNeither can I. Running the `visualize_positional_encodings()` function from `playground.py` we get this:\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_a943fc66050b.jpg\"\u002F>\n\u003C\u002Fp>\n\nDepending on the position of your source\u002Ftarget token you \"pick one row of this image\" and you add it to its embedding vector, that's it.\nThey could also be learned, but it's just more fancy to do it like this, obviously! :nerd_face:\n\n### Custom Learning Rate Schedule\n\nSimilarly, can you parse this one in `O(1)`?\n\n\u003Cp align=\"left\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_e5c46aa3ebf8.png\"\u002F>\n\u003C\u002Fp>\n\nNope? 
So I thought, here it is visualized:\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_54d6d9215890.png\"\u002F>\n\u003C\u002Fp>\n\nIt's super easy to understand now. Now whether this part was crucial for the success of transformer? I doubt it.\nBut it's cool and makes things more complicated. :nerd_face: (`.set_sarcasm(True)`)\n\n*Note: model dimension is basically the size of the embedding vector, baseline transformer used 512, the big one 1024*\n\n### Label Smoothing\n\nFirst time you hear of label smoothing it sounds tough but it's not. You usually set your target vocabulary distribution\nto a `one-hot`. Meaning 1 position out of 30k (or whatever your vocab size is) is set to 1. probability and everything else to 0.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_d68e27bf7661.png\" width=\"700\"\u002F>\n\u003C\u002Fp>\n\nIn label smoothing instead of placing 1. 
on that particular position you place say 0.9 and you evenly distribute the rest of\nthe \"probability mass\" over the other positions \n(that's visualized as a different shade of purple on the image above in a fictional vocab of size 4 - hence 4 columns)\n\n*Note: Pad token's distribution is set to all zeros as we don't want our model to predict those!*\n\nAside from this repo (well duh) I would highly recommend you go ahead and read [this amazing blog](https:\u002F\u002Fjalammar.github.io\u002Fillustrated-transformer\u002F) by Jay Alammar!\n\n## Machine translation\n\nTransformer was originally trained for the NMT (neural machine translation) task on the [WMT-14 dataset](https:\u002F\u002Ftorchtext.readthedocs.io\u002Fen\u002Flatest\u002Fdatasets.html#wmt14) for:\n* English to German translation task (achieved 28.4 [BLEU score](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBLEU))\n* English to French translation task (achieved 41.8 BLEU score)\n \nWhat I did (for now) is I trained my models on the [IWSLT dataset](https:\u002F\u002Ftorchtext.readthedocs.io\u002Fen\u002Flatest\u002Fdatasets.html#iwslt), which is much smaller, for the\nEnglish-German language pair, as I speak those languages so it's easier to debug and play around.\n\nI'll also train my models on WMT-14 soon, take a look at the [todos](#todos) section.\n\n---\n\nAnyways! Let's see what this repo can practically do for you! Well it can translate!\n\nSome short translations from my German to English IWSLT model: \u003Cbr\u002F>\u003Cbr\u002F>\nInput: `Ich bin ein guter Mensch, denke ich.` (\"gold\": I am a good person I think) \u003Cbr\u002F>\nOutput: `['\u003Cs>', 'I', 'think', 'I', \"'m\", 'a', 'good', 'person', '.', '\u003C\u002Fs>']` \u003Cbr\u002F>\nor in human-readable format: `I think I'm a good person.`\n\nWhich is actually pretty good! 
Maybe even better IMO than Google Translate's \"gold\" translation.\n\n---\n\nThere are of course failure cases like this: \u003Cbr\u002F>\u003Cbr\u002F>\nInput: `Hey Alter, wie geht es dir?` (How is it going dude?) \u003Cbr\u002F>\nOutput: `['\u003Cs>', 'Hey', ',', 'age', 'how', 'are', 'you', '?', '\u003C\u002Fs>']` \u003Cbr\u002F>\nor in human-readable format: `Hey, age, how are you?` \u003Cbr\u002F>\n\nWhich is actually also not completely bad! Because:\n* First of all the model was trained on IWSLT (TED-like conversations)\n* \"Alter\" is a colloquial expression for old buddy\u002Fdude\u002Fmate but its literal meaning is indeed age.\n\nSimilarly for the English to German model.\n\n## Setup\n\nSo we talked about what transformers are, and what they can do for you (among other things). \u003Cbr\u002F>\nLet's get this thing running! Follow the next steps:\n\n1. `git clone https:\u002F\u002Fgithub.com\u002Fgordicaleksa\u002Fpytorch-original-transformer`\n2. Open the Anaconda console and navigate into the project directory `cd path_to_repo`\n3. Run `conda env create` from the project directory (this will create a brand new conda environment).\n4. Run `activate pytorch-transformer` (for running scripts from your console or set the interpreter in your IDE)\n\nThat's it! It should work out of the box, since the environment.yml file takes care of the dependencies. \u003Cbr\u002F>\nIt may take a while as I'm automatically downloading SpaCy's statistical models for English and German.\n\n-----\n\nThe PyTorch pip package comes bundled with some version of CUDA\u002FcuDNN,\nbut it is highly recommended that you install a system-wide CUDA beforehand, mostly because of the GPU drivers. 
\nI also recommend using the Miniconda installer as a way to get conda on your system.\nFollow through points 1 and 2 of [this setup](https:\u002F\u002Fgithub.com\u002FPetlja\u002FPSIML\u002Fblob\u002Fmaster\u002Fdocs\u002FMachineSetup.md)\nand use the most up-to-date versions of Miniconda and CUDA\u002FcuDNN for your system.\n\n## Usage\n\n#### Option 1: Jupyter Notebook\n\nJust run `jupyter notebook` from your Anaconda console and it will open the session in your default browser. \u003Cbr\u002F>\nOpen `The Annotated Transformer ++.ipynb` and you're ready to play! \u003Cbr\u002F>\n\n---\n\n**Note:** if you get `DLL load failed while importing win32api: The specified module could not be found` \u003Cbr\u002F>\nJust do `pip uninstall pywin32` and then either `pip install pywin32` or `conda install pywin32` [should fix it](https:\u002F\u002Fgithub.com\u002Fjupyter\u002Fnotebook\u002Fissues\u002F4980)!\n\n#### Option 2: Use your IDE of choice\n\nYou just need to link the Python environment you created in the [setup](#setup) section.\n\n### Training\n\nTo run the training, start `training_script.py`; there are a couple of settings you will want to specify:\n* `--batch_size` - set this to the largest value that doesn't cause a CUDA out-of-memory error\n* `--dataset_name` - Pick between `IWSLT` and `WMT14` (WMT14 is not advisable [until I add](#todos) multi-GPU support)\n* `--language_direction` - Pick between `E2G` and `G2E`\n\nSo an example run (from the console) would look like this: \u003Cbr\u002F>\n`python training_script.py --batch_size 1500 --dataset_name IWSLT --language_direction G2E`\n\nThe code is well commented so you can (hopefully) understand how the training itself works. 
\u003Cbr\u002F>\n\nThe script will:\n* Dump checkpoint *.pth models into `models\u002Fcheckpoints\u002F`\n* Dump the final *.pth model into `models\u002Fbinaries\u002F`\n* Download IWSLT\u002FWMT-14 (the first time you run it and place it under `data\u002F`)\n* Dump [tensorboard data](#evaluating-nmt-models) into `runs\u002F`, just run `tensorboard --logdir=runs` from your Anaconda\n* Periodically write some training metadata to the console\n\n*Note: data loading is slow in torch text, and so I've implemented a custom wrapper which adds the caching mechanisms\nand makes things ~30x faster! (it'll be slow the first time you run stuff)*\n\n### Inference (Translating)\n\nThe second part is all about playing with the models and seeing how they translate! \u003Cbr\u002F>\nTo get some translations start the `translation_script.py`, there is a couple of settings you'll want to set:\n* `--source_sentence` - depending on the model you specify this should either be English\u002FGerman sentence\n* `--model_name` - one of the pretrained model names: `iwslt_e2g`, `iwslt_g2e` or your model(*)\n* `--dataset_name` - keep this in sync with the model, `IWSLT` if the model was trained on IWSLT\n* `--language_direction` - keep in sync, `E2G` if the model was trained to translate from English to German\n\n(*) Note: after you train your model it'll get dumped into `models\u002Fbinaries` see what it's name is and specify it via\nthe `--model_name` parameter if you want to play with it for translation purpose. 
If you specify some of the pretrained\nmodels they'll **automatically get downloaded** the first time you run the translation script.\n\nI'll link the IWSLT pretrained models here as well: [English to German](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fa6pfo6t9m2dh1jq\u002Fiwslt_e2g.pth?dl=1) and [German to English](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fdgcd4xhwig7ygqd\u002Fiwslt_g2e.pth?dl=1).\n\nThat's it! You can also visualize the attention; check out [this section](#visualizing-attention) for more info.\n\n### Evaluating NMT models\n\nI tracked 3 curves while training:\n* training loss (KL divergence, batchmean)\n* validation loss (KL divergence, batchmean)\n* BLEU-4 \n\n[BLEU is an n-gram based metric](https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002FP02-1040.pdf) for quantitatively evaluating the quality of machine translation models. \u003Cbr\u002F>\nI used the BLEU-4 metric provided by the awesome **nltk** Python module.\n\nCurrent results; models were trained for 20 epochs (DE stands for Deutsch, i.e. German in German :nerd_face:):\n\n| Model | BLEU score | Dataset |\n| --- | --- | --- |\n| [Baseline transformer (EN-DE)](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fa6pfo6t9m2dh1jq\u002Fiwslt_e2g.pth?dl=1) | **27.8** | IWSLT val |\n| [Baseline transformer (DE-EN)](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fdgcd4xhwig7ygqd\u002Fiwslt_g2e.pth?dl=1) | **33.2** | IWSLT val |\n| Baseline transformer (EN-DE) | x | WMT-14 val |\n| Baseline transformer (DE-EN) | x | WMT-14 val |\n\nI got these using greedy decoding, so it's a pessimistic estimate; I'll add beam decoding [soon](#todos).\n\n**Important note:** Initialization matters a lot for the transformer! 
I initially thought that other implementations\nusing Xavier initialization is again one of those arbitrary heuristics and that PyTorch default init will do - I was wrong:\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_d9fd72bf75af.png\" width=\"450\"\u002F>\n\u003C\u002Fp>\n\nYou can see here 3 runs, the 2 lower ones used PyTorch default initialization (one used `mean` for KL divergence\nloss and the better one used `batchmean`), whereas the upper one used **Xavier uniform** initialization!\n \n---\n\nIdea: you could potentially also periodically dump translations for a reference batch of source sentences. \u003Cbr\u002F>\nThat would give you some qualitative insight into how the transformer is doing, although I didn't do that. \u003Cbr\u002F>\nA similar thing is done when you have hard time quantitatively evaluating your model like in [GANs](https:\u002F\u002Fgithub.com\u002Fgordicaleksa\u002Fpytorch-gans) and [NST](https:\u002F\u002Fgithub.com\u002Fgordicaleksa\u002Fpytorch-nst-feedforward) fields.\n\n### Tracking using Tensorboard\n\nThe above plot is a snippet from my Azure ML run but when I run stuff locally I use Tensorboard.\n\nJust run `tensorboard --logdir=runs` from your Anaconda console and you can track your metrics during the training.\n\n### Visualizing attention\n\nYou can use the `translation_script.py` and set the `--visualize_attention` to True to additionally understand what your\nmodel was \"paying attention to\" in the source and target sentences.\n\nHere are the attentions I get for the input sentence `Ich bin ein guter Mensch, denke ich.`\n\nThese belong to layer 6 of the encoder. 
You can see all of the 8 multi-head attention heads.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_30f87adea660.png\" width=\"850\"\u002F>\n\u003C\u002Fp>\n\nAnd this one belongs to decoder layer 6 of the self-attention decoder MHA (multi-head attention) module. \u003Cbr\u002F>\nYou can notice an interesting **triangular pattern** which comes from the fact that target tokens can't look ahead!\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_aff1c090bad5.png\" width=\"850\"\u002F>\n\u003C\u002Fp>\n\nThe 3rd type of MHA module is the source attending one and it looks similar to the plot you saw for the encoder. \u003Cbr\u002F>\nFeel free to play with it at your own pace!\n\n*Note: there are obviously some bias problems with this model but I won't get into that analysis here*\n\n## Hardware requirements\n\nYou really need a decent hardware if you wish to train the transformer on the **WMT-14** dataset.\n\nThe authors took:\n* **12h on 8 P100 GPUs** to train the baseline model and **3.5 days** to train the big one.\n\nIf my calculations are right that amounts to ~19 epochs (100k steps, each step had ~25000 tokens and WMT-14 has ~130M src\u002Ftrg tokens)\nfor the baseline and 3x that for the big one (300k steps).\n\nOn the other hand it's much more feasible to train the model on the **IWSLT** dataset. 
It took me:\n* 13.2 min\u002Fepoch (1500 token batch) on my RTX 2080 machine (8 GBs of VRAM)\n* ~34 min\u002Fepoch (1500 token batch) on Azure ML's K80s (24 GBs of VRAM)\n\nI could have pushed K80s to 3500+ tokens\u002Fbatch but had some CUDA out of memory problems.\n\n### Todos:\n\nFinally there are a couple more todos which I'll hopefully add really soon:\n* Multi-GPU\u002Fmulti-node training support (so that you can train a model on WMT-14 for 19 epochs)\n* Beam decoding (turns out it's not that easy to implement this one!)\n* BPE and shared source-target vocab (I'm using SpaCy now)\n\nThe repo already has everything it needs, these are just the bonus points. I've tested everything\nfrom environment setup, to automatic model download, etc.\n\n## Video learning material\n\nIf you're having difficulties understanding the code I did an in-depth overview of the paper [in this video:](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=cbYxHkgkSVs)\n\n\u003Cp align=\"left\">\n\u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=cbYxHkgkSVs\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_db201dd34f9a.jpg\" \nalt=\"A deep dive into the attention is all you need paper\" width=\"480\" height=\"360\" border=\"10\" \u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\nI have some more videos which could further help you understand transformers:\n* [My approach to understanding NLP\u002Ftransformers](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=bvBK-coXf9I)\n* [Another overview of the paper (a bit higher level)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=n9sLZPLOxG8)\n* [A case study of how this project was developed](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=px4rtkWHFvM)\n\n## Acknowledgements\n\nI found these resources useful (while developing this one):\n\n* [The Annotated Transformer](http:\u002F\u002Fnlp.seas.harvard.edu\u002F2018\u002F04\u002F03\u002Fattention.html)\n* 
[PyTorch official implementation](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fblob\u002F187e23397c075ec2f6e89ea75d24371e3fbf9efa\u002Ftorch\u002Fnn\u002Fmodules\u002Ftransformer.py)\n\nI found some inspiration for the model design in the The Annotated Transformer but I found it hard to understand, and\nit had some bugs. It was mainly written with researchers in mind. Hopefully this repo opens up\nthe understanding of transformers to the common folk as well! :nerd_face:\n\n## Citation\n\nIf you find this code useful, please cite the following:\n\n```\n@misc{Gordić2020PyTorchOriginalTransformer,\n  author = {Gordić, Aleksa},\n  title = {pytorch-original-transformer},\n  year = {2020},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fgordicaleksa\u002Fpytorch-original-transformer}},\n}\n```\n\n## Connect with me\n\nIf you'd love to have some more AI-related content in your life :nerd_face:, consider:\n* Subscribing to my YouTube channel [The AI Epiphany](https:\u002F\u002Fwww.youtube.com\u002Fc\u002FTheAiEpiphany) :bell:\n* Follow me on [LinkedIn](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Faleksagordic\u002F) and [Twitter](https:\u002F\u002Ftwitter.com\u002Fgordic_aleksa) :bulb:\n* Follow me on [Medium](https:\u002F\u002Fgordicaleksa.medium.com\u002F) :books: :heart:\n\n## Licence\n\n[![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow.svg)](https:\u002F\u002Fgithub.com\u002Fgordicaleksa\u002Fpytorch-original-transformer\u002Fblob\u002Fmaster\u002FLICENCE)","## 原始 Transformer（PyTorch）：computer: = :rainbow:\n此仓库包含原始 Transformer 论文的 PyTorch 实现（:link: [Vaswani 等](https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.03762)）。\u003Cbr\u002F>\n其目标是让读者能够**轻松上手并学习** Transformer 模型。\u003Cbr\u002F>\n\n## 目录\n  * [什么是 Transformer？](#what-are-transformers)\n  * [理解 Transformer](#understanding-transformers)\n  * [机器翻译](#machine-translation)\n  * [设置](#setup)\n  * 
[使用方法](#usage)\n  * [硬件要求](#hardware-requirements)\n\n## 什么是 Transformer\n\nTransformer 最早由 Vaswani 等人在一篇开创性的论文《Attention Is All You Need》中提出（[链接](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1706.03762.pdf)）。\n\n你可能或多或少听说过 Transformer。比如**GPT-3 和 BERT**等知名模型 :unicorn:。其核心思想在于，他们证明了无需使用循环层或卷积层，仅通过简单的架构结合注意力机制就能发挥强大的作用。这带来了**更优秀的长距离依赖建模能力**，并且该架构本身高度**可并行化**（:computer::computer::computer:），从而显著提升计算效率！\n\n以下是其简洁而优美的架构示意图：\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_1d40780ae5f0.png\" width=\"350\"\u002F>\n\u003C\u002Fp>\n\n## 理解 Transformer\n\n本仓库旨在作为学习资源，帮助大家理解 Transformer 模型，因为原始的 Transformer 已经不再是当前的 SOTA 模型了。\n\n为此，代码中添加了详尽的注释，并且我编写了 `playground.py` 文件，用于可视化一些难以用文字描述、但通过图像却能直观理解的概念。接下来就让我们开始吧！\n\n### 位置编码\n\n你能一眼看懂这个公式吗？\n\n\u003Cp align=\"left\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_8b873cd441e3.png\"\u002F>\n\u003C\u002Fp>\n\n我也不行。不过，运行 `playground.py` 中的 `visualize_positional_encodings()` 函数后，我们得到如下可视化结果：\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_a943fc66050b.jpg\"\u002F>\n\u003C\u002Fp>\n\n根据源语言或目标语言词元的位置，只需“选取这张图中的一行”，将其加到对应的嵌入向量中即可，就这么简单。当然，位置编码也可以通过学习获得，但显然这样实现起来更加优雅！ :nerd_face:\n\n### 自定义学习率调度\n\n同样地，你能瞬间读懂下面这个公式吗？\n\n\u003Cp align=\"left\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_e5c46aa3ebf8.png\"\u002F>\n\u003C\u002Fp>\n\n不行吧？那我就把它可视化出来：\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_54d6d9215890.png\"\u002F>\n\u003C\u002Fp>\n\n现在是不是就很容易理解了呢？至于这部分是否对 Transformer 的成功至关重要，我表示怀疑。不过，它确实很酷，也让事情变得更复杂一些。 :nerd_face: （`.set_sarcasm(True)`）\n\n*注：模型维度基本上就是嵌入向量的大小，基准 Transformer 使用的是 512 
维，更大的版本则是 1024 维。*\n\n### 标签平滑\n\n初次听到“标签平滑”时，可能会觉得很难理解，但实际上并不复杂。通常情况下，我们会将目标词汇表分布设置为“独热编码”。也就是说，在 3 万个词（或你的词汇表大小是多少）中，只有一个位置的概率为 1，其余均为 0。\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_d68e27bf7661.png\" width=\"700\"\u002F>\n\u003C\u002Fp>\n\n而在应用标签平滑时，我们不会直接将概率设为 1，而是例如设为 0.9，然后将剩余的概率均匀分配到其他位置上。\n（如上图所示，在一个虚构的 4 个词的词汇表中，这些概率被表示为不同深浅的紫色条带——因此有 4 列。）\n\n*注：填充符（Pad token）的概率分布始终为零，因为我们不希望模型预测这些符号！*\n\n除了这个仓库之外（这还用说吗？），我还强烈推荐你阅读 Jay Alammar 的精彩博客：[这里](https:\u002F\u002Fjalammar.github.io\u002Fillustrated-transformer\u002F)！\n\n## 机器翻译\n\nTransformer 最初是在 [WMT-14 数据集](https:\u002F\u002Ftorchtext.readthedocs.io\u002Fen\u002Flatest\u002Fdatasets.html#wmt14) 上针对 NMT（神经机器翻译）任务进行训练的：\n* 英语到德语的翻译任务（BLEU 得分为 28.4）\n* 英语到法语的翻译任务（BLEU 得分为 41.8）\n\n目前，我则在规模小得多的 [IWSLT 数据集](https:\u002F\u002Ftorchtext.readthedocs.io\u002Fen\u002Flatest\u002Fdatasets.html#iwslt) 上，针对英德语言对训练了自己的模型。由于我本人会说这两种语言，调试和实验起来更加方便。\n\n不久之后，我也计划在 WMT-14 数据集上训练模型，请关注 [待办事项](#todos) 部分。\n\n---\n\n总之！让我们来看看这个仓库究竟能为你做些什么吧！没错，它可以进行翻译！\n\n以下是我基于 IWSLT 数据集训练的德英模型的一些简短翻译示例：\u003Cbr\u002F>\u003Cbr\u002F>\n输入：`Ich bin ein guter Mensch, denke ich.`（“金句”：我认为自己是个好人）\u003Cbr\u002F>\n输出：`['\u003Cs>', 'I', 'think', 'I', \"'m\", 'a', 'good', 'person', '.', '\u003C\u002Fs>']`\u003Cbr\u002F>\n或者以人类可读的形式呈现：`I think I'm a good person.`\n\n这其实相当不错！在我看来，甚至比 Google Translate 的“金句”翻译还要好。\n\n---\n\n当然也有失败的情况，比如：\u003Cbr\u002F>\u003Cbr\u002F>\n输入：`Hey Alter, wie geht es dir?`（哥们儿，最近怎么样？）\u003Cbr\u002F>\n输出：`['\u003Cs>', 'Hey', ',', 'age', 'how', 'are', 'you', '?', '\u003C\u002Fs>']`\u003Cbr\u002F>\n或者以人类可读的形式呈现：`Hey, age, how are you?`\u003Cbr\u002F>\n\n不过这也算不上完全糟糕，原因在于：\n* 首先，模型是在 IWSLT 数据集上训练的，该数据集主要包含类似 TED 演讲的对话内容。\n* “Alter” 是一种口语化的表达，意为“老兄”或“哥们儿”，但其字面意思确实是“年龄”。\n\n类似的逻辑也适用于英德翻译模型。\n\n## 设置\n\n我们已经讨论了 Transformer 是什么以及它能为你做什么（以及其他用途）。\u003Cbr\u002F>\n现在就让我们把它跑起来吧！请按照以下步骤操作：\n\n1. 
`git clone https:\u002F\u002Fgithub.com\u002Fgordicaleksa\u002Fpytorch-original-transformer`\n2. 打开 Anaconda 控制台，进入项目目录：`cd path_to_repo`\n3. 在项目目录下运行 `conda env create`（这将创建一个新的 conda 环境）。\n4. 运行 `activate pytorch-transformer`（以便从控制台运行脚本，或在 IDE 中设置解释器）。\n\n就这样！执行 `environment.yml` 文件后，应该可以开箱即用，自动处理所有依赖关系。\u003Cbr\u002F>\n由于需要自动下载 SpaCy 的英语和德语统计模型，这个过程可能需要一些时间。\n\n-----\n\n虽然 PyTorch 的 pip 包会自带某个版本的 CUDA\u002FcuDNN，但强烈建议你事先在系统级别安装 CUDA，主要是为了确保 GPU 驱动程序的兼容性。\n此外，我推荐使用 Miniconda 安装程序来在你的系统上部署 conda。\n请参考 [这篇设置指南](https:\u002F\u002Fgithub.com\u002FPetlja\u002FPSIML\u002Fblob\u002Fmaster\u002Fdocs\u002FMachineSetup.md) 中的第 1 和第 2 步，\n并为你的系统选择最新版本的 Miniconda 和 CUDA\u002FcuDNN。\n\n## 使用方法\n\n### 选项 1：Jupyter Notebook\n\n只需在 Anaconda 控制台中运行 `jupyter notebook`，它就会在您的默认浏览器中打开会话。\u003Cbr\u002F>\n打开 `The Annotated Transformer ++.ipynb`，您就可以开始玩了！\u003Cbr\u002F>\n\n---\n\n**注意：** 如果您遇到 `DLL 加载失败，在导入 win32api 时：指定的模块未找到` 的错误\u003Cbr\u002F>\n只需执行 `pip uninstall pywin32`，然后要么 `pip install pywin32`，要么 `conda install pywin32` [应该可以解决这个问题](https:\u002F\u002Fgithub.com\u002Fjupyter\u002Fnotebook\u002Fissues\u002F4980)！\n\n### 选项 2：使用您选择的 IDE\n\n您只需要将您在 [设置](#setup) 部分创建的 Python 环境链接到您的 IDE 即可。\n\n### 训练\n\n要运行训练，请启动 `training_script.py`。有几个参数您需要指定：\n* `--batch_size` - 这个参数很重要，应设置为不会导致 CUDA 内存不足的最大值。\n* `--dataset_name` - 在 `IWSLT` 和 `WMT14` 之间选择（在添加多 GPU 支持之前，不建议使用 `WMT14` [见待办事项](#todos)）。\n* `--language_direction` - 在 `E2G` 和 `G2E` 之间选择。\n\n因此，一个示例运行（从控制台）看起来如下：\u003Cbr\u002F>\n`python training_script.py --batch_size 1500 --dataset_name IWSLT --language_direction G2E`\n\n代码中有详细的注释，因此您应该能够理解训练过程是如何进行的。\u003Cbr\u002F>\n\n该脚本会：\n* 将检查点模型以 `.pth` 格式保存到 `models\u002Fcheckpoints\u002F` 目录下。\n* 将最终模型以 `.pth` 格式保存到 `models\u002Fbinaries\u002F` 目录下。\n* 下载 IWSLT\u002FWMT-14 数据集（首次运行时），并将其放置在 `data\u002F` 目录下。\n* 将 [tensorboard 数据](#evaluating-nmt-models) 保存到 `runs\u002F` 目录下，您只需在 Anaconda 中运行 `tensorboard --logdir=runs` 即可。\n* 定期将一些训练元数据输出到控制台。\n\n*注意：torch text 
中的数据加载速度较慢，因此我实现了一个自定义包装器，增加了缓存机制，使加载速度提高了约 30 倍！（首次运行时仍然会比较慢）*\n\n### 推理（翻译）\n\n第二部分是关于使用这些模型进行翻译，并观察它们的表现！\u003Cbr\u002F>\n要进行翻译，请启动 `translation_script.py`。有几个参数您需要设置：\n* `--source_sentence` - 根据您指定的模型，这应该是英语或德语句子。\n* `--model_name` - 预训练模型名称之一：`iwslt_e2g`、`iwslt_g2e`，或者您自己的模型(*)。\n* `--dataset_name` - 与模型保持一致，如果模型是在 IWSLT 数据集上训练的，则设置为 `IWSLT`。\n* `--language_direction` - 与模型保持一致，如果模型是用于英译德的，则设置为 `E2G`。\n\n(*) 注意：在您训练完模型后，它会被保存到 `models\u002Fbinaries` 目录下。请查看其文件名，并通过 `--model_name` 参数指定，以便用于翻译任务。如果您指定了一些预训练模型，它们将在您首次运行翻译脚本时 **自动下载**。\n\n我也会在此处提供 IWSLT 预训练模型的链接：[英译德](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fa6pfo6t9m2dh1jq\u002Fiwslt_e2g.pth?dl=1) 和 [德译英](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fdgcd4xhwig7ygqd\u002Fiwslt_g2e.pth?dl=1)。\n\n就是这样！您还可以可视化注意力机制，详情请参阅 [可视化注意力](#visualizing-attention) 部分。\n\n### 评估 NMT 模型\n\n我在训练过程中跟踪了三条曲线：\n* 训练损失（KL 散度，批次均值）\n* 验证损失（KL 散度，批次均值）\n* BLEU-4 分数\n\n[BLEU 是一种基于 n-gram 的指标](https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002FP02-1040.pdf)，用于定量评估机器翻译模型的质量。\u003Cbr\u002F>\n我使用了由优秀的 **nltk** Python 模块提供的 BLEU-4 指标。\n\n当前结果：模型均训练了 20 个 epoch（DE 代表德语，即 German in German :nerd_face:）：\n\n| 模型 | BLEU 分数 | 数据集 |\n| --- | --- | --- |\n| [基准 Transformer（EN-DE）](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fa6pfo6t9m2dh1jq\u002Fiwslt_e2g.pth?dl=1) | **27.8** | IWSLT 验证集 |\n| [基准 Transformer（DE-EN）](https:\u002F\u002Fwww.dropbox.com\u002Fs\u002Fdgcd4xhwig7ygqd\u002Fiwslt_g2e.pth?dl=1) | **33.2** | IWSLT 验证集 |\n| 基准 Transformer（EN-DE） | x | WMT-14 验证集 |\n| 基准 Transformer（DE-EN） | x | WMT-14 验证集 |\n\n这些结果是使用贪婪解码得到的，因此是一个偏保守的估计。我计划很快加入束搜索解码 [见待办事项](#todos)。\n\n**重要提示：** 初始化对 Transformer 的性能影响很大！我最初认为其他实现中使用的 Xavier 初始化只是一种任意的启发式方法，PyTorch 的默认初始化就足够了——但我错了：\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_d9fd72bf75af.png\" width=\"450\"\u002F>\n\u003C\u002Fp>\n\n您可以看到这里有 3 次运行，下面的 2 次使用了 PyTorch 的默认初始化（其中一次使用 KL 散度损失的 `mean`，另一次使用 
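`batchmean`), while the upper one used **Xavier uniform** initialization!\n\n---\n\nIdea: you could also periodically generate translations for a reference set of source sentences.\u003Cbr\u002F>\nThat would give you some qualitative insight into how the Transformer is behaving, although I didn't do that. A similar approach is common when a model is hard to evaluate quantitatively, as in the [GANs](https:\u002F\u002Fgithub.com\u002Fgordicaleksa\u002Fpytorch-gans) and [NST](https:\u002F\u002Fgithub.com\u002Fgordicaleksa\u002Fpytorch-nst-feedforward) fields.\n\n### Tracking using TensorBoard\n\nThe plot above is a snippet from my Azure ML run, but when I run things locally I use TensorBoard.\n\nJust run `tensorboard --logdir=runs` from your Anaconda console and you can track your metrics during training.\n\n### Visualizing attention\n\nYou can use `translation_script.py` with `--visualize_attention` set to `True` to get more insight into what your model is “paying attention to” in the source and target sentences.\n\nHere are the attention maps I got for the input sentence `Ich bin ein guter Mensch, denke ich.`\n\nThese belong to layer 6 of the encoder. You can see all 8 multi-head attention heads.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_30f87adea660.png\" width=\"850\"\u002F>\n\u003C\u002Fp>\n\nAnd this one belongs to the self-attention MHA (multi-head attention) module in layer 6 of the decoder.\u003Cbr\u002F>\nYou can notice an interesting **triangular pattern**, which comes from the fact that target tokens can't look ahead!\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_aff1c090bad5.png\" width=\"850\"\u002F>\n\u003C\u002Fp>\n\nThe third type of MHA module is the source-attending one, and its maps look similar to the encoder attention maps you saw above.\u003Cbr\u002F>\nFeel free to explore these attention maps at your own pace!\n\n*Note: the model clearly has some bias problems, but I won't get into that analysis here.*\n\n## Hardware requirements\n\nYou really need decent hardware if you wish to train the Transformer on the **WMT-14** dataset.\n\nThe authors took:\n* **12 hours** on 8 P100 GPUs to train the baseline model and **3.5 days** to train the big one.\n\nIf my calculations are right, that amounts to about 19 epochs for the baseline model (100k steps, each with roughly 25,000 tokens, and WMT-14 has about 130M source\u002Ftarget tokens), and three times as many steps, 300k, for the big model.\n\nOn the other hand, training on the **IWSLT** dataset is much easier. On my RTX 2080 machine (8 GB of VRAM) it took:\n* about 13.2 minutes per epoch (batch size of 1500 tokens);\nand on Azure ML K80s (24 GB of VRAM) it took:\n* about 34 minutes per epoch (batch size of 1500 tokens).\n\nI could have pushed the K80s to batches of 3500+ tokens, but I ran into some CUDA out-of-memory problems.\n\n### Todos:\n\nFinally, there are a few more todos which I hope to add soon:\n* Multi-GPU\u002Fmulti-node training support (so that you can train a model on WMT-14 for 19 epochs);\n* Beam decoding (it turns out it's not that easy to implement!);\n* BPE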
and a shared source-target vocabulary (I'm currently using SpaCy).\n\nThe repo already has everything it needs; these are just extras. I've tested everything from environment setup to automatic model download.\n\n## Video learning material\n\nIf you're having trouble understanding the code, I made an in-depth overview of the paper in this video: [watch here](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=cbYxHkgkSVs)\n\n\u003Cp align=\"left\">\n\u003Ca href=\"https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=cbYxHkgkSVs\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_readme_db201dd34f9a.jpg\" \nalt=\"An in-depth overview of the Attention Is All You Need paper\" width=\"480\" height=\"360\" border=\"10\" \u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\nI have some other videos that can help you understand Transformers better:\n* [My approach to understanding NLP\u002FTransformers](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=bvBK-coXf9I);\n* [Another overview of the paper (a bit higher level)](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=n9sLZPLOxG8);\n* [A case study of how this project was developed](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=px4rtkWHFvM).\n\n## Acknowledgements\n\nI found these resources useful while developing this project:\n\n* [The Annotated Transformer](http:\u002F\u002Fnlp.seas.harvard.edu\u002F2018\u002F04\u002F03\u002Fattention.html);\n* [PyTorch official implementation](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fblob\u002F187e23397c075ec2f6e89ea75d24371e3fbf9efa\u002Ftorch\u002Fnn\u002Fmodules\u002Ftransformer.py).\n\nI took some inspiration for the model design from The Annotated Transformer, but I found it hard to follow, and it had some bugs. It was mainly written with researchers in mind. Hopefully this repo also makes Transformers easier to understand for a wider audience! :nerd_face:\n\n## Citation\n\nIf you find this code useful, please cite the following:\n\n```\n@misc{Gordić2020PyTorchOriginalTransformer,\n  author = {Gordić, Aleksa},\n  title = {pytorch-original-transformer},\n  year = {2020},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fgordicaleksa\u002Fpytorch-original-transformer}},\n}\n```\n\n## Connect with me\n\nIf you'd love to have some more AI-related content in your life :nerd_face:, consider:\n* Subscribing to my YouTube channel [The AI Epiphany](https:\u002F\u002Fwww.youtube.com\u002Fc\u002FTheAiEpiphany) :bell:;\n* Following me on [LinkedIn](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Faleksagordic\u002F) and
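[Twitter](https:\u002F\u002Ftwitter.com\u002Fgordic_aleksa) :bulb:;\n* Following me on [Medium](https:\u002F\u002Fgordicaleksa.medium.com\u002F) :books: :heart:.\n\n## Licence\n\n[![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow.svg)](https:\u002F\u002Fgithub.com\u002Fgordicaleksa\u002Fpytorch-original-transformer\u002Fblob\u002Fmaster\u002FLICENCE)","# pytorch-original-transformer Quick Start Guide\n\nThis guide is based on the `pytorch-original-transformer` project and is meant to help developers quickly understand and run a PyTorch implementation of the original Transformer architecture (Vaswani et al., \"Attention Is All You Need\"). The code is heavily commented, which makes it well suited to learning how the Transformer works internally and to running machine translation experiments.\n\n## Prerequisites\n\nBefore you start, make sure your system meets the following requirements:\n\n*   **Operating system**: Linux, macOS, or Windows\n*   **Python environment**: installing **Miniconda** or Anaconda is recommended for managing dependencies.\n*   **GPU support (optional but recommended)**:\n    *   Although the project's PyTorch build ships with its own CUDA version, installing system-wide **CUDA** and **cuDNN** beforehand is strongly recommended to ensure GPU driver compatibility and the best performance.\n    *   Training a Transformer needs a fair amount of GPU memory; a CUDA-capable NVIDIA GPU is recommended.\n*   **Network**: the first run automatically downloads the SpaCy statistical models (English\u002FGerman) and the datasets, so make sure you have a working connection.\n\n## Installation\n\nThe project handles all dependencies automatically through a Conda environment file. Follow these steps:\n\n1.  **Clone the repository**\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002Fgordicaleksa\u002Fpytorch-original-transformer\n    cd pytorch-original-transformer\n    ```\n\n2.  **Create the Conda environment**\n    From the project root, run the following command. It creates a new environment named `pytorch-transformer` from `environment.yml` (the first run may take a few minutes to download dependencies):\n    ```bash\n    conda env create\n    ```\n\n3.  **Activate the environment**\n    ```bash\n    conda activate pytorch-transformer\n    ```\n    *Note: if you develop in an IDE (such as PyCharm or VS Code), set the interpreter to the Python in this environment.*\n\n    > **Note for Windows users**: if you hit a `DLL load failed while importing win32api` error when running Jupyter Notebook, run the following commands to fix it:\n    > ```bash\n    > pip uninstall pywin32\n    > pip install pywin32\n    > ```\n\n## Basic usage\n\nOnce installed, you can use the project in two ways: interactive learning (Jupyter Notebook) or command-line training\u002Finference.\n\n### Option 1: interactive learning (recommended for beginners)\n\nThe project ships with a detailed annotated notebook that visualizes core concepts such as positional encodings and the learning rate schedule.\n\n1.  Start Jupyter Notebook:\n    ```bash\n    jupyter notebook\n    ```\n2.  Open `The Annotated Transformer ++.ipynb` in your browser.\n3.  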
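Run the cells in order to observe the data flow and the model's internal state.\n\n### Option 2: command-line training and inference\n\n#### 1. Training\n\nUse `training_script.py` to start training. The following example trains a **German to English (G2E)** translation model on the **IWSLT** dataset:\n\n```bash\npython training_script.py --batch_size 1500 --dataset_name IWSLT --language_direction G2E\n```\n\n**Key parameters:**\n*   `--batch_size`: adjust to your GPU memory to avoid CUDA OOM errors.\n*   `--dataset_name`: `IWSLT` (smaller, good for debugging) or `WMT14` (larger, needs multi-GPU support).\n*   `--language_direction`: `E2G` (English to German) or `G2E` (German to English).\n\nDuring training, the script automatically:\n*   downloads the dataset to the `data\u002F` directory;\n*   saves checkpoints to `models\u002Fcheckpoints\u002F`;\n*   writes TensorBoard logs to `runs\u002F` (view them with `tensorboard --logdir=runs`).\n\n#### 2. Inference (translation)\n\nAfter training (or with a pretrained model), use `translation_script.py` to test translation.\n\n**Example with a pretrained model:**\n```bash\npython translation_script.py --source_sentence \"Ich bin ein guter Mensch, denke ich.\" --model_name iwslt_g2e --dataset_name IWSLT --language_direction G2E\n```\n\n**Using your own trained model:**\nIf you trained your own model, it is saved under the `models\u002Fbinaries\u002F` directory. Replace the `--model_name` value with that file name (without the `.pth` suffix):\n```bash\npython translation_script.py --source_sentence \"Your input sentence\" --model_name your_trained_model_name --dataset_name IWSLT --language_direction G2E\n```\n\n**Example output:**\nThe program prints the tokenized result and a human-readable translation, for example:\n`I think I'm a good person.`\n\n---\n*Tip: the project uses Xavier uniform initialization by default, which is critical to Transformer performance; do not casually switch to the default initialization.*","A researcher in a university natural language processing lab is guiding undergraduates through the Transformer architecture and needs to reproduce the machine translation experiments from the classic paper to verify how the attention mechanism works.\n\n### Without pytorch-original-transformer\n- Students are confused by the complex math in the original Vaswani et al. paper (such as positional encodings and the learning rate schedule) and struggle to turn the abstract theory into intuition.\n- Building model code from scratch that matches the original paper is time-consuming, and it is easy to introduce subtle errors in details such as masking or layer normalization.\n- Without visualization tools, students cannot watch how label smoothing changes the probability distribution, so their understanding of this regularization strategy stays superficial.\n- Finding pretrained weights and suitable datasets (such as IWSLT) is tedious, which slows down comparative neural machine translation (NMT) experiments.\n\n### With pytorch-original-transformer\n- By running the bundled `playground.py`, students can see the positional encoding plots and the learning rate curves directly, which makes the previously opaque formulas easy to read.\n- Using the heavily commented implementation that closely follows the original paper ensures architectural accuracy, so students can focus on the algorithm rather than on debugging infrastructure.\n- The label smoothing visualization shows clearly how probability mass is redistributed across non-target tokens, giving a deeper understanding of how it keeps the model from becoming overconfident.\n- Loading the provided IWSLT pretrained models lets students start testing translations immediately, greatly shortening the path from theory to hands-on practice.\n\npytorch-original-transformer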
greatly lowers the barrier for beginners to master the core mechanisms of the Transformer by visualizing obscure theory and providing a standardized reproduction baseline.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgordicaleksa_pytorch-original-transformer_a943fc66.jpg","gordicaleksa","Aleksa Gordić","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fgordicaleksa_5c51e394.png","Flirting with LLMs. Tensor Core maximalist. If I say stupid stuff it's not me it's my prompt.","ex-DeepMind, ex-Microsoft","San Francisco",null,"gordic_aleksa","https:\u002F\u002Fgordicaleksa.com\u002F","https:\u002F\u002Fgithub.com\u002Fgordicaleksa",[83,87],{"name":84,"color":85,"percentage":86},"Jupyter Notebook","#DA5B0B",53.1,{"name":88,"color":89,"percentage":90},"Python","#3572A5",46.9,1091,187,"2026-04-05T00:06:43","MIT","Windows, Linux, macOS","An NVIDIA GPU is required for training; the GPU memory needed depends on batch_size, which must be set low enough to avoid CUDA out of memory errors. Installing system-wide CUDA and cuDNN that match the PyTorch version beforehand is recommended","Not specified",{"notes":99,"python":100,"dependencies":101},"Using Miniconda to create an environment named 'pytorch-transformer' is recommended. The first run automatically downloads the SpaCy English and German statistical models and the datasets (IWSLT\u002FWMT-14), which takes a while. Multi-GPU training is not currently supported when training on WMT-14. Windows users who hit a 'DLL load failed' error need to reinstall pywin32. The code includes a custom data-loading cache to speed things up.","Managed through a conda environment (environment.yml); the exact version is not stated",[102,103,104,105,106,107,108],"pytorch","torchtext","spacy","nltk","tensorboard","jupyter","pywin32 (Windows only)",[35,14],[111,112,113,114,115,116,117,102,118,107,119,120,121,122],"transformer","transformers","pytorch-transformer","pytorch-transformers","attention","attention-mechanism","attention-is-all-you-need","python","transformer-tutorial","deeplearning","deep-learning","original-transformer","2026-03-27T02:49:30.150509","2026-04-06T17:06:20.766732",[],[]]