[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Tencent--TencentPretrain":3,"tool-Tencent--TencentPretrain":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",157379,2,"2026-04-15T23:32:42",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":77,"owner_url":78,"languages":79,"stars":84,"forks":85,"last_commit_at":86,"license":87,"difficulty_score":10,"env_os":88,"env_gpu":89,"env_ram":88,"env_deps":90,"category_tags":104,"github_topics":107,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":128,"updated_at":129,"faqs":130,"releases":161},7968,"Tencent\u002FTencentPretrain","TencentPretrain","Tencent Pre-training framework in PyTorch & Pre-trained Model Zoo ","TencentPretrain 是腾讯开源的一款基于 PyTorch 的预训练框架，旨在简化从预训练到微调的全流程。它主要解决了人工智能领域中多模态数据（如文本、视觉、音频）处理复杂、模型复现困难以及大规模训练门槛高等痛点。\n\n这款工具非常适合 AI 研究人员和开发者使用。无论是希望快速验证新算法的学术研究者，还是需要构建高性能应用的企业工程师，都能从中受益。TencentPretrain 继承了 UER  toolkit 的优势并进行了多模态扩展，其核心亮点在于高度模块化的设计。它将模型拆解为嵌入层、编码器、解码器等独立组件，用户可以像搭积木一样自由组合，灵活构建定制化的预训练模型。\n\n此外，TencentPretrain 具备出色的可扩展性，不仅支持 CPU、单卡及分布式训练，还集成了 DeepSpeed 
以应对超大规模模型的训练需求。框架内置了丰富的“模型动物园”，提供多种已预训练好的模型供直接调用或二次开发，并在多项下游任务中取得了业界领先的效果。通过提供清晰的接口和详尽的文档，TencentPretrain 让复杂的预训练技术变得更加触手可及，帮助用户高效地探索人工智能的前沿","TencentPretrain 是腾讯开源的一款基于 PyTorch 的预训练框架，旨在简化从预训练到微调的全流程。它主要解决了人工智能领域中多模态数据（如文本、视觉、音频）处理复杂、模型复现困难以及大规模训练门槛高等痛点。\n\n这款工具非常适合 AI 研究人员和开发者使用。无论是希望快速验证新算法的学术研究者，还是需要构建高性能应用的企业工程师，都能从中受益。TencentPretrain 继承了 UER  toolkit 的优势并进行了多模态扩展，其核心亮点在于高度模块化的设计。它将模型拆解为嵌入层、编码器、解码器等独立组件，用户可以像搭积木一样自由组合，灵活构建定制化的预训练模型。\n\n此外，TencentPretrain 具备出色的可扩展性，不仅支持 CPU、单卡及分布式训练，还集成了 DeepSpeed 以应对超大规模模型的训练需求。框架内置了丰富的“模型动物园”，提供多种已预训练好的模型供直接调用或二次开发，并在多项下游任务中取得了业界领先的效果。通过提供清晰的接口和详尽的文档，TencentPretrain 让复杂的预训练技术变得更加触手可及，帮助用户高效地探索人工智能的前沿应用。","[**English**](https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain) | [**中文**](https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fblob\u002Fmain\u002FREADME_ZH.md)\n\n## TencentPretrain: Tencent Pre-training Framework \n\nPre-training has become an essential part of AI technology. TencentPretrain is a toolkit for pre-training and fine-tuning on data of different modalities (e.g. text and vision). TencentPretrain is characterized by modular design. It facilitates the use of existing pre-training models, and provides interfaces for users to further extend upon. With TencentPretrain, we build a model zoo which contains pre-trained models of different properties. 
TencentPretrain inherits the open source toolkit UER (https:\u002F\u002Fgithub.com\u002Fdbiir\u002FUER-py\u002F) and extends it to a multimodal pre-training framework.\n\n#### **Full Documentation：https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fwiki**\n\n\u003Cbr>\n\nTable of Contents\n=================\n  * [Features](#features)\n  * [Requirements](#requirements)\n  * [Quickstart](#quickstart)\n  * [Pre-training data](#pre-training-data)\n  * [Downstream datasets](#downstream-datasets)\n  * [Modelzoo](#modelzoo)\n  * [Instructions](#instructions)\n  * [Competition solutions](#competition-solutions)\n  * [Citation](#citation)\n  * [Contact information](#contact-information)\n\n\n\u003Cbr\u002F>\n\n## Features\nTencentPretrain has the following features:\n- __Reproducibility__ TencentPretrain has been tested on many datasets and should match the performances of the original pre-training model implementations such as BERT, GPT-2, ELMo, T5, CLIP.\n- __Model modularity__ TencentPretrain is divided into the following parts: embedding, encoder, target embedding (optional), decoder (optional), and target. Ample modules are implemented in each part. Clear and robust interface allows users to combine modules to construct pre-training models with as few restrictions as possible.\n- __Multimodal__ TencentPretrain supports different modalities such as text, vision, and audio.\n- __Model training__ TencentPretrain supports CPU mode, single GPU mode, distributed training mode, and gigantic model training with DeepSpeed.\n- __Model zoo__ With the help of TencentPretrain, we pre-train and release models of different properties. Proper selection of pre-trained models is important to the performances of downstream tasks.\n- __SOTA results__ TencentPretrain supports comprehensive downstream tasks (e.g. 
classification and machine reading comprehension) and provides winning solutions of many competitions.\n- __Abundant functions__ TencentPretrain provides abundant functions related with pre-training, such as feature extractor and text generation.\n\n\n\u003Cbr\u002F>\n\n## Requirements\n* Python >= 3.6\n* torch >= 1.1\n* six >= 1.12.0\n* argparse\n* packaging\n* regex\n* For the pre-trained model conversion (related with TensorFlow) you will need TensorFlow\n* For the tokenization with sentencepiece model you will need [SentencePiece](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fsentencepiece)\n* For developing a stacking model you will need LightGBM and [BayesianOptimization](https:\u002F\u002Fgithub.com\u002Ffmfn\u002FBayesianOptimization)\n* For the pre-training with whole word masking you will need word segmentation tool such as [jieba](https:\u002F\u002Fgithub.com\u002Ffxsjy\u002Fjieba)\n* For the use of CRF in sequence labeling downstream task you will need [pytorch-crf](https:\u002F\u002Fgithub.com\u002Fkmkurn\u002Fpytorch-crf)\n* For the gigantic model training you will need [DeepSpeed](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed)\n* For the vision model training you will need torchvision\n* For the audio model training you will need torchaudio, and opencv-python is needed for some special settings of specaugment, and editdistance is needed when finetuning a speech2text model\n\n\n\u003Cbr\u002F>\n\n## Quickstart\nThis section uses several commonly-used examples to demonstrate how to use TencentPretrain. More details are discussed in Instructions section. We firstly use BERT (a text pre-training model) on book review sentiment classification dataset. We pre-train model on book review corpus and then fine-tune it on book review sentiment classification dataset. There are three input files: book review corpus, book review sentiment classification dataset, and vocabulary. 
All files are encoded in UTF-8 and included in this project.\n\nThe format of the corpus for BERT is as follows (one sentence per line and documents are delimited by empty lines)：\n```\ndoc1-sent1\ndoc1-sent2\ndoc1-sent3\n\ndoc2-sent1\n\ndoc3-sent1\ndoc3-sent2\n```\nThe book review corpus is obtained from the book review sentiment classification dataset. We remove labels and split a review into two parts from the middle to construct a document with two sentences (see *book_review_bert.txt* in the *corpora* folder). \n\nThe format of the classification dataset is as follows:\n```\nlabel    text_a\n1        instance1\n0        instance2\n1        instance3\n```\nLabel and instance are separated by \\t . The first row is a list of column names. The label ID should be an integer between (and including) 0 and n-1 for n-way classification.\n\nWe use Google's Chinese vocabulary file *models\u002Fgoogle_zh_vocab.txt*, which contains 21128 Chinese characters.\n\nWe first pre-process the book review corpus. In the pre-processing stage, the corpus needs to be processed into the format required by the specified pre-training model (*--data_processor*):\n```\npython3 preprocess.py --corpus_path corpora\u002Fbook_review_bert.txt --vocab_path models\u002Fgoogle_zh_vocab.txt \\\n                      --dataset_path dataset.pt --processes_num 8 --data_processor bert\n```\nNotice that *six>=1.12.0* is required.\n\nPre-processing is time-consuming. Using multiple processes can greatly speed up pre-processing (*--processes_num*). BERT tokenizer is used by default (*--tokenizer bert*). After pre-processing, the raw text is converted to *dataset.pt*, which is the input of *pretrain.py*. Then we download Google's pre-trained Chinese BERT model [*google_zh_model.bin*](https:\u002F\u002Fshare.weiyun.com\u002FFR4rPxc4) (in TencentPretrain format and the original model is from [here](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbert)), and put it in the *models* folder. 
We load the pre-trained Chinese BERT model and further pre-train it on the book review corpus. A pre-training model is usually composed of embedding, encoder, and target layers. To build a pre-training model, we should provide related information. The configuration file (*--config_path*) specifies the modules and hyper-parameters used by pre-training models. More details can be found in *models\u002Fbert\u002Fbase_config.json*. Suppose we have a machine with 8 GPUs:\n```\npython3 pretrain.py --dataset_path dataset.pt --vocab_path models\u002Fgoogle_zh_vocab.txt \\\n                    --pretrained_model_path models\u002Fgoogle_zh_model.bin \\\n                    --config_path models\u002Fbert\u002Fbase_config.json \\\n                    --output_model_path models\u002Fbook_review_model.bin \\\n                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \\\n                    --total_steps 5000 --save_checkpoint_steps 1000 --batch_size 32\n\nmv models\u002Fbook_review_model.bin-5000 models\u002Fbook_review_model.bin\n```\nNotice that the model trained by *pretrain.py* is given a suffix that records the training step (*--total_steps*). We could remove the suffix for ease of use.\n\nThen we fine-tune the pre-trained model on the downstream classification dataset. 
We use embedding and encoder layers of [*book_review_model.bin*](https:\u002F\u002Fshare.weiyun.com\u002FPnxMrRwZ), which is the output of *pretrain.py*:\n```\npython3 finetune\u002Frun_classifier.py --pretrained_model_path models\u002Fbook_review_model.bin \\\n                                   --vocab_path models\u002Fgoogle_zh_vocab.txt \\\n                                   --config_path models\u002Fbert\u002Fbase_config.json \\\n                                   --train_path datasets\u002Fbook_review\u002Ftrain.tsv \\\n                                   --dev_path datasets\u002Fbook_review\u002Fdev.tsv \\\n                                   --test_path datasets\u002Fbook_review\u002Ftest.tsv \\\n                                   --epochs_num 3 --batch_size 32\n``` \nThe default path of the fine-tuned classifier model is *models\u002Ffinetuned_model.bin* . It is noticeable that the actual batch size of pre-training is *--batch_size* times *--world_size* ; The actual batch size of downstream task (e.g. classification) is *--batch_size* . \nThen we do inference with the fine-tuned model. \n```\npython3 inference\u002Frun_classifier_infer.py --load_model_path models\u002Ffinetuned_model.bin \\\n                                          --vocab_path models\u002Fgoogle_zh_vocab.txt \\\n                                          --config_path models\u002Fbert\u002Fbase_config.json \\\n                                          --test_path datasets\u002Fbook_review\u002Ftest_nolabel.tsv \\\n                                          --prediction_path datasets\u002Fbook_review\u002Fprediction.tsv \\\n                                          --labels_num 2\n```\n*--test_path* specifies the path of the file to be predicted. The file should contain text_a column.\n*--prediction_path* specifies the path of the file with prediction results.\nWe need to explicitly specify the number of labels by *--labels_num*. 
The above dataset is a two-way classification dataset.\n\n\u003Cbr>\n\nThe above content provides basic ways of using TencentPretrain to pre-process, pre-train, fine-tune, and do inference. More use cases can be found in complete :arrow_right: [__quickstart__](https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fwiki\u002FQuickstart) :arrow_left: . The complete quickstart contains abundant use cases, covering most of the pre-training related application scenarios. It is recommended that users read the complete quickstart in order to use the project reasonably.\n\n\u003Cbr\u002F>\n\n## Pre-training data\nThis section provides links to a range of :arrow_right: [__pre-training data__](https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fwiki\u002FPretraining-data) :arrow_left: . TencentPretrain can load these pre-training data directly.\n\n\u003Cbr\u002F>\n\n## Downstream datasets\nThis section provides links to a range of :arrow_right: [__downstream datasets__](https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fwiki\u002FDownstream-datasets) :arrow_left: . TencentPretrain can load these datasets directly.\n\n\u003Cbr\u002F>\n\n## Modelzoo\nWith the help of TencentPretrain, we pre-trained models of different properties (e.g. models based on different modalities, encoders, and targets). Detailed introduction of pre-trained models and their download links can be found in :arrow_right: [__modelzoo__](https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fwiki\u002FModelzoo) :arrow_left: . 
All pre-trained models can be loaded by TencentPretrain directly.\n\n\u003Cbr\u002F>\n\n## Instructions\nTencentPretrain is organized as follows：\n```\nTencentPretrain\u002F\n    |--tencentpretrain\u002F\n    |    |--embeddings\u002F # contains modules of embedding component\n    |    |--encoders\u002F # contains modules of encoder component such as RNN, CNN, Transformer\n    |    |--decoders\u002F # contains modules of decoder component\n    |    |--targets\u002F # contains modules of target component such as language modeling, masked language modeling\n    |    |--layers\u002F # contains frequently-used NN layers\n    |    |--models\u002F # contains model.py, which combines modules of different components\n    |    |--utils\u002F # contains frequently-used utilities\n    |    |--model_builder.py\n    |    |--model_loader.py\n    |    |--model_saver.py\n    |    |--opts.py\n    |    |--trainer.py\n    |\n    |--corpora\u002F # contains pre-training data\n    |--datasets\u002F # contains downstream tasks\n    |--models\u002F # contains pre-trained models, vocabularies, and configuration files\n    |--scripts\u002F # contains useful scripts for pre-training models\n    |--finetune\u002F # contains fine-tuning scripts for downstream tasks\n    |--inference\u002F # contains inference scripts for downstream tasks\n    |\n    |--preprocess.py\n    |--pretrain.py\n    |--README.md\n    |--README_ZH.md\n    |--requirements.txt\n    |--LICENSE\n\n```\n\nThe code is organized based on components (e.g. embeddings, encoders). 
Users can use and extend it with little effort.\n\nComprehensive examples of using TencentPretrain can be found in :arrow_right: [__instructions__](https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fwiki\u002FInstructions) :arrow_left: , which help users quickly implement pre-training models such as BERT, GPT-2, ELMo, T5, CLIP and fine-tune pre-trained models on a range of downstream tasks.\n\n\u003Cbr\u002F>\n\n## Competition solutions\nTencentPretrain has been used in winning solutions of many competitions. In this section, we provide some examples of using TencentPretrain to achieve SOTA results on competitions, such as CLUE. See :arrow_right: [__competition solutions__](https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fwiki\u002FCompetition-solutions) :arrow_left: for more detailed information.\n\n\u003Cbr\u002F>\n\n## Citation\n#### If you are using the work (e.g. pre-trained models) in TencentPretrain for academic work, please cite the [system paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06385.pdf) published in ACL 2023:\n```\n@article{zhao2023tencentpretrain,\n  title={TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities},\n  author={Zhao, Zhe and Li, Yudong and Hou, Cheng and Zhao, Jing and others},\n  journal={ACL 2023},\n  pages={217},\n  year={2023}\n}\n```\n","[**English**](https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain) | [**中文**](https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fblob\u002Fmain\u002FREADME_ZH.md)\n\n## TencentPretrain: 腾讯预训练框架 \n\n预训练已成为人工智能技术的重要组成部分。TencentPretrain 是一个用于对不同模态数据（例如文本和视觉）进行预训练和微调的工具包。TencentPretrain 的特点是模块化设计，既便于使用现有的预训练模型，又为用户提供了进一步扩展的接口。借助 TencentPretrain，我们构建了一个包含多种性质预训练模型的模型库。TencentPretrain 继承了开源工具包 UER (https:\u002F\u002Fgithub.com\u002Fdbiir\u002FUER-py\u002F)，并将其扩展为一个多模态预训练框架。\n\n#### 
**完整文档：https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fwiki**\n\n\u003Cbr>\n\n目录\n=================\n  * [特性](#features)\n  * [要求](#requirements)\n  * [快速入门](#quickstart)\n  * [预训练数据](#pre-training-data)\n  * [下游数据集](#downstream-datasets)\n  * [模型库](#modelzoo)\n  * [使用说明](#instructions)\n  * [竞赛解决方案](#competition-solutions)\n  * [引用](#citation)\n  * [联系方式](#contact-information)\n\n\n\u003Cbr\u002F>\n\n## 特性\nTencentPretrain 具有以下特点：\n- __可复现性__ TencentPretrain 已在多个数据集上进行了测试，其性能应与原始预训练模型实现（如 BERT、GPT-2、ELMo、T5、CLIP）一致。\n- __模型模块化__ TencentPretrain 分为嵌入层、编码器、目标嵌入层（可选）、解码器（可选）和目标等部分。每个部分都实现了丰富的模块，清晰且稳健的接口使用户能够自由组合这些模块，以尽可能少的限制构建预训练模型。\n- __多模态__ TencentPretrain 支持文本、视觉和音频等多种模态。\n- __模型训练__ TencentPretrain 支持 CPU 模式、单 GPU 模式、分布式训练模式以及使用 DeepSpeed 进行超大规模模型训练。\n- __模型库__ 在 TencentPretrain 的帮助下，我们预训练并发布了多种性质的模型。合理选择预训练模型对于下游任务的性能至关重要。\n- __SOTA 结果__ TencentPretrain 支持全面的下游任务（如分类和机器阅读理解），并提供了多项竞赛的优胜方案。\n- __丰富功能__ TencentPretrain 提供了与预训练相关的丰富功能，例如特征提取器和文本生成等。\n\n\n\u003Cbr\u002F>\n\n## 需求\n* Python >= 3.6\n* torch >= 1.1\n* six >= 1.12.0\n* argparse\n* packaging\n* regex\n* 对于预训练模型的转换（与 TensorFlow 相关），您需要 TensorFlow\n* 对于使用 SentencePiece 模型进行分词，您需要 [SentencePiece](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fsentencepiece)\n* 对于开发堆叠模型，您需要 LightGBM 和 [BayesianOptimization](https:\u002F\u002Fgithub.com\u002Ffmfn\u002FBayesianOptimization)\n* 对于使用全词掩码进行预训练，您需要像 [jieba](https:\u002F\u002Fgithub.com\u002Ffxsjy\u002Fjieba) 这样的分词工具\n* 对于在序列标注下游任务中使用 CRF，您需要 [pytorch-crf](https:\u002F\u002Fgithub.com\u002Fkmkurn\u002Fpytorch-crf)\n* 对于超大规模模型训练，您需要 [DeepSpeed](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FDeepSpeed)\n* 对于视觉模型训练，您需要 torchvision\n* 对于音频模型训练，您需要 torchaudio，并且在某些 specaugment 的特殊设置中需要 opencv-python，在微调 speech2text 模型时则需要 editdistance\n\n\n\u003Cbr\u002F>\n\n## 快速入门\n本节通过几个常用示例展示如何使用 TencentPretrain。更多详细信息请参阅“使用说明”部分。我们首先在书评情感分类数据集上使用 BERT（一种文本预训练模型）。我们先在书评语料库上对模型进行预训练，然后再在书评情感分类数据集上进行微调。输入文件共有三个：书评语料库、书评情感分类数据集和词汇表。所有文件均采用 UTF-8 
编码，并包含在本项目中。\n\nBERT 语料库的格式如下（每行一个句子，文档之间用空行分隔）：\n```\ndoc1-sent1\ndoc1-sent2\ndoc1-sent3\n\ndoc2-sent1\n\ndoc3-sent1\ndoc3-sent2\n```\n书评语料库是从书评情感分类数据集中获取的。我们移除标签，并从中间将每篇书评分成两部分，以构建一个包含两个句子的文档（详见 `corpora` 文件夹中的 `book_review_bert.txt`）。\n\n分类数据集的格式如下：\n```\nlabel    text_a\n1        instance1\n0        instance2\n1        instance3\n```\n标签和实例之间用制表符 `\\t` 分隔。第一行为列名列表。对于 n 类分类问题，标签 ID 应为 0 到 n-1 之间的整数（包括 0 和 n-1）。\n\n我们使用 Google 的中文词汇表文件 `models\u002Fgoogle_zh_vocab.txt`，其中包含 21128 个汉字。\n\n我们首先对书评语料库进行预处理。在预处理阶段，需要将语料库处理成指定预训练模型所需的格式（`--data_processor`）：\n```\npython3 preprocess.py --corpus_path corpora\u002Fbook_review_bert.txt --vocab_path models\u002Fgoogle_zh_vocab.txt \\\n                      --dataset_path dataset.pt --processes_num 8 --data_processor bert\n```\n请注意，需要安装 `six>=1.12.0`。\n\n预处理过程较为耗时。使用多进程可以显著加快预处理速度（`--processes_num`）。默认使用 BERT 分词器（`--tokenizer bert`）。预处理完成后，原始文本会被转换为 `dataset.pt`，这是 `pretrain.py` 的输入。然后我们下载 Google 预训练的中文 BERT 模型 [`google_zh_model.bin`](https:\u002F\u002Fshare.weiyun.com\u002FFR4rPxc4)（TencentPretrain 格式，原始模型来自 [这里](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fbert)），并将其放入 `models` 文件夹中。我们加载该预训练的中文 BERT 模型，并在其基础上继续在书评语料库上进行预训练。预训练模型通常由嵌入层、编码器和目标层组成。为了构建预训练模型，我们需要提供相关信息。配置文件（`--config_path`）指定了预训练模型使用的模块和超参数。更多细节请参阅 `models\u002Fbert\u002Fbase_config.json`。假设我们有一台配备 8 块 GPU 的机器：\n```\npython3 pretrain.py --dataset_path dataset.pt --vocab_path models\u002Fgoogle_zh_vocab.txt \\\n                    --pretrained_model_path models\u002Fgoogle_zh_model.bin \\\n                    --config_path models\u002Fbert\u002Fbase_config.json \\\n                    --output_model_path models\u002Fbook_review_model.bin \\\n                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \\\n                    --total_steps 5000 --save_checkpoint_steps 1000 --batch_size 32\n\nmv models\u002Fbook_review_model.bin-5000 models\u002Fbook_review_model.bin\n```\n请注意，由 `pretrain.py` 
训练的模型会附加一个记录训练步数的后缀（`--total_steps`）。我们可以移除该后缀以便于使用。\n\n接下来，我们在下游分类数据集上对预训练模型进行微调。我们使用 `pretrain.py` 输出的 [`book_review_model.bin`](https:\u002F\u002Fshare.weiyun.com\u002FPnxMrRwZ) 中的嵌入层和编码器：\n```\npython3 finetune\u002Frun_classifier.py --pretrained_model_path models\u002Fbook_review_model.bin \\\n                                   --vocab_path models\u002Fgoogle_zh_vocab.txt \\\n                                   --config_path models\u002Fbert\u002Fbase_config.json \\\n                                   --train_path datasets\u002Fbook_review\u002Ftrain.tsv \\\n                                   --dev_path datasets\u002Fbook_review\u002Fdev.tsv \\\n                                   --test_path datasets\u002Fbook_review\u002Ftest.tsv \\\n                                   --epochs_num 3 --batch_size 32\n``` \n微调后的分类模型默认保存路径为 `models\u002Ffinetuned_model.bin`。需要注意的是，预训练任务的实际批量大小为 `--batch_size` 乘以 `--world_size`；而下游任务（如分类）的实际批量大小则为 `--batch_size`。\n\n随后，我们使用微调后的模型进行推理：\n```\npython3 inference\u002Frun_classifier_infer.py --load_model_path models\u002Ffinetuned_model.bin \\\n                                          --vocab_path models\u002Fgoogle_zh_vocab.txt \\\n                                          --config_path models\u002Fbert\u002Fbase_config.json \\\n                                          --test_path datasets\u002Fbook_review\u002Ftest_nolabel.tsv \\\n                                          --prediction_path datasets\u002Fbook_review\u002Fprediction.tsv \\\n                                          --labels_num 2\n```\n`--test_path` 指定待预测文件的路径，该文件应包含 `text_a` 列。`--prediction_path` 指定包含预测结果的文件路径。我们需要通过 `--labels_num` 显式指定标签数量。上述数据集为二分类数据集。\n\n\u003Cbr>\n\n以上内容提供了使用 TencentPretrain 进行预处理、预训练、微调和推理的基本方法。更多用法案例请参阅完整的 :arrow_right: [__快速入门__](https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fwiki\u002FQuickstart) :arrow_left: 。完整版快速入门包含了丰富的用例，覆盖了大多数与预训练相关的应用场景。建议用户阅读完整版快速入门，以便合理使用该项目。\n\n\u003Cbr\u002F>\n\n## 预训练数据\n本节提供了一系列 
:arrow_right: [__预训练数据__](https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fwiki\u002FPretraining-data) :arrow_left: 的链接。TencentPretrain 可以直接加载这些预训练数据。\n\n\u003Cbr\u002F>\n\n## 下游数据集\n本节提供了一系列 :arrow_right: [__下游数据集__](https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fwiki\u002FDownstream-datasets) :arrow_left: 的链接。TencentPretrain 可以直接加载这些数据集。\n\n\u003Cbr\u002F>\n\n## 模型库\n借助 TencentPretrain，我们预训练了多种不同性质的模型（例如基于不同模态、编码器和目标的模型）。预训练模型的详细介绍及其下载链接请参阅 :arrow_right: [__模型库__](https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fwiki\u002FModelzoo) :arrow_left: 。所有预训练模型均可由 TencentPretrain 直接加载。\n\n\u003Cbr\u002F>\n\n## 使用说明\nTencentPretrain 的目录结构如下：\n```\nTencentPretrain\u002F\n    |--tencentpretrain\u002F\n    |    |--embeddings\u002F # 包含嵌入组件的模块\n    |    |--encoders\u002F # 包含编码器组件的模块，如 RNN、CNN、Transformer\n    |    |--decoders\u002F # 包含解码器组件的模块\n    |    |--targets\u002F # 包含目标组件的模块，如语言模型、掩码语言模型\n    |    |--layers\u002F # 包含常用的神经网络层\n    |    |--models\u002F # 包含 model.py，用于组合不同组件的模块\n    |    |--utils\u002F # 包含常用工具函数\n    |    |--model_builder.py\n    |    |--model_loader.py\n    |    |--model_saver.py\n    |    |--opts.py\n    |    |--trainer.py\n    |\n    |--corpora\u002F # 包含预训练数据\n    |--datasets\u002F # 包含下游任务数据集\n    |--models\u002F # 包含预训练模型、词汇表和配置文件\n    |--scripts\u002F # 包含用于预训练模型的实用脚本\n    |--finetune\u002F # 包含下游任务微调脚本\n    |--inference\u002F # 包含下游任务推理脚本\n    |\n    |--preprocess.py\n    |--pretrain.py\n    |--README.md\n    |--README_ZH.md\n    |--requirements.txt\n    |--LICENSE\n\n```\n\n代码按照组件划分（如嵌入、编码器等），用户可以轻松使用并在此基础上进行扩展。\n\n关于 TencentPretrain 的全面使用示例，请参阅 :arrow_right: [__使用说明__](https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fwiki\u002FInstructions) :arrow_left: ，这些示例可以帮助用户快速实现 BERT、GPT-2、ELMo、T5、CLIP 等预训练模型，并在多种下游任务上对预训练模型进行微调。\n\n\u003Cbr\u002F>\n\n## 竞赛解决方案\nTencentPretrain 已被用于多项竞赛的获奖方案中。在本节中，我们提供了一些使用 TencentPretrain 在 CLUE 等竞赛中取得 SOTA 成绩的示例。更多详细信息请参阅 :arrow_right: 
[__竞赛解决方案__](https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fwiki\u002FCompetition-solutions) :arrow_left: 。\n\n\u003Cbr\u002F>\n\n## 引用\n#### 如果您在学术研究中使用了 TencentPretrain 中的工作（例如预训练模型），请引用发表于 ACL 2023 的 [系统论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.06385.pdf)：\n```\n@article{zhao2023tencentpretrain,\n  title={TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities},\n  author={Zhao, Zhe and Li, Yudong and Hou, Cheng and Zhao, Jing and others},\n  journal={ACL 2023},\n  pages={217},\n  year={2023}\n}\n```","# TencentPretrain 快速上手指南\n\nTencentPretrain 是腾讯开源的多模态预训练框架，支持文本、视觉、音频等多种模态。它采用模块化设计，便于用户复现经典模型（如 BERT、GPT-2、CLIP 等）或构建自定义预训练模型。\n\n## 环境准备\n\n### 系统要求\n- **Python**: >= 3.6\n- **PyTorch**: >= 1.1\n- **操作系统**: Linux \u002F macOS \u002F Windows (推荐 Linux)\n\n### 前置依赖\n根据使用场景，可能需要安装以下额外依赖：\n- **基础依赖**: `six`, `argparse`, `packaging`, `regex`\n- **分词工具**: `jieba` (中文全词掩码预训练), `SentencePiece`\n- **深度学习加速**: `DeepSpeed` (超大模型训练), `torchvision` (视觉), `torchaudio` (音频)\n- **其他**: `TensorFlow` (模型转换), `pytorch-crf` (序列标注), `LightGBM` (堆叠模型)\n\n建议先安装基础依赖：\n```bash\npip install six>=1.12.0 argparse packaging regex\n```\n\n## 安装步骤\n\n1. **克隆项目代码**\n   ```bash\n   git clone https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain.git\n   cd TencentPretrain\n   ```\n\n2. **安装 Python 依赖**\n   推荐使用国内镜像源加速安装：\n   ```bash\n   pip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n   ```\n   *注：若需使用 DeepSpeed 或特定模态功能，请参照“环境准备”章节单独安装对应库。*\n\n## 基本使用\n\n以下示例演示如何使用 BERT 模型进行**书评情感分类**任务，流程包含：数据预处理 -> 继续预训练 -> 微调 -> 推理。\n\n### 1. 
数据预处理\n将原始文本语料转换为框架所需的 `.pt` 格式。假设已有语料文件 `corpora\u002Fbook_review_bert.txt` 和词表 `models\u002Fgoogle_zh_vocab.txt`：\n\n```bash\npython3 preprocess.py --corpus_path corpora\u002Fbook_review_bert.txt --vocab_path models\u002Fgoogle_zh_vocab.txt \\\n                      --dataset_path dataset.pt --processes_num 8 --data_processor bert\n```\n*提示：`--processes_num` 可指定多进程加速处理。*\n\n### 2. 继续预训练 (Optional)\n加载谷歌官方中文 BERT 权重，在书评语料上继续预训练。假设机器有 8 张 GPU：\n\n```bash\npython3 pretrain.py --dataset_path dataset.pt --vocab_path models\u002Fgoogle_zh_vocab.txt \\\n                    --pretrained_model_path models\u002Fgoogle_zh_model.bin \\\n                    --config_path models\u002Fbert\u002Fbase_config.json \\\n                    --output_model_path models\u002Fbook_review_model.bin \\\n                    --world_size 8 --gpu_ranks 0 1 2 3 4 5 6 7 \\\n                    --total_steps 5000 --save_checkpoint_steps 1000 --batch_size 32\n\n# 重命名模型文件（去除步数后缀）\nmv models\u002Fbook_review_model.bin-5000 models\u002Fbook_review_model.bin\n```\n*注：若无继续预训练需求，可直接使用下载的 `google_zh_model.bin` 进行微调。*\n\n### 3. 下游任务微调\n使用预训练模型在标注好的分类数据集上进行微调：\n\n```bash\npython3 finetune\u002Frun_classifier.py --pretrained_model_path models\u002Fbook_review_model.bin \\\n                                   --vocab_path models\u002Fgoogle_zh_vocab.txt \\\n                                   --config_path models\u002Fbert\u002Fbase_config.json \\\n                                   --train_path datasets\u002Fbook_review\u002Ftrain.tsv \\\n                                   --dev_path datasets\u002Fbook_review\u002Fdev.tsv \\\n                                   --test_path datasets\u002Fbook_review\u002Ftest.tsv \\\n                                   --epochs_num 3 --batch_size 32\n```\n微调后的模型默认保存为 `models\u002Ffinetuned_model.bin`。\n\n### 4. 
模型推理\n使用微调后的模型对无标签数据进行预测：\n\n```bash\npython3 inference\u002Frun_classifier_infer.py --load_model_path models\u002Ffinetuned_model.bin \\\n                                          --vocab_path models\u002Fgoogle_zh_vocab.txt \\\n                                          --config_path models\u002Fbert\u002Fbase_config.json \\\n                                          --test_path datasets\u002Fbook_review\u002Ftest_nolabel.tsv \\\n                                          --prediction_path datasets\u002Fbook_review\u002Fprediction.tsv \\\n                                          --labels_num 2\n```\n*说明：`--labels_num` 需明确指定分类类别数量（此处为二分类）。*\n\n---\n更多高级用法（如多模态训练、大模型分布式训练等）请参考官方完整文档：[TencentPretrain Wiki](https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fwiki)","某电商公司的算法团队需要快速构建一个能理解商品图文描述的多模态情感分析系统，以精准挖掘用户评论中的潜在需求。\n\n### 没有 TencentPretrain 时\n- **重复造轮子成本高**：团队需从零编写 BERT、CLIP 等模型的预训练代码，难以保证复现官方论文的性能指标，调试耗时数周。\n- **多模态融合困难**：文本与图像数据分别由不同框架处理，缺乏统一的模块化接口，导致跨模态特征对齐和联合训练极其复杂。\n- **资源适配灵活性差**：面对海量商品数据，手动配置分布式训练或显存优化（如 DeepSpeed）门槛高，容易因显存溢出导致训练中断。\n- **模型选型盲目**：缺乏内置的模型库（Model Zoo），团队无法直接调用针对不同任务优化过的预训练权重，只能盲目尝试基础模型。\n\n### 使用 TencentPretrain 后\n- **开箱即用的复现能力**：直接调用内置的 BERT 和 CLIP 模块，几行命令即可复现 SOTA 效果，将模型验证周期从数周缩短至几天。\n- **模块化构建多模态模型**：利用其清晰的嵌入、编码器和目标模块接口，像搭积木一样快速组合文本与视觉组件，轻松实现图文联合预训练。\n- **弹性训练架构**：原生支持单卡、分布式及 DeepSpeed 超大模型训练模式，自动适配现有算力资源，稳定处理亿级商品图文数据。\n- **精准模型匹配**：直接从 Model Zoo 加载针对分类任务优化的预训练权重进行微调，显著提升了情感分析的准确率并赢得内部算法竞赛。\n\nTencentPretrain 通过高度模块化和多模态支持，将复杂的预训练流程标准化，让团队能专注于业务逻辑而非底层架构搭建。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FTencent_TencentPretrain_9968eb52.png","Tencent","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FTencent_f7e55588.png","",null,"https:\u002F\u002Fopensource.tencent.com","https:\u002F\u002Fgithub.com\u002FTencent",[80],{"name":81,"color":82,"percentage":83},"Python","#3572A5",100,1088,147,"2026-04-14T14:09:07","NOASSERTION","未说明","非必需（支持 CPU 模式）；支持单卡及分布式多卡训练；超大模型训练需 DeepSpeed；视觉任务需 torchvision，音频任务需 
torchaudio",{"notes":91,"python":92,"dependencies":93},"该工具支持文本、视觉和音频多模态预训练。基础运行仅需 PyTorch，但特定功能需额外依赖：如全词掩码需 jieba，序列标注 CRF 需 pytorch-crf，堆叠模型需 LightGBM 和 BayesianOptimization。预处理阶段建议使用多进程加速。",">=3.6",[94,95,96,97,98,99,100,101,102,103],"torch>=1.1","six>=1.12.0","argparse","packaging","regex","TensorFlow (可选，用于模型转换)","SentencePiece (可选，用于分词)","DeepSpeed (可选，用于超大模型训练)","torchvision (可选，用于视觉模型)","torchaudio (可选，用于音频模型)",[15,35,105,14,106,13],"视频","音频",[108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127],"albert","bart","bert","chinese","classification","clue","elmo","fine-tuning","gpt","gpt-2","model-zoo","natural-language-processing","ner","pegasus","pre-training","pytorch","roberta","t5","unilm","xlm-roberta","2026-03-27T02:49:30.150509","2026-04-16T08:14:14.541308",[131,136,141,146,151,156],{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},35668,"在微调 Llama 模型时遇到返回码 -9 (return code = -9) 错误，如何解决？","该错误通常是由于显存不足（Out of Memory）导致的。即使将 batch size 设置为 1 仍可能报错，这可能是因为 GPU 显存不够（例如使用 32GB 显存的 V100 运行 7B 模型）。解决方案是更换更大显存的 GPU（建议每张卡至少 64GB），或者尝试减少模型并行度。如果是 CPU 内存溢出，则需要增加系统内存资源。","https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fissues\u002F28",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},35669,"预处理或训练 Llama 模型时报错 'NoneType' object has no attribute 'contiguous' 或 'an integer is required (got type NoneType)'，原因是什么？","这是因为 Llama 模型的特殊 token（special tokens）与代码默认设置不一致，导致数据预处理过程中产生了 None 值。解决方法是参考官方文档中的步骤修改特殊 token 字典，然后再重新进行数据预处理。具体操作请查看：https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fblob\u002Fmain\u002Fdocuments\u002Fllama.md 中的步骤 3。","https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fissues\u002F42",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},35670,"如何在多机多卡环境下启动预训练任务？","框架支持多机多卡训练。可以通过 deepspeed 启动，添加 --hostfile 参数指定主机文件；或者不使用 deepspeed 指令，直接使用 torchrun 启动。在使用 deepspeed 时，可以在命令中指定 --master_ip 参数（例如 
tcp:\u002F\u002F172.0.67.6:12914）来连接主节点。确保所有机器网络互通且配置正确。","https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fissues\u002F94",{"id":147,"question_zh":148,"answer_zh":149,"source_url":150},35671,"目前框架是否支持自定义配置张量并行（Tensor Parallelism）和流水线并行（Pipeline Parallelism）？","截至目前，该框架尚不支持自定义配置张量并行度和流水线并行度。","https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fissues\u002F65",{"id":152,"question_zh":153,"answer_zh":154,"source_url":155},35672,"在多卡预训练保存模型时，为什么每张卡都保存了一个模型文件？这些文件是一样的吗？","这是正常现象。在分布式训练中，虽然每个进程（rank）都会执行保存逻辑，但通常保存的是完整的模型状态或者通过机制确保一致性。用户观察到每张卡都生成了模型文件，实际上这些文件内容通常是一致的或者是分片后的结果（取决于具体实现），并非代码 Bug。如果只需要一个文件，通常只需关注 rank 0 生成的文件即可，或者框架内部已处理去重。","https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fissues\u002F69",{"id":157,"question_zh":158,"answer_zh":159,"source_url":160},35673,"使用脚本预训练 Bart 模型并在 run_text2text 微调时报错 AttributeError: 'NoneType' object has no attribute 'contiguous'，如何解决？","该问题与 Llama 模型遇到的特殊 token 问题类似（参见 Issue #42），通常是由于数据预处理阶段某些字段为 None 导致的。建议检查数据预处理流程，确保输入数据中没有缺失值，并确认特殊 token 的配置是否与模型要求一致。如果问题依旧，可尝试参考 UER-py 项目的数据处理方式，该项目中未出现此问题。","https:\u002F\u002Fgithub.com\u002FTencent\u002FTencentPretrain\u002Fissues\u002F54",[]]