[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-togethercomputer--RedPajama-Data":3,"tool-togethercomputer--RedPajama-Data":64},[4,17,25,39,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":10,"last_commit_at":23,"category_tags":24,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":26,"name":27,"github_repo":28,"description_zh":29,"stars":30,"difficulty_score":10,"last_commit_at":31,"category_tags":32,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[33,34,35,36,14,37,15,13,38],"图像","数据工具","视频","插件","其他","音频",{"id":40,"name":41,"github_repo":42,"description_zh":43,"stars":44,"difficulty_score":45,"last_commit_at":46,"category_tags":47,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[14,33,13,15,37],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":45,"last_commit_at":54,"category_tags":55,"status":16},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74913,"2026-04-05T10:44:17",[15,33,13,37],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":45,"last_commit_at":62,"category_tags":63,"status":16},2181,"OpenHands","OpenHands\u002FOpenHands","OpenHands 是一个专注于 AI 
驱动开发的开源平台，旨在让智能体（Agent）像人类开发者一样理解、编写和调试代码。它解决了传统编程中重复性劳动多、环境配置复杂以及人机协作效率低等痛点，通过自动化流程显著提升开发速度。\n\n无论是希望提升编码效率的软件工程师、探索智能体技术的研究人员，还是需要快速原型验证的技术团队，都能从中受益。OpenHands 提供了灵活多样的使用方式：既可以通过命令行（CLI）或本地图形界面在个人电脑上轻松上手，体验类似 Devin 的流畅交互；也能利用其强大的 Python SDK 自定义智能体逻辑，甚至在云端大规模部署上千个智能体并行工作。\n\n其核心技术亮点在于模块化的软件智能体 SDK，这不仅构成了平台的引擎，还支持高度可组合的开发模式。此外，OpenHands 在 SWE-bench 基准测试中取得了 77.6% 的优异成绩，证明了其解决真实世界软件工程问题的能力。平台还具备完善的企业级功能，支持与 Slack、Jira 等工具集成，并提供细粒度的权限管理，适合从个人开发者到大型企业的各类用户场景。",70612,"2026-04-05T11:12:22",[15,14,13,36],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":79,"owner_twitter":79,"owner_website":79,"owner_url":80,"languages":81,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":98,"env_os":99,"env_gpu":100,"env_ram":101,"env_deps":102,"category_tags":112,"github_topics":79,"view_count":10,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":113,"updated_at":114,"faqs":115,"releases":145},2143,"togethercomputer\u002FRedPajama-Data","RedPajama-Data","The RedPajama-Data repository contains code for preparing large datasets for training large language models.","RedPajama-Data 是一个专为训练大型语言模型打造的开源数据准备工具集，其核心成果是发布了包含 30 万亿 token 的 RedPajama-V2 数据集。它主要解决了大模型训练中高质量、多语言语料获取难及数据冗余的问题，通过自动化流水线从 84 个 CommonCrawl 快照中清洗出超过 1000 亿份文档。\n\n该工具特别适合 AI 研究人员和开发者使用，尤其是那些希望复现顶级模型训练数据流程或需要大规模多语言语料的团队。其技术亮点在于集成了完整的 CCNet 处理管道，不仅支持英语、德语、法语、西班牙语和意大利语五种语言，还引入了先进的质量信号评估机制与严格的去重算法。数据显示，仅经过筛选和去重的核心部分就包含约 208 亿份文档，其中英文语料高达 20.5 万亿 token。此外，RedPajama-Data 提供了基于 Docker 的标准化部署方案，涵盖从构建质量分类器、计算重要性权重到执行最终去重的全流程脚本，让用户能够透明、可控地构建属于自己的高性能训练数据集。","# RedPajama-Data-v2: an Open Dataset with 30 Trillion Tokens for Training Large Language Models\n\n\u003Cimg width=\"500\" 
src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftogethercomputer_RedPajama-Data_readme_1caa2a2ab77e.png\" \u002F>\n\nThis repository contains the code for the RedPajama-V2 dataset. For more information on the dataset, check out our\n[blog post](https:\u002F\u002Ftogether.ai\u002Fblog\u002Fredpajama-data-v2). The dataset is also available on\n[HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftogethercomputer\u002FRedPajama-Data-V2). For the code used for the\nRedPajama-1T dataset, please refer to the `rp_v1` branch in this repo.\n\n## Dataset\n\nRedPajama-V2 is an open dataset for training large language models. The dataset includes over 100B text\ndocuments coming from 84 CommonCrawl snapshots and processed using\nthe [CCNet](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcc_net) pipeline. Out of these, there are 30B documents in the corpus\nthat additionally come with quality signals, and 20B documents that are deduplicated.\n\n### Document and Token Counts for the Annotated and deduplicated `head_middle` part of the dataset\n\nThe number of documents and tokens for the annotated and deduplicated `head_middle` part of the dataset is shown in the\ntable below.\n\n|       | # Documents | Estimated Token count (deduped) |\n|-------|-------------|---------------------------------|\n| en    | 14.5B       | 20.5T                           |\n| de    | 1.9B        | 3.0T                            |\n| fr    | 1.6B        | 2.7T                            |  \n| es    | 1.8B        | 2.8T                            |\n| it    | 0.9B        | 1.5T                            |\n| Total | 20.8B       | 30.4T                           |\n\n### Languages\n\nEnglish, German, French, Italian, Spanish\n\n## Setup\n\n### Configuration\n\nCopy the file `configs\u002Frp_v2.0.conf` to e.g. 
`configs\u002Fdefault.conf` and configure the environment variables.\nThese will be used throughout the pipeline.\n\n### Build Docker image\n\nTo run with docker, build the docker image using\n\n```bash\n. configs\u002Fdefault.conf\ncd app\ndocker build -t \"${DOCKER_REPO}:\" .\n\n```\n\nAlso, make sure you have `s5cmd` installed and your S3 profile configured so that you can pull data from an S3 bucket.\n\nYou can run the steps of the pipeline without any containerized environment. However, the running scripts assume you\nhave Docker and Apptainer installed.\n\n## Running the Pipeline\n\nThe pipeline is composed of three steps, namely 1) preparing artifacts, 2) computing quality signals, and 3)\ndeduplication.\n\n**Important:** In case you are not running steps (1) and (2) with the provided scripts (i.e., docker containers built with the provided Dockerfile), make sure to set the `PYTHONHASHSEED` environment variable to a consistent value (e.g., 42) using\n```bash\nexport PYTHONHASHSEED=42\n```\nThis is to ensure consistency of hash functions used in the computation of DSIR weights.\n\n### 1. Create Artifacts\n\nThis part of the pipeline creates the artifacts that are used in the subsequent steps. This includes building quality\nclassifiers, training bag-of-ngram generative models for importance weight computation, fetching the list of bad words\nfrom the [LDNOOBW repo](https:\u002F\u002Fgithub.com\u002FLDNOOBW\u002FList-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words), and fetching\nthe most recent list of blacklisted URLs from the [UT1 blacklist](https:\u002F\u002Fdsi.ut-capitole.fr\u002Fblacklists\u002F).\n\nAs a first step, download the English Wikipedia reference classifier\nfrom [here](https:\u002F\u002Fdata.together.xyz\u002Fredpajama-data-v2\u002Fv1.0.0\u002Fartifacts\u002Fwikiref.model.bin) and place it\nin `${DATA_ROOT}\u002Fwikiref-model\u002Fen\u002Fen-model.bin`. 
This is the same fasttext classifier that was used in RedPajama-V1.\n\nTo create the remaining artifacts, make sure that the environment variables are set in the config file. Then, from\nthe root directory of the repository, run\n\n```bash\nbash scripts\u002Frun_prep_artifacts.sh \\\n  --config configs\u002Frp_v2.0.conf \\\n  --listings \u002Fpath\u002Fto\u002Flistings\u002Ffile.txt \\\n  --max_workers 32\n```\n\nwhere `\u002Fpath\u002Fto\u002Flistings\u002Ffile.txt` is a file that contains the keys to the ccnet data that you want to process\n(e.g., `2023-06\u002F0000\u002Fen_head.json.gz`).\n\nYou can set the `max_workers` flag to the number of parallel processes you want to use.\n\nThis step will generate an ID which you can store in the environment variable `ARTIFACTS_ID` for the next step.\n\n### 2. Compute Quality Signals\n\nThe second step of the pipeline computes the quality signals, including the minhash signatures to run fuzzy deduplication\nin the subsequent step. To run this step, make sure the environment variables are set in the config file. Then, from\nthe root directory of the repository, run\n\n```bash\nbash scripts\u002Fapptainer_run_quality_signals.sh \\\n  --config configs\u002Frp_v2.0.conf \\\n  --dump_id \"2022-49\" \\\n  --input_base_uri \"file:\u002F\u002F\u002Fpath\u002Fto\u002Fdata\u002Froot\" \\\n  --output_base_uri \"file:\u002F\u002F\u002Fpath\u002Fto\u002Foutput\u002Fdata\u002Froot\" \\\n  --max_docs -1\n```\n\n### 3. Deduplication\n\nThe third component of the pipeline consists of deduplication steps. Here we provide code to run exact and fuzzy\ndeduplication.\n\n#### Exact Deduplication using a Bloomfilter\n\nContent-based deduplication is implemented in `app\u002Fsrc\u002Fbloomfilter.py`. It can be run independently of the\nprevious step, but the data needs to be stored in an S3 bucket. 
For this step, from the `app` directory, run:\n\n```bash\npython3 app\u002Fsrc\u002Fbloomfilter.py \\\n  --listings \u002Fpath\u002Fto\u002Flistings\u002Ffile.txt \\\n  --input_base_uri \"s3:\u002F\u002Fpath\u002Fto\u002Fccnet\u002Fdata\" \\\n  --output_dir \"\u002Fpath\u002Fto\u002Foutput\" \\\n  --s3_profile \"...\" \\\n  --endpoint_url \"...\" \\\n  --parallel_readers 32 \\\n  --batch_size 10 \\\n  --capacity \"...\" \\\n  --error_rate \"...\"\n```\n\nIt is important to choose a capacity larger than the number of documents, since otherwise the `error_rate` is not\nguaranteed and more false positives will occur. The implementation is based on the\n[pybloomfiltermmap3](https:\u002F\u002Fgithub.com\u002Fprashnts\u002Fpybloomfiltermmap3) library.\n\n#### Fuzzy Deduplication with Locality Sensitive Hashing\n\nIn the third step of the pipeline, we run locality sensitive hashing on the minhash signatures generated in the\nquality signals step. To run this step, make sure that you use the same configuration as in the quality signals step. Then, from\nthe root directory of the repository, run\n\n```bash\nbash scripts\u002Fapptainer_run_lsh.sh \\\n  --config configs\u002Frp_v2.0.conf \\\n  --dump_id \"2022-49\" \\\n  --input_base_uri \"file:\u002F\u002F\u002Fpath\u002Fto\u002Fdata\u002Froot\" \\\n  --output_dir \"\u002Fpath\u002Fto\u002Foutput\" \\\n  --similarity \"\u003Csimilarity_threshold>\" \\\n  --listings \"\u002Fminhash\u002Flistings\u002Ffile.txt\" \\\n  --max_docs -1\n```\n\nThe implementation is based on Polars and was tested with 200M documents on a 64-core machine with 500G of RAM.\n\n## Summary of Quality Signals\n\nThe second step of this pipeline computes the following set of quality signals. 
We hope to grow this list further over\ntime as more signals are developed.\n\n#### Quality Annotations\n\n| Annotation Tag                                 | Description                                                                                                                                                                                                                                                                                                                                                                                                          | Category         | Reference                                                                                                                     |\n|------------------------------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|------------------|-------------------------------------------------------------------------------------------------------------------------------|\n| ccnet_bucket                                   | head, middle or tail bucket of the perplexity score                                                                                                                                                                                                                                                                                                                                                                  | CCNet            | [CCNet](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcc_net)                                                                           |\n| ccnet_language_score                           | 
score of the language identification model                                                                                                                                                                                                                                                                                                                                                                           | CCNet            | [CCNet](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcc_net)                                                                           |\n| ccnet_length                                   | number of characters                                                                                                                                                                                                                                                                                                                                                                                                 | CCNet            | [CCNet](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcc_net)                                                                           |\n| ccnet_nlines                                   | number of lines                                                                                                                                                                                                                                                                                                                                                                                                      | CCNet            | [CCNet](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcc_net)                                                                           |\n| ccnet_original_length                          | number of characters before in-document line deduplication                                                                
                                                                                                                                                                                                                                                                                           | CCNet            | [CCNet](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcc_net)                                                                           |\n| ccnet_original_nlines                          | number of lines before in-document line deduplication                                                                                                                                                                                                                                                                                                                                                                | CCNet            | [CCNet](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcc_net)                                                                           |\n| ccnet_perplexity                               | perplexity of an LM trained on Wikipedia                                                                                                                                                                                                                                                                                                                                                                             | CCNet            | [CCNet](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcc_net)                                                                           |\n| rps_doc_books_importance                       | Given a bag of {1,2}-wordgram model trained on Books p, and a model trained on the source domain q, This is the logarithm of the ratio p(doc)\u002Fq(doc).                                                                                          
                                                                                                                                                                       | ML Heuristics    | [Importance Resampling (Xie et al.)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.03169)                                                        |\n| rps_doc_openwebtext_importance                 | Given a bag of {1,2}-wordgram model trained on OpenWebText p, and a model trained on the source domain q, this is the logarithm of the ratio p(doc)\u002Fq(doc).                                                                                                                                                                                                                                                          | ML Heuristics    | [Importance Resampling (Xie et al.)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.03169)                                                        |\n| rps_doc_wikipedia_importance                   | Given a bag of {1,2}-wordgram model trained on Wikipedia articles p, and a model trained on the source domain q, this is the logarithm of the ratio p(doc)\u002Fq(doc).                                                                                                                                                                                                                                                   | ML Heuristics    | [Importance Resampling (Xie et al.)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.03169)                                                        |\n| rps_doc_ml_wikiref_score                       | Fasttext classifier prediction for the document being a Wikipedia reference. This is the same fasttext model used in the RedPajama-1T dataset. Only applies to English data.                                                                                                                                                                                   
                                                      | ML Heuristics    | [LLaMA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.13971), [RedPajama-1T](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftogethercomputer\u002FRedPajama-Data-1T) |\n| rps_doc_ml_palm_score                          | Fasttext classifier prediction for the document being a Wikipedia article, OpenWebText sample or a RedPajama-V1 book. Only for English data.                                                                                                                                                                                                                                                                         | ML Heuristics    | [PALM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.02311), [GLaM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.06905)                                            |\n| rps_doc_ml_wikipedia_score                     | Fasttext classifier prediction for the document being a Wikipedia article. This is used for non-English data                                                                                                                                                                                                                                                                                                         | ML Heuristics    | -                                                                                                                             |\n| rps_doc_curly_bracket                          | The ratio between the number of occurrences of '{' or '}' and the number of characters in the raw text.                                                                                                                                                                                                                                                                                                              
| Natural Language | [C4](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683)                                                                                        |\n| rps_doc_frac_all_caps_words                    | The fraction of words in the content that only consist of uppercase letters. This is based on the raw content.                                                                                                                                                                                                                                                                                                       | Natural Language | [Pretrainer’s Guide](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13169)                                                                        |\n| rps_doc_frac_lines_end_with_ellipsis           | The fraction of lines that end with an ellipsis, where an ellipsis is defined as either \"...\" or \"…\".                                                                                                                                                                                                                                                                                                                | Natural Language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446)                                    |\n| rps_doc_frac_no_alph_words                     | The fraction of words that contain no alphabetical character.                                                                                                                                                                                                                                                                                                                                                        
| Natural Language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446)                                    |\n| rps_doc_lorem_ipsum                            | The ratio between the number of occurrences of 'lorem ipsum' and the number of characters in the content after normalisation.                                                                                                                                                                                                                                                                                        | Natural Language | [C4](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683)                                                                                        |\n| rps_doc_mean_word_length                       | The mean length of words in the content after normalisation.                                                                                                                                                                                                                                                                                                                                                         | Natural Language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446)                                    |\n| rps_doc_stop_word_fraction                     | The ratio between the number of stop words and the number of words in the document. Stop words are obtained from the [stopwords-json](https:\u002F\u002Fgithub.com\u002F6\u002Fstopwords-json) repo.                                                                                                                                                                                                                                     
| Natural Language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446)                                    |\n| rps_doc_symbol_to_word_ratio                   | The ratio of symbols to words in the content.. Symbols are defined \"#\", \"...\", and \"…\".                                                                                                                                                                                                                                                                                                                              | Natural Language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446)                                    |\n| rps_doc_frac_unique_words                      | The fraction of unique words in the content. This is also known as the degeneracy of a text sample. Calculated based on the normalised content.                                                                                                                                                                                                                                                                      | Natural Language | [Pretrainer’s Guide](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13169)                                                                        |\n| rps_doc_unigram_entropy                        | The entropy of the unigram distribution of the content. This measures the diversity of the content and is computed using sum(-x \u002F total * log(x \u002F total)) where the sum is taken over counts of unique words in the normalised content.                                                                                                                                                                              
| Natural Language | -                                                                                                                             |\n| rps_doc_word_count                             | The number of words in the content after normalisation.                                                                                                                                                                                                                                                                                                                                                              | Natural Language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446)                                    |\n| rps_lines_ending_with_terminal_punctution_mark | Indicates whether a line ends with a terminal punctuation mark. A terminal punctation mark is defined as one of: \".\", \"!\", \"?\", \"”\".                                                                                                                                                                                                                                                                                 | Natural Language | [C4](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683)                                                                                        |\n| rps_lines_javascript_counts                    | The number of occurrences of the word \"javascript\" in each line.                                                                                                                                                                                                                                                                                                                                                     
| Natural Language | [C4](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683)                                                                                        |\n| rps_lines_num_words                            | The number of words in each line. This is computed based on the normalised text.                                                                                                                                                                                                                                                                                                                                     | Natural Language | [C4](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683) , [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116)                                       |\n| rps_lines_numerical_chars_fraction             | The ratio between the number of numerical characters and total number of characters in each line. This is based on the normalised content.                                                                                                                                                                                                                                                                           | Natural Language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116)                                                                                |\n| rps_lines_start_with_bulletpoint               | Whether the lines that start with a bullet point symbol. The following set of unicodes are considered a bullet point: \\u2022 (bullet point), \\u2023 (triangular bullet point), \\u25B6 (black right pointing triangle), \\u25C0 (black left pointing triangle), \\u25E6 (white bullet point), \\u25A0 (black square), \\u25A1 (white square), \\u25AA (black small square), \\u25AB (white small square), \\u2013 (en dash). 
| Natural Language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446)                                    |\n| rps_lines_uppercase_letter_fraction            | The ratio between the number of uppercase letters and total number of characters in each line. This is based on the raw text.                                                                                                                                                                                                                                                                                        | Natural Language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116)                                                                                |\n| rps_doc_num_sentences                          | The number of sentences in the content. This is calculated using the regular expression `r'\\b[^.!?]+[.!?]*'`.                                                                                                                                                                                                                                                                                                        | Natural Language | [C4](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683)                                                                                        |\n| rps_doc_frac_chars_dupe_10grams                | The fraction of characters in duplicate word 10grams. This operates on the lower-cased, punctuation removed content. It is also ensured that characters in overlapping ngrams are only counted once.                                                                                                                                                                                                                 
| Repetitiveness   | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446)                                    |\n| rps_doc_frac_chars_dupe_5grams                 | The fraction of characters in duplicate word 5grams.                                                                                                                                                                                                                                                                                                                                                                 | Repetitiveness   | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446)                                    |\n| rps_doc_frac_chars_dupe_6grams                 | The fraction of characters in duplicate word 6grams.                                                                                                                                                                                                                                                                                                                                                                 | Repetitiveness   | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446)                                    |\n| rps_doc_frac_chars_dupe_7grams                 | The fraction of characters in duplicate word 7grams.                                                                                                                                                                                                                                                                                                                                                                 
| Repetitiveness   | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446)                                    |\n| rps_doc_frac_chars_dupe_8grams                 | The fraction of characters in duplicate word 8grams.                                                                                                                                                                                                                                                                                                                                                                 | Repetitiveness   | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446)                                    |\n| rps_doc_frac_chars_dupe_9grams                 | The fraction of characters in duplicate word 9grams.                                                                                                                                                                                                                                                                                                                                                                 | Repetitiveness   | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446)                                    |\n| rps_doc_frac_chars_top_2gram                   | The fraction of characters in the top word 2gram.                                                                                                                                                                                                                                                                                                                                                                    
| Repetitiveness   | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446)                                    |\n| rps_doc_frac_chars_top_3gram                   | The fraction of characters in the top word 3gram.                                                                                                                                                                                                                                                                                                                                                                    | Repetitiveness   | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446)                                    |\n| rps_doc_frac_chars_top_4gram                   | The fraction of characters in the top word 4gram.                                                                                                                                                                                                                                                                                                                                                                    | Repetitiveness   | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446)                                    |\n| rps_doc_ldnoobw_words                          | The number of sequences of words that are contained in the List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words blocklist. The blocklist is obtained from the [LDNOOBW](https:\u002F\u002Fgithub.com\u002FLDNOOBW\u002FList-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) repo.                                                                                                                                                     
| Toxicity         | [C4](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683)                                                                                        |\n| rps_doc_ut1_blacklist                          | A categorical id corresponding to the list of categories of the domain of the document. Categories are obtained from the UT1 blacklist. The list is obtained from [UT-Capitole](https:\u002F\u002Fdsi.ut-capitole.fr\u002Fblacklists\u002F).                                                                                                                                                                                             | Toxicity         | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116)                                                                                |\n| minhash_signature_0.7                          | Banded minhash signature of the document, for fuzzy deduplication at Jaccard similarity 0.7. The signature is based on 128 hash functions and grouped into 14 bands and 9 rows for LSH.                                                                                                                                                                                                                              | Deduplication    | -                                                                                                                             |\n| minhash_signature_0.8                          | Banded minhash signature of the document, for fuzzy deduplication at Jaccard similarity 0.8. The signature is based on 128 hash functions and grouped into 9 bands and 13 rows for LSH.                                                                                                                                                                                                                              | Deduplication    | -                                                                                                                             |\n| minhash_signature_0.9                          | Banded minhash signature of the document, for fuzzy deduplication at Jaccard similarity 0.9. 
The signature is based on 128 hash functions and grouped into 5 bands and 25 rows for LSH.                                                                                                                                                                                                                              | Deduplication    | -                                                                                                                             |\n| minhash_signature_1.0                          | Banded minhash signature of the document, for fuzzy deduplication at Jaccard similarity 1.0. The signature is based on 128 hash functions and grouped into 1 band and 128 rows for LSH.                                                                                                                                                                                                                              | Deduplication    | -                                                                                                                             |\n\n## Acknowledgements\n\nWe are grateful to the many partners and collaborators who together are pushing forward the frontier of open\nLLMs.\n\n- Thank you to the OLMo team at AI2 and friends at OpenGPT-X for the insightful discussions about datasets and data\n  quality! 
Thanks also to everyone who builds on the RedPajama dataset, including Cerebras for their SlimPajama efforts, and\n  to the open-source AI community for the more than 500 models built on RedPajama to date.\n- We are grateful to the great team at EleutherAI for paving the path on open training datasets with The Pile and for\n  open-sourcing code we use in training some of the RedPajama models.\n- Thank you to our partners of RedPajama-v1, including Ontocord.ai, MILA Québec AI Institute, ETH DS3Lab, Université de\n  Montréal, Stanford Center for Research on Foundation Models (CRFM), Stanford Hazy Research group and LAION.\n\n## License\n\n```\nCopyright 2023 Together Computer\n\nLicensed under the Apache License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n   http:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n```\n\nFor full terms, see the LICENSE file. If you have any questions, comments, or concerns about licensing,\nplease [contact us](https:\u002F\u002Fwww.together.ai\u002Fcontact).\n\nFor the dataset itself, please refer to\nthe [Common Crawl Foundation Terms of Use](https:\u002F\u002Fcommoncrawl.org\u002Fterms-of-use).\n\nTo cite RedPajama, please use:\n\n```\n@article{weber2024redpajama,\n\ttitle   = {RedPajama: an Open Dataset for Training Large Language Models},\n\tauthor  = {Maurice Weber and Daniel Y. 
Fu and Quentin Anthony and Yonatan Oren and Shane Adams and Anton Alexandrov and Xiaozhong Lyu and Huu Nguyen and Xiaozhe Yao and Virginia Adams and Ben Athiwaratkun and Rahul Chalamala and Kezhen Chen and Max Ryabinin and Tri Dao and Percy Liang and Christopher Ré and Irina Rish and Ce Zhang},\n\tjournal = {NeurIPS Datasets and Benchmarks Track},\n\tyear    = 2024,\n}\n```\n\n","# RedPajama-Data-v2: An Open Dataset with 30 Trillion Tokens for Training Large Language Models\n\n\u003Cimg width=\"500\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftogethercomputer_RedPajama-Data_readme_1caa2a2ab77e.png\" \u002F>\n\nThis repository contains the code for the RedPajama-V2 dataset. For more information on the dataset, see our\n[blog post](https:\u002F\u002Ftogether.ai\u002Fblog\u002Fredpajama-data-v2). The dataset is also available on\n[HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftogethercomputer\u002FRedPajama-Data-V2). For the code used for the RedPajama-1T dataset, please refer to the `rp_v1` branch in this repository.\n\n## Dataset\n\nRedPajama-V2 is an open dataset for training large language models. The dataset includes over 100B text documents from 84 CommonCrawl snapshots, processed using the\n[CCNet](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcc_net) pipeline. Of these, 30B documents in the corpus come with quality signals, and 20B documents are deduplicated.\n\n### Document and Token Counts for the Annotated and Deduplicated \"head_middle\" Part of the Dataset\n\nThe table below shows the document and token counts for the annotated and deduplicated \"head_middle\" part of the dataset.\n\n|         | # Documents | Estimated Token Count (deduplicated) |\n|---------|-------------|--------------------------------------|\n| English | 14.5B       | 20.5T                                |\n| German  | 1.9B        | 3.0T                                 |\n| French  | 1.6B        | 2.7T                                 |\n| Spanish | 1.8B        | 2.8T                                 |\n| Italian | 0.9B        | 1.5T                                 |\n| Total   | 20.8B       | 30.4T                                |\n\n### Languages\n\nEnglish, German, French, Italian, Spanish\n\n## Setup\n\n### Configuration\n\nCopy the file `configs\u002Frp_v2.0.conf` to, for example, `configs\u002Fdefault.conf` and configure the environment variables. These variables are used throughout the pipeline.\n\n### Build the Docker Image\n\nTo run with Docker, build the Docker image using the following command:\n\n```bash\n. 
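configs\u002Fdefault.conf\ncd app\ndocker build -t \"${DOCKER_REPO}:\" .\n\n```\n\nAlso, make sure you have `s5cmd` installed and your S3 access configured so that you can pull data from an S3 bucket.\n\nYou can also run the steps of the pipeline without a containerized environment. However, the run scripts assume that you have Docker and Apptainer installed.\n\n## Running the Pipeline\n\nThe pipeline is composed of three steps, namely 1) preparing artifacts, 2) computing quality signals, and 3) deduplication.\n\n**Important:** If you are not running steps (1) and (2) with the provided scripts (i.e., with the Docker container built from the provided Dockerfile), make sure to set the `PYTHONHASHSEED` environment variable to a fixed value (e.g., 42) using the following command:\n\n```bash\nexport PYTHONHASHSEED=42\n```\n\nThis ensures that the hash functions used in the computation of the DSIR weights are consistent.\n\n### 1. Create Artifacts\n\nThis part of the pipeline creates the artifacts that are used in the subsequent steps. This includes building quality classifiers, training n-gram generative models for importance weight computation, fetching the list of bad words from the [LDNOOBW repo](https:\u002F\u002Fgithub.com\u002FLDNOOBW\u002FList-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words), and fetching the most recent list of blacklisted URLs from the [UT1 blacklist](https:\u002F\u002Fdsi.ut-capitole.fr\u002Fblacklists\u002F).\n\nAs a first step, download the English Wikipedia reference classifier from [here](https:\u002F\u002Fdata.together.xyz\u002Fredpajama-data-v2\u002Fv1.0.0\u002Fartifacts\u002Fwikiref.model.bin) and place it in `${DATA_ROOT}\u002Fwikiref-model\u002Fen\u002Fen-model.bin`. This is the same FastText classifier that was used in RedPajama-V1.\n\nTo create the remaining artifacts, make sure that the environment variables are set in the config file. Then, from the root directory of the repository, run:\n\n```bash\nbash scripts\u002Frun_prep_artifacts.sh \\\n  --config configs\u002Frp_v2.0.conf \\\n  --listings \u002Fpath\u002Fto\u002Flistings\u002Ffile.txt \\\n  --max_workers 32\n```\n\nHere, `\u002Fpath\u002Fto\u002Flistings\u002Ffile.txt` is a file that contains the keys to the ccnet data that you want to process (e.g., `2023-06\u002F0000\u002Fen_head.json.gz`).\n\nYou can set the `max_workers` flag to the number of parallel processes you want to use.\n\nThis step generates an ID, which you can store in the environment variable `ARTIFACTS_ID` for use in the next step.\n\n### 2. Compute Quality Signals\n\nThe second step of the pipeline computes the quality signals, including the minhash signatures used for fuzzy deduplication in the subsequent step. To run this step, make sure the environment variables are set in the config file. Then, from the root directory of the repository, run:\n\n```bash\nbash scripts\u002Fapptainer_run_quality_signals.sh \\\n  --config configs\u002Frp_v2.0.conf \\\n  --dump_id \"2022-49\" \\\n  --input_base_uri \"file:\u002F\u002F\u002Fpath\u002Fto\u002Fdata\u002Froot\" \\\n  --output_base_uri \"file:\u002F\u002F\u002Fpath\u002Fto\u002Foutput\u002Fdata\u002Froot\" \\\n  --max_docs -1\n```\n\n### 3. 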
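Deduplication\n\nThe third part of the pipeline consists of the deduplication steps. Here we provide code to run exact and fuzzy deduplication.\n\n#### Exact Deduplication Using a Bloom Filter\n\nContent-based deduplication is implemented in `app\u002Fsrc\u002Fbloomfilter.py`. It can be run independently of the previous step, but the data needs to be stored in an S3 bucket. For this step, from the `app` directory, run:\n\n```bash\npython3 app\u002Fsrc\u002Fbloomfilter.py \\\n  --listings \u002Fpath\u002Fto\u002Flistings\u002Ffile.txt \\\n  --input_base_uri \"s3:\u002F\u002Fpath\u002Fto\u002Fccnet\u002Fdata\" \\\n  --output_dir \"\u002Fpath\u002Fto\u002Foutput\" \\\n  --s3_profile \"...\" \\\n  --endpoint_url \"...\" \\\n  --parallel_readers 32 \\\n  --batch_size 10 \\\n  --capacity \"...\" \\\n  --error_rate \"...\"\n```\n\nIt is important to choose the correct capacity (i.e., greater than the number of documents); otherwise the error rate is not guaranteed and more false positives may appear. The implementation is based on the\n[pybloomfiltermmap3](https:\u002F\u002Fgithub.com\u002Fprashnts\u002Fpybloomfiltermmap3) library.\n\n#### Fuzzy Deduplication with Locality Sensitive Hashing\n\nIn the third step of the pipeline, we run locality sensitive hashing on the minhash signatures generated in the quality signals step (step 2). To run this step, make sure that you use the same configuration as in the quality signals step. Then, from the root directory of the repository, run:\n\n```bash\nbash scripts\u002Fapptainer_run_lsh.sh \\\n  --config configs\u002Frp_v2.0.conf \\\n  --dump_id \"2022-49\" \\\n  --input_base_uri \"file:\u002F\u002F\u002Fpath\u002Fto\u002Fdata\u002Froot\" \\\n  --output_dir \"\u002Fpath\u002Fto\u002Foutput\" \\\n  --similarity \"\u003Csimilarity_threshold>\" \\\n  --listings \"\u002Fminhash\u002Flistings\u002Ffile.txt\" \\\n  --max_docs -1\n```\n\nThe implementation is based on Polars and was tested with 200M documents on a machine with 64 CPU cores and 500GB of RAM.\n\n## Summary of Quality Signals\n\nThe second step of this pipeline computes the following set of quality signals. We hope to grow this list further over time as more signals are developed.\n\n#### Quality Annotations\n\n| Annotation Tag                                 | Description | Category         | Reference 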
|\n|------------------------------------------------|-------------|------------------|-----------|\n| ccnet_bucket                                   | The head, middle, or tail bucket of the perplexity score. | CCNet            | [CCNet](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcc_net) |\n| ccnet_language_score                           | The score of the language identification model. | CCNet            | [CCNet](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcc_net) |\n| ccnet_length                                   | The number of characters. 
| CCNet            | [CCNet](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcc_net) |\n| ccnet_nlines                                   | The number of lines. | CCNet            | [CCNet](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcc_net) |\n| ccnet_original_length                          | The number of characters before in-document deduplication. | CCNet            | [CCNet](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcc_net) |\n| ccnet_original_nlines                          | The number of lines before in-document deduplication. 
| CCNet            | [CCNet](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcc_net) |\n| ccnet_perplexity                               | The perplexity of a language model trained on Wikipedia. | CCNet            | [CCNet](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fcc_net) |\n| rps_doc_books_importance                       | Given a {1,2}-word n-gram model trained on Books p, and a model trained on the source domain q, this is the logarithm of the ratio p(doc)\u002Fq(doc). | ML Heuristics    | [Importance Resampling (Xie et al.)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.03169) |\n| rps_doc_openwebtext_importance                 | Given a {1,2}-word n-gram model trained on OpenWebText p, and a model trained on the source domain q, this is the logarithm of the ratio p(doc)\u002Fq(doc). | ML Heuristics    | [Importance Resampling (Xie et al.)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.03169) |\n| rps_doc_wikipedia_importance                   | Given a {1,2}-word n-gram model trained on Wikipedia articles p, and a model trained on the source domain q, this is the logarithm of the ratio p(doc)\u002Fq(doc). 
| ML Heuristics    | [Importance Resampling (Xie et al.)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.03169) |\n| rps_doc_ml_wikiref_score                       | FastText classifier prediction for the document being a Wikipedia reference. This is the same FastText model used in the RedPajama-1T dataset. Only applies to English data. | ML Heuristics    | [LLaMA](https:\u002F\u002Farxiv.org\u002Fabs\u002F2302.13971), [RedPajama-1T](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Ftogethercomputer\u002FRedPajama-Data-1T) |\n| rps_doc_ml_palm_score                          | FastText classifier prediction for the document being a Wikipedia article, an OpenWebText sample, or a RedPajama-V1 book. Only applies to English data. | ML Heuristics    | [PALM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.02311), [GLaM](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.06905) |\n| rps_doc_ml_wikipedia_score                     | FastText classifier prediction for the document being a Wikipedia article. This is used for non-English data. | ML Heuristics    | - 
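|\n| rps_doc_curly_bracket                          | The ratio between the number of occurrences of \"{\" or \"}\" and the number of characters in the raw text. | Natural Language | [C4](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683) |\n| rps_doc_frac_all_caps_words                    | The fraction of words in the content that consist only of uppercase letters. This is based on the raw content. | Natural Language | [Pretrainer's Guide](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13169) |\n| rps_doc_frac_lines_end_with_ellipsis           | The fraction of lines that end with an ellipsis, where an ellipsis is defined as either \"...\" or \"…\". | Natural Language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446) |\n| rps_doc_frac_no_alph_words                     | The fraction of words that contain no alphabetical character. 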
| Natural Language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446) |\n| rps_doc_lorem_ipsum                            | The ratio between the number of occurrences of \"lorem ipsum\" and the number of characters in the content after normalisation. | Natural Language | [C4](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683) |\n| rps_doc_mean_word_length                       | The mean length of words in the content after normalisation. | Natural Language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446) |\n| rps_doc_stop_word_fraction                     | The ratio between the number of stop words and the number of words in the document. Stop words are obtained from the [stopwords-json](https:\u002F\u002Fgithub.com\u002F6\u002Fstopwords-json) repo. | Natural Language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446) |\n| rps_doc_symbol_to_word_ratio                   | The ratio of symbols to words in the content. Symbols are defined as \"#\", \"...\", and \"…\". 
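| Natural Language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446) |\n| rps_doc_frac_unique_words                      | The fraction of unique words in the content. This is also known as the degeneracy of a text sample. Calculated based on the normalised content. | Natural Language | [Pretrainer's Guide](https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.13169) |\n| rps_doc_unigram_entropy                        | The entropy of the unigram distribution of the content. This measures the diversity of the content and is computed as sum(-x \u002F total * log(x \u002F total)), where the sum is taken over the counts of unique words in the normalised content. | Natural Language | - |\n| rps_doc_word_count                             | The number of words in the content after normalisation. | Natural Language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446) 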
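|\n| rps_lines_ending_with_terminal_punctution_mark | Indicates whether a line ends with a terminal punctuation mark. A terminal punctuation mark is defined as one of: \".\", \"!\", \"?\", \"”\". | Natural Language | [C4](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683) |\n| rps_lines_javascript_counts                    | The number of occurrences of the word \"javascript\" in each line. | Natural Language | [C4](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683) |\n| rps_lines_num_words                            | The number of words in each line. This is computed based on the normalised text. | Natural Language | [C4](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683), [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116) |\n| rps_lines_numerical_chars_fraction             | The ratio between the number of numerical characters and the total number of characters in each line. This is based on the normalised content. 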
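 | Natural language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116) |\n| rps_lines_start_with_bulletpoint | Whether the line starts with a bullet point symbol. The following Unicode code points are treated as bullet points: \\u2022 (bullet point), \\u2023 (triangular bullet point), \\u25B6 (black right-pointing triangle), \\u25C0 (black left-pointing triangle), \\u25E6 (white bullet point), \\u25A0 (black square), \\u25A1 (white square), \\u25AA (black small square), \\u25AB (white small square), \\u2013 (en dash). | Natural language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446) |\n| rps_lines_uppercase_letter_fraction | The ratio of uppercase letters to the total number of characters in each line, computed on the raw text. | Natural language | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116) |\n| rps_doc_num_sentences | The number of sentences in the content, computed using the regular expression `r'\\b[^.!?]+[.!?]*'`. | Natural language | [C4](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683) |\n| rps_doc_frac_chars_dupe_10grams | The fraction of characters in duplicate 10-grams. This operates on the lowercased, punctuation-stripped content, and ensures that characters in overlapping n-grams are counted only once. 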
 | Repetitiveness | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446) |\n| rps_doc_frac_chars_dupe_5grams | The fraction of characters in duplicate 5-grams. | Repetitiveness | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446) |\n| rps_doc_frac_chars_dupe_6grams | The fraction of characters in duplicate 6-grams. | Repetitiveness | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446) |\n| rps_doc_frac_chars_dupe_7grams | The fraction of characters in duplicate 7-grams. | Repetitiveness | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446) |\n| 
rps_doc_frac_chars_dupe_8grams | The fraction of characters in duplicate 8-grams. | Repetitiveness | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446) |\n| rps_doc_frac_chars_dupe_9grams | The fraction of characters in duplicate 9-grams. | Repetitiveness | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446) |\n| rps_doc_frac_chars_top_2gram | The fraction of characters in the most frequent 2-gram. | Repetitiveness | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446) |\n| rps_doc_frac_chars_top_3gram | The fraction of characters in the most frequent 3-gram. 
 | Repetitiveness | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446) |\n| rps_doc_frac_chars_top_4gram | The fraction of characters in the most frequent 4-gram. | Repetitiveness | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116), [Gopher](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.11446) |\n| rps_doc_ldnoobw_words | The number of word sequences that appear in the List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words blocklist. The blocklist is taken from the [LDNOOBW](https:\u002F\u002Fgithub.com\u002FLDNOOBW\u002FList-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words) repository. | Toxicity | [C4](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683) |\n| rps_doc_ut1_blacklist | A categorical ID identifying the categories of the document's domain. The categories come from the UT1 blacklist, which is sourced from [UT-Capitole](https:\u002F\u002Fdsi.ut-capitole.fr\u002Fblacklists\u002F). | Toxicity | [RefinedWeb](https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.01116) 
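 |\n| minhash_signature_0.7 | Banded MinHash signature of the document, used for fuzzy deduplication at Jaccard similarity 0.7. The signature is based on 128 hash functions, grouped into 14 bands of 9 rows for LSH. | Deduplication |\n| minhash_signature_0.8 | Banded MinHash signature of the document, used for fuzzy deduplication at Jaccard similarity 0.8. The signature is based on 128 hash functions, grouped into 9 bands of 13 rows for LSH. | Deduplication |\n| minhash_signature_0.9 | Banded MinHash signature of the document, used for fuzzy deduplication at Jaccard similarity 0.9. The signature is based on 128 hash functions, grouped into 5 bands of 25 rows for LSH. | Deduplication |\n| minhash_signature_1.0 | Banded MinHash signature of the document, used for fuzzy deduplication at Jaccard similarity 1.0. The signature is based on 128 hash functions, grouped into 1 band of 128 rows for LSH. | Deduplication |\n\n## Acknowledgements\n\nWe are grateful to the many partners and collaborators whose joint efforts push forward the frontier of open large language models.\n\n- Thank you to the OLMo team at AI2 and our friends at OpenGPT-X for the insightful discussions about datasets and data quality! Thanks also to everyone building on the RedPajama dataset, including Cerebras for their SlimPajama efforts, and to the open-source AI community for the more than 500 models built on RedPajama to date.\n- We are grateful to the great team at EleutherAI for paving the way for open training datasets with The Pile, and for open-sourcing code we use in training some of the RedPajama models.\n- Thank you to our RedPajama-v1 partners, including Ontocord.ai, MILA Québec AI Institute, ETH DS3Lab, Université de Montréal, the Stanford Center for Research on Foundation Models (CRFM), the Stanford Hazy Research group, and LAION.\n\n## License\n\n```\nCopyright © 2023 Together Computer\n\nLicensed under the Apache 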
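License, Version 2.0 (the \"License\");\nyou may not use this file except in compliance with the License.\nYou may obtain a copy of the License at\n\n   http:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0\n\nUnless required by applicable law or agreed to in writing, software\ndistributed under the License is distributed on an \"AS IS\" BASIS,\nWITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.\nSee the License for the specific language governing permissions and\nlimitations under the License.\n```\n\nFor full terms, see the LICENSE file. If you have any questions, comments, or concerns about licensing, please [contact us](https:\u002F\u002Fwww.together.ai\u002Fcontact).\n\nFor the dataset itself, please refer to the [Common Crawl Foundation Terms of Use](https:\u002F\u002Fcommoncrawl.org\u002Fterms-of-use).\n\nTo cite RedPajama, please use:\n\n```\n@article{weber2024redpajama,\n\ttitle   = {RedPajama: an Open Dataset for Training Large Language Models},\n\tauthor  = {Maurice Weber and Daniel Y. Fu and Quentin Anthony and Yonatan Oren and Shane Adams and Anton Alexandrov and Xiaozhong Lyu and Huu Nguyen and Xiaozhe Yao and Virginia Adams and Ben Athiwaratkun and Rahul Chalamala and Kezhen Chen and Max Ryabinin and Tri Dao and Percy Liang and Christopher Ré and Irina Rish and Ce Zhang},\n\tjournal = {NeurIPS Datasets and Benchmarks Track},\n\tyear    = 2024,\n}\n```","# RedPajama-Data-v2 Quick Start Guide\n\nRedPajama-Data-v2 is an open dataset for training large language models. It contains over 100 billion documents from 84 CommonCrawl snapshots, about 30 trillion tokens in total. This guide walks you through setting up the environment and running the data processing pipeline.\n\n## Prerequisites\n\nBefore you begin, make sure your system meets the following requirements:\n\n*   **Operating system**: Linux (Ubuntu recommended)\n*   **Container tooling**: `Docker` and `Apptainer` (Singularity) installed. Some steps can run outside a container, but the official scripts depend on these tools by default.\n*   **Storage tooling**: `s5cmd` installed, with an S3 profile configured (used to pull data from S3 buckets).\n*   **Hardware**: The deduplication step (fuzzy deduplication in particular) is resource-intensive. The reference test environment had 64 CPU cores and 500GB of RAM.\n*   **Network**: Access to HuggingFace, the S3 buckets, and the external blacklist sources (such as LDNOOBW and the UT1 blacklist).\n\n## Installation\n\n### 1. Get the code and configuration\nClone the repository and copy the config file:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Ftogethercomputer\u002FRedPajama-Data.git\ncd RedPajama-Data\ncp configs\u002Frp_v2.0.conf configs\u002Fdefault.conf\n```\n\nEdit `configs\u002Fdefault.conf` and set the required environment variables for your setup (such as `DATA_ROOT` and `S3_PROFILE`).\n\n### 2. Build the Docker image\nSource the configuration and build the Docker image the pipeline runs in:\n\n```bash\n. configs\u002Fdefault.conf\ncd app\ndocker build -t \"${DOCKER_REPO}:\" .\n```\n\n### 3. 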
Download the base model\nDownload the English Wikipedia reference classifier (a fasttext model), which is required to generate the quality signals:\n\n```bash\n# Assumes ${DATA_ROOT} is defined in the config file\nmkdir -p ${DATA_ROOT}\u002Fwikiref-model\u002Fen\nwget https:\u002F\u002Fdata.together.xyz\u002Fredpajama-data-v2\u002Fv1.0.0\u002Fartifacts\u002Fwikiref.model.bin -O ${DATA_ROOT}\u002Fwikiref-model\u002Fen\u002Fen-model.bin\n```\n\n## Basic Usage\n\nThe data processing pipeline has three main steps: creating artifacts, computing quality signals, and deduplication.\n\n**Important**: If you run the first two steps without the provided Docker\u002FApptainer scripts, be sure to set the following environment variable so that hashes are consistent:\n```bash\nexport PYTHONHASHSEED=42\n```\n\n### Step 1: Create Artifacts\nThis step builds the quality classifiers, trains the n-gram models, and fetches the blacklist data.\n\nPrepare a listings file (for example `listings.txt`) containing the keys of the CCNet data you want to process (such as `2023-06\u002F0000\u002Fen_head.json.gz`). Then run:\n\n```bash\nbash scripts\u002Frun_prep_artifacts.sh \\\n  --config configs\u002Frp_v2.0.conf \\\n  --listings \u002Fpath\u002Fto\u002Flistings\u002Ffile.txt \\\n  --max_workers 32\n```\n\nAfter a successful run, note the ID it prints and export it as the environment variable `ARTIFACTS_ID` for use in the next step.\n\n### Step 2: Compute Quality Signals\nThis step computes the document quality signals (such as perplexity and language scores) and generates the MinHash signatures used for fuzzy deduplication.\n\n```bash\nbash scripts\u002Fapptainer_run_quality_signals.sh \\\n  --config configs\u002Frp_v2.0.conf \\\n  --dump_id \"2022-49\" \\\n  --input_base_uri \"file:\u002F\u002F\u002Fpath\u002Fto\u002Fdata\u002Froot\" \\\n  --output_base_uri \"file:\u002F\u002F\u002Fpath\u002Fto\u002Foutput\u002Fdata\u002Froot\" \\\n  --max_docs -1\n```\n*Note: replace the `--dump_id` value with the ID of the snapshot you are actually processing.*\n\n### Step 3: Deduplication\n\n#### Option A: Exact Deduplication\nBloomfilter-based content deduplication. The data must reside in an S3 bucket.\n\n```bash\npython3 app\u002Fsrc\u002Fbloomfilter.py \\\n  --listings \u002Fpath\u002Fto\u002Flistings\u002Ffile.txt \\\n  --input_base_uri \"s3:\u002F\u002Fpath\u002Fto\u002Fccnet\u002Fdata\" \\\n  --output_dir \"\u002Fpath\u002Fto\u002Foutput\" \\\n  --s3_profile \"...\" \\\n  --endpoint_url \"...\" \\\n  --parallel_readers 32 \\\n  --batch_size 10 \\\n  --capacity \"...\" \\\n  --error_rate \"...\"\n```\n*Note: `--capacity` must be larger than the total number of documents; otherwise the error rate cannot be guaranteed.*\n\n#### Option B: Fuzzy Deduplication\nDeduplicates documents by running locality-sensitive hashing (LSH) over the MinHash signatures.\n\n```bash\nbash 
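scripts\u002Fapptainer_run_lsh.sh \\\n  --config configs\u002Frp_v2.0.conf \\\n  --dump_id \"2022-49\" \\\n  --input_base_uri \"file:\u002F\u002F\u002Fpath\u002Fto\u002Fdata\u002Froot\" \\\n  --output_dir \"\u002Fpath\u002Fto\u002Foutput\" \\\n  --similarity \"\u003Csimilarity_threshold>\" \\\n  --listings \"\u002Fminhash\u002Flistings\u002Ffile.txt\" \\\n  --max_docs -1\n```\n\nAfter completing the steps above, you will have a quality-filtered, deduplicated dataset that can be used for training large models.","An AI startup lab is working to train a high-quality, domain-specific large model that supports five languages (English, German, French, Spanish, and Italian), but it has run into serious obstacles at the data preparation stage.\n\n### Without RedPajama-Data\n- **Scale is out of reach**: The team has to crawl and clean huge volumes of raw web pages from CommonCrawl on its own. After months of work it has gathered only a few billion tokens, far short of the trillions needed to train a high-performing model.\n- **Uneven language coverage**: Non-English corpora (such as German and Italian) are scarce and vary widely in quality, so the model performs very unevenly across languages.\n- **Costly deduplication and filtering**: Without a mature deduplication pipeline, the training data contains a large amount of duplicate content and low-quality noise. This wastes compute and makes the model prone to overfitting or to generating harmful content.\n- **Hard to reproduce**: Without uniform quality-signal annotations and standardized processing code, datasets produced by different engineers differ substantially, and experimental results cannot be reproduced reliably.\n\n### With RedPajama-Data\n- **Trillion-token corpus available immediately**: The team uses the already-processed dataset of 30.4 trillion tokens (20.8 billion documents) directly, cutting a data engineering cycle of months down to days and starting model training quickly.\n- **Balanced multilingual ability**: Thanks to the built-in high-quality German, French, Spanish, and Italian subsets, the model performs consistently and strongly on cross-lingual understanding and generation, with no extra effort spent collecting data for less-resourced languages.\n- **Built-in quality and deduplication**: The data has been processed by the CCNet pipeline, annotated with quality signals, and deduplicated. Using it directly improves training efficiency, so the model converges faster and produces safer output.\n- **Standardized, reproducible workflow**: With the official Dockerized pipeline scripts, the team can quickly reproduce the data preprocessing steps and keep experimental baselines consistent, returning the research focus to algorithm work.\n\nBy providing a large-scale, multilingual, rigorously cleaned open data foundation, RedPajama-Data removes the most tedious data engineering bottleneck in large-model training.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftogethercomputer_RedPajama-Data_cd83a8fc.png","togethercomputer","Together","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Ftogethercomputer_92d72b3a.jpg","",null,"https:\u002F\u002Fgithub.com\u002Ftogethercomputer",[82,86,90],{"name":83,"color":84,"percentage":85},"Python","#3572A5",96.2,{"name":87,"color":88,"percentage":89},"Shell","#89e051",3.5,{"name":91,"color":92,"percentage":93},"Dockerfile","#384d54",0.3,4933,372,"2026-04-04T04:08:55","Apache-2.0",5,"Linux","Not specified","Test environment used 500GB (for the fuzzy deduplication step)",{"notes":103,"python":104,"dependencies":105},"1. Running inside the provided Docker or Apptainer container environment is strongly recommended. If you run outside it, set the PYTHONHASHSEED environment variable so that hashes are consistent.\n2. s5cmd must be installed and configured to pull data from S3 buckets.\n3. The fuzzy deduplication step consumed 500GB of RAM when tested on a 64-core machine.\n4. 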
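The Wikipedia reference classifier model file must be downloaded manually.","3.x (no specific version given; Docker\u002FApptainer support required)",[106,107,108,109,110,111],"Docker","Apptainer (Singularity)","s5cmd","pybloomfiltermmap3","polars","fasttext",[15,34],"2026-03-27T02:49:30.150509","2026-04-06T06:44:27.721793",[116,121,126,131,136,141],{"id":117,"question_zh":118,"answer_zh":119,"source_url":120},9875,"What should I do when running `python -m cc_net` fails with `argument -l\u002F--lang_whitelist: invalid Sequence value`?","This is usually caused by an incompatible Python version. The maintainers confirmed that the command runs correctly under Python 3.8, while the error can reproduce on Python 3.9 and later. The fix is to downgrade the Python environment to 3.8 (for example 3.8.x) and rerun the command.","https:\u002F\u002Fgithub.com\u002Ftogethercomputer\u002FRedPajama-Data\u002Fissues\u002F23",{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},9876,"The Parquet files written by `run_lsh.py` have far fewer rows than the input files. What do these outputs represent, and how do I use them for deduplication?","The output files contain the IDs of documents that belong to a cluster. To deduplicate, drop these documents unless the document ID equals the cluster ID (doc ID == cluster ID). Keeping, for each cluster, the one document whose doc ID equals the cluster ID removes the duplicates. If quality signals are available, you can also use them to keep the best-quality document in each cluster instead of relying only on the ID match.","https:\u002F\u002Fgithub.com\u002Ftogethercomputer\u002FRedPajama-Data\u002Fissues\u002F96",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},9877,"How do I build the training corpus needed for a quality classifier in another language (such as Chinese)?","The key is to have enough samples to train the fasttext classifier. For the Wikipedia data, you need to run the `cc_net` pipeline on the `.warc` files to produce `.wet` files. As for the input to `create_corpus.py`, the code actually uses the SentencePiece library directly rather than the tokenizer logic mentioned in some older code. The concrete steps are: 1. Make sure you have a sufficient number of URL samples; 2. Run the cc_net pipeline on the Wikipedia dump to generate the wet files; 3. 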
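Adjust the path configuration in `create_corpus.py` to point to the correct mined data directory (usually `cc_net\u002Fdata\u002Fmined\u002F{CC_DUMP}\u002F*.gz`).","https:\u002F\u002Fgithub.com\u002Ftogethercomputer\u002FRedPajama-Data\u002Fissues\u002F37",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},9878,"Testing `download.py` in the Wikipedia folder raises `FileNotFoundError` and the data files cannot be resolved. How do I fix this?","This is a known issue, usually related to folder naming or the dataset loading logic. The maintainers fixed it in commit `01c8ec04c1aa1ba979ac75e5684449e0ad43ccbd` and PR #16. Pull the latest code to update the project. If the problem persists, check that the folder name matches what is expected and make sure the dataset load is not failing because of a file name conflict.","https:\u002F\u002Fgithub.com\u002Ftogethercomputer\u002FRedPajama-Data\u002Fissues\u002F14",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},9879,"The RedPajama dataset includes both Common Crawl and C4. Do the two overlap completely and make the data redundant?","Although the C4 dataset is derived from Common Crawl, the RedPajama project includes C4 to stay consistent with the original LLaMA dataset and thereby maximize model performance. C4 went through its own cleaning and processing, so it differs from the Common Crawl data processed directly in this project in both preprocessing and final content distribution. Using both is therefore not simple duplication; it serves to reproduce the baseline and preserve data diversity.","https:\u002F\u002Fgithub.com\u002Ftogethercomputer\u002FRedPajama-Data\u002Fissues\u002F48",{"id":142,"question_zh":143,"answer_zh":144,"source_url":135},9880,"Which languages does the project currently support? Are there plans to add multilingual support, such as Arabic?","The project currently aims to stay as close as possible to the original LLaMA dataset, so it prioritizes the languages LLaMA covers. In response to requests for Arabic and other multilingual data, the maintainers said they plan to add support for more languages and multilingual data in later steps, but it is not included at this stage. Follow the project's updates for progress on multilingual support.",[]]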