[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-huggingface--datatrove":3,"tool-huggingface--datatrove":64},[4,17,25,39,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":10,"last_commit_at":23,"category_tags":24,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":26,"name":27,"github_repo":28,"description_zh":29,"stars":30,"difficulty_score":10,"last_commit_at":31,"category_tags":32,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[33,34,35,36,14,37,15,13,38],"图像","数据工具","视频","插件","其他","音频",{"id":40,"name":41,"github_repo":42,"description_zh":43,"stars":44,"difficulty_score":45,"last_commit_at":46,"category_tags":47,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[14,33,13,15,37],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":45,"last_commit_at":54,"category_tags":55,"status":16},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74913,"2026-04-05T10:44:17",[15,33,13,37],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":45,"last_commit_at":62,"category_tags":63,"status":16},2181,"OpenHands","OpenHands\u002FOpenHands","OpenHands 是一个专注于 AI 
驱动开发的开源平台，旨在让智能体（Agent）像人类开发者一样理解、编写和调试代码。它解决了传统编程中重复性劳动多、环境配置复杂以及人机协作效率低等痛点，通过自动化流程显著提升开发速度。\n\n无论是希望提升编码效率的软件工程师、探索智能体技术的研究人员，还是需要快速原型验证的技术团队，都能从中受益。OpenHands 提供了灵活多样的使用方式：既可以通过命令行（CLI）或本地图形界面在个人电脑上轻松上手，体验类似 Devin 的流畅交互；也能利用其强大的 Python SDK 自定义智能体逻辑，甚至在云端大规模部署上千个智能体并行工作。\n\n其核心技术亮点在于模块化的软件智能体 SDK，这不仅构成了平台的引擎，还支持高度可组合的开发模式。此外，OpenHands 在 SWE-bench 基准测试中取得了 77.6% 的优异成绩，证明了其解决真实世界软件工程问题的能力。平台还具备完善的企业级功能，支持与 Slack、Jira 等工具集成，并提供细粒度的权限管理，适合从个人开发者到大型企业的各类用户场景。",70612,"2026-04-05T11:12:22",[15,14,13,36],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":79,"owner_twitter":75,"owner_website":80,"owner_url":81,"languages":82,"stars":95,"forks":96,"last_commit_at":97,"license":98,"difficulty_score":10,"env_os":99,"env_gpu":100,"env_ram":101,"env_deps":102,"category_tags":112,"github_topics":79,"view_count":10,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":113,"updated_at":114,"faqs":115,"releases":146},2958,"huggingface\u002Fdatatrove","datatrove","Freeing data processing from scripting madness by providing a set of platform-agnostic customizable pipeline processing blocks.","datatrove 是一款专为大规模文本数据处理设计的开源库，旨在帮助开发者和研究人员高效完成数据清洗、过滤及去重任务。面对大语言模型训练中海量数据处理的复杂性，传统脚本往往难以维护且扩展性差，datatrove 通过提供一套可自定义的流水线处理模块，将繁琐的脚本工作转化为清晰、模块化的流程，从而解决了“脚本混乱”的痛点。\n\n该工具特别适合需要处理 TB 级数据的 AI 工程师、数据科学家及大模型研究者。其核心亮点在于平台无关性：同一套处理逻辑无需修改代码，即可在本地单机或 Slurm 集群上无缝运行。datatrove 采用多步骤流水线设计，内存占用低，支持从 CommonCrawl 等原始格式提取文本、合成数据生成到最终分词保存的全链路操作。此外，它基于 fsspec 兼容本地、S3 等多种文件系统，并内置对 Ray 等分布式计算引擎的支持，让用户能灵活应对不同规模的基础设施环境，轻松构建稳健的数据预处理管道。","# DataTrove\n\nDataTrove is a library to process, filter and deduplicate text data at a very large scale. 
It provides a set of prebuilt commonly used processing blocks with a framework to easily add custom functionality.\n\nDataTrove processing pipelines are platform-agnostic, running out of the box locally or on a slurm cluster. Its (relatively) low memory usage and multiple step design makes it ideal for large workloads, such as to process an LLM's training data.\n\nLocal, remote and other file systems are supported through [fsspec](https:\u002F\u002Ffilesystem-spec.readthedocs.io\u002Fen\u002Flatest\u002F).\n\n## Table of contents\n\n\u003C!-- toc -->\n\n- [Installation](#installation)\n- [Quickstart examples](#quickstart-examples)\n- [Terminology](#terminology)\n- [Pipeline](#pipeline)\n  * [DataTrove Document](#datatrove-document)\n  * [Types of pipeline blocks](#types-of-pipeline-blocks)\n  * [Full pipeline](#full-pipeline)\n- [Executors](#executors)\n  * [LocalPipelineExecutor](#localpipelineexecutor)\n  * [SlurmPipelineExecutor](#slurmpipelineexecutor)\n  * [RayPipelineExecutor](#raypipelineexecutor)\n- [Logging](#logging)\n- [DataFolder \u002F paths](#datafolder--paths)\n- [Practical guides](#practical-guides)\n  * [Reading data](#reading-data)\n  * [Synthetic data generation](#synthetic-data-generation)\n    + [Custom rollouts](#custom-rollouts)\n    + [Ready-to-use generation script](#ready-to-use-generation-script)\n    + [Advanced configuration](#advanced-configuration)\n    + [Progress monitoring](#progress-monitoring)\n    + [Benchmarking](#benchmarking)\n  * [Extracting text](#extracting-text)\n  * [Filtering data](#filtering-data)\n  * [Saving data](#saving-data)\n  * [Deduplicating data](#deduplicating-data)\n  * [Summary Statistics](#summary-statistics)\n  * [Custom blocks](#custom-blocks)\n    + [Simple data](#simple-data)\n    + [Custom function](#custom-function)\n    + [Custom block](#custom-block)\n- [Contributing](#contributing)\n- [Citation](#citation)\n\n\u003C!-- tocstop -->\n\n## Installation\n\nRequires Python 3.10+.\n\n```bash\npip 
install datatrove[FLAVOUR]\n```\nAvailable flavours (combine them with `,` i.e. `[processing,s3]`):\n- `all` installs everything: `pip install datatrove[all]`\n- `io` dependencies to read `warc\u002Farc\u002Fwet` files and arrow\u002Fparquet\u002F[Optimized-parquet](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fhub\u002Fen\u002Fdatasets-libraries#optimized-parquet-files) formats: `pip install datatrove[io]`\n- `processing` dependencies for text extraction, filtering and tokenization: `pip install datatrove[processing]`\n- `s3` s3 support: `pip install datatrove[s3]`\n- `cli` for command line tools: `pip install datatrove[cli]`\n- `ray` for distributed compute engine: `pip install datatrove[ray]`\n- `inference` for LLM inference pipelines: `pip install datatrove[inference]`\n- `decont` for decontamination with lighteval: `pip install datatrove[decont]`\n- `multilingual` for multilingual text processing: `pip install datatrove[multilingual]`\n\n## Quickstart examples\nYou can check the following [examples](examples):\n- [fineweb.py](examples\u002Ffineweb.py) full reproduction of the [FineWeb dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceFW\u002Ffineweb)\n- [process_common_crawl_dump.py](examples\u002Fprocess_common_crawl_dump.py) full pipeline to read commoncrawl warc files, extract their text content, filters and save the resulting data to s3. Runs on slurm\n- [tokenize_c4.py](examples\u002Ftokenize_c4.py) reads data directly from huggingface's hub to tokenize the english portion of the C4 dataset using the `gpt2` tokenizer\n- [estimate_tokens.py](examples\u002Festimate_tokens.py) estimate total token counts for large HF datasets — needed to set the correct `SamplerFilter` rate when creating a random shuffled subsample (e.g. 100B tokens from a multi-trillion-token dataset). 
Streams a small sample per dataset, converges on the average tokens\u002Fdoc, and multiplies by the total row count.\n- [smol_data.py](examples\u002Fsmol_data.py) builds ~100B token subsets, 50-30-20 mixtures, and shuffled variants for several large Hugging Face datasets\n- [minhash_deduplication.py](examples\u002Fminhash_deduplication.py) full pipeline to run minhash deduplication of text data\n- [sentence_deduplication.py](examples\u002Fsentence_deduplication.py) example to run sentence-level exact deduplication\n- [exact_substrings.py](examples\u002Fexact_substrings.py) example to run ExactSubstr (requires [this repo](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fdeduplicate-text-datasets))\n- [finephrase.py](examples\u002Finference\u002Ffinephrase.py) standalone example to generate a synthetic dataset at scale with multiple prompt templates\n\n## Terminology\n- `pipeline`: a list of processing steps to execute (read data, filter, write to disk, etc.)\n- `executor`: runs a specific pipeline on a given execution environment (slurm, multi cpu machine, etc.)\n- `job`: the execution of a pipeline on a given executor\n- `task`: a `job` consists of multiple `tasks`, which are used to parallelize execution, usually by having each `task` process a `shard` of data. Datatrove keeps track of which tasks have completed, so when you relaunch, only incomplete tasks will run.\n- `file`: an individual input file (.json, .csv, etc.).\n> [!TIP]\n> Note that each file will be processed by a single `task`. Datatrove does not automatically split a file into multiple parts, so to fully parallelize you should have multiple medium-sized files rather than a single large file.\n- `shard`: a group of input data (usually a group of `file`s), which will be assigned to a specific `task`. 
Each `task` will process a different non overlapping `shard` of data, from the full list of input files\n- `worker`: compute resource that will execute a single task at a time, e.g., if you have 50 cpu cores you can run a LocalPipelineExecutor with `workers=50`, to execute 50 `tasks` simultaneously (one per cpu). Once a `worker` is done with a `task`, it will start processing another waiting `task`\n\n> [!TIP]\n> Your number of `tasks` controls how much you can parallelize and also how much time each individual processing unit will take. If you have a small number of tasks (and they each therefore have to process a large number of files) and they fail, you will have to restart from scratch, whereas if you have a larger number of small tasks each failed task will take way less time to rerun.\n\n> [!CAUTION]\n> If your `tasks` > `files`, some tasks will not process any data, so there usually isn't a point in setting `tasks` to a number larger than `files`.\n\n### Example \nRunning a `job` to process **10000** `files`, on a machine with **100** cpu cores (`workers`). If we choose to use **1000** tasks, each one will process a `shard` of 10 files. 
`workers=100` means that we can process **100** tasks at a time.\n\n## Pipeline\n### DataTrove Document\nEach pipeline block processes data in the datatrove [`Document`](src\u002Fdatatrove\u002Fdata.py) format:\n- `text` the actual text content for each sample\n- `id` a unique id (string) for this sample\n- `metadata` a dictionary where any additional info may be stored\n\n### Types of pipeline blocks\nEach pipeline block takes a generator of `Document` as input and returns another generator of `Document`.\n- **[readers](src\u002Fdatatrove\u002Fpipeline\u002Freaders)** read data from different formats and yield `Document`\n- **[writers](src\u002Fdatatrove\u002Fpipeline\u002Fwriters)** save `Document` to disk\u002Fcloud in different formats\n- **[extractors](src\u002Fdatatrove\u002Fpipeline\u002Fextractors)** extract text content from raw formats (such as webpage html)\n- **[filters](src\u002Fdatatrove\u002Fpipeline\u002Ffilters)** filter out (remove) some `Document`s based on specific rules\u002Fcriteria\n- **[stats](src\u002Fdatatrove\u002Fpipeline\u002Fstats)** blocks to collect statistics on the dataset\n- **[tokens](src\u002Fdatatrove\u002Fpipeline\u002Ftokens)** blocks to tokenize data or count tokens\n- **[dedup](src\u002Fdatatrove\u002Fpipeline\u002Fdedup)** blocks for deduplication\n### Full pipeline\nA pipeline is defined as a list of pipeline blocks. 
As an example, the following pipeline would read data from disk, randomly filter (remove) some documents and write them back to disk:\n```python\nfrom datatrove.pipeline.readers import CSVReader\nfrom datatrove.pipeline.filters import SamplerFilter\nfrom datatrove.pipeline.writers import JsonlWriter\n\npipeline = [\n    CSVReader(\n        data_folder=\"\u002Fmy\u002Finput\u002Fpath\"\n    ),\n    SamplerFilter(rate=0.5),\n    JsonlWriter(\n        output_folder=\"\u002Fmy\u002Foutput\u002Fpath\"\n    )\n]\n```\n\n## Executors\nPipelines are platform-agnostic, which means that the same pipeline can smoothly run on different execution environments without any changes to its steps. Each environment has its own PipelineExecutor.\n\nSome options common to all executors:\n- `pipeline` a list consisting of the pipeline steps that should be run\n- `logging_dir` a datafolder where log files, statistics and more should be saved. Do not reuse folders for different pipelines\u002Fjobs as this will overwrite your stats, logs and completions.\n- `skip_completed` (_bool_, `True` by default) datatrove keeps track of completed tasks so that when you relaunch a job they can be skipped. Set this to `False` to disable this behaviour\n- `randomize_start_duration` (_int_, `0` by default) the maximum number of seconds to delay the start of each task to prevent all tasks from starting simultaneously and potentially overloading the system. \n\nCall an executor's `run` method to execute its pipeline.\n\n\n> [!TIP]\n> Datatrove keeps track of which tasks successfully completed by creating a marker (an empty file) in the `${logging_dir}\u002Fcompletions` folder. 
Once the job finishes, if some of its tasks have failed, you can **simply relaunch the exact same executor** and datatrove will check and only run the tasks that were not previously completed.\n\n> [!CAUTION]\n> If you relaunch a pipeline because some tasks failed, **do not change the total number of tasks** as this will affect the distribution of input files\u002Fsharding.\n\n\n\n### LocalPipelineExecutor\nThis executor will launch a pipeline on a local machine.\nOptions:\n- `tasks` total number of tasks to run\n- `workers` how many tasks to run simultaneously. If `-1`, no limit. Anything `> 1` will use multiprocessing to execute the tasks.\n- `start_method` method to use to spawn a multiprocessing Pool. Ignored if `workers` is 1\n\n\u003Cdetails>\n  \u003Csummary>Example executor\u003C\u002Fsummary>\n\n```python\nfrom datatrove.executor import LocalPipelineExecutor\nexecutor = LocalPipelineExecutor(\n    pipeline=[\n        ...\n    ],\n    logging_dir=\"logs\u002F\",\n    tasks=10,\n    workers=5\n)\nexecutor.run()\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Multi-node parallelism\u003C\u002Fsummary>\n\nYou can have different nodes\u002Fmachines process different parts of the total tasks by using the `local_tasks` and `local_rank_offset`. For each node\u002Finstance\u002Fmachine, launch with the following options:\n- `tasks` the total tasks to be executed (across all machines). **This value must be the same on each machine or the input file distribution may overlap!** Example: 500\n- `local_tasks` how many tasks of the total will be executed on this particular machine. Note that you can use different values for each machine. Example: 100\n- `local_rank_offset` the rank of the first task to be executed on this machine. 
If this is the 3rd machine where you are launching a job, and the 2 previous machines ran 250 and 150 tasks respectively, this would be `400` for the current machine.\n\nTo get final merged stats you will have to invoke the `merge_stats` script manually on a path containing the stats from all machines.\n\u003C\u002Fdetails>\n\n### SlurmPipelineExecutor\nThis executor will launch a pipeline on a slurm cluster, using slurm job arrays to group and manage tasks.\nOptions:\n- `tasks` total number of tasks to run. **required**\n- `time` slurm time limit string. **required**\n- `partition` slurm partition. **required**\n- `workers` how many tasks to run simultaneously. If `-1`, no limit. Slurm will run `workers` tasks at a time. (default: `-1`)\n- `job_name` slurm job name (default: \"data_processing\")\n- `depends` another SlurmPipelineExecutor instance, which will be a dependency of this pipeline (current pipeline will only start executing after the depended on pipeline successfully completes)\n- `sbatch_args` dictionary with any other arguments you would like to pass to sbatch\n- `slurm_logs_folder` where to save the slurm log files. If using a local path for `logging_dir`, they will be saved in `logging_dir\u002Fslurm_logs`. If not, they will be saved as a subdir of the current directory.\n\u003Cdetails>\n  \u003Csummary>Other options\u003C\u002Fsummary>\n\n- `cpus_per_task` how many cpus to give each task (default: `1`)\n- `qos` slurm qos (default: \"normal\")\n- `mem_per_cpu_gb` memory per cpu, in GB (default: 2)\n- `env_command` custom command to activate a python environment, if needed\n- `condaenv` conda environment to activate\n- `venv_path` path to a python environment to activate\n- `max_array_size` the _MaxArraySize_ value in `$ scontrol show config`. 
If number of tasks exceeds this number, it will split into multiple array jobs (default: 1001)\n- `max_array_launch_parallel` if we need multiple jobs due to max_array_size, whether to launch them all in one go (parallel) or sequentially (default: `False`)\n- `stagger_max_array_jobs` when max_array_launch_parallel is True, this determines how many seconds to wait between launching each of the parallel jobs (default: `0`)\n- `run_on_dependency_fail` start executing when a job we depend on finishes even if it has failed (default: `False`)\n- `randomize_start` randomize the start of each task in a job in a ~3 min window. Useful when heavily hitting an s3 bucket for example. (default: `False`)\n\u003C\u002Fdetails>\n\u003Cdetails>\n  \u003Csummary>Example executor\u003C\u002Fsummary>\n\n```python\nfrom datatrove.executor import SlurmPipelineExecutor\nexecutor1 = SlurmPipelineExecutor(\n    pipeline=[\n        ...\n    ],\n    job_name=\"my_cool_job1\",\n    logging_dir=\"logs\u002Fjob1\",\n    tasks=500,\n    workers=100,  # omit to run all at once\n    time=\"10:00:00\",  # 10 hours\n    partition=\"hopper-cpu\"\n)\nexecutor2 = SlurmPipelineExecutor(\n    pipeline=[\n        ...\n    ],\n    job_name=\"my_cool_job2\",\n    logging_dir=\"logs\u002Fjob2\",\n    tasks=1,\n    time=\"5:00:00\",  # 5 hours\n    partition=\"hopper-cpu\",\n    depends=executor1  # this pipeline will only be launched after executor1 successfully completes\n)\n# executor1.run()\nexecutor2.run() # this will actually launch executor1, as it is a dependency, so no need to launch it explicitly\n```\n\u003C\u002Fdetails>\n\n### RayPipelineExecutor\nThis executor will launch a pipeline on a ray cluster, using ray tasks for parallel execution.\nOptions:\n- `tasks` total number of tasks to run.\n- `workers` how many tasks to run simultaneously. If `-1`, no limit. Ray will run `workers` tasks at a time. 
(default: `-1`)\n- `depends` another RayPipelineExecutor instance, which will be a dependency of this pipeline (current pipeline will only start executing after the depended on pipeline successfully completes)\n\u003Cdetails>\n  \u003Csummary>Other options\u003C\u002Fsummary>\n\n- `cpus_per_task` how many cpus to give each task (default: `1`)\n- `mem_per_cpu_gb` memory per cpu, in GB (default: 2)\n- `ray_remote_kwargs` additional kwargs to pass to the ray.remote decorator\n\u003C\u002Fdetails>\n\u003Cdetails>\n  \u003Csummary>Example executor\u003C\u002Fsummary>\n\n```python\nimport ray\nfrom datatrove.executor import RayPipelineExecutor\nray.init()\nexecutor = RayPipelineExecutor(\n    pipeline=[\n        ...\n    ],\n    logging_dir=\"logs\u002F\",\n    tasks=500,\n    workers=100,  # omit to run all at once\n)\nexecutor.run()\n```\n\u003C\u002Fdetails>\n\n## Logging\nFor a pipeline with `logging_dir` **mylogspath\u002Fexp1**, the following folder structure would be created:\n\n\u003Cdetails>\n  \u003Csummary>See folder structure\u003C\u002Fsummary>\n\n```\n└── mylogspath\u002Fexp1\n    │── executor.json ⟵ json dump of the executor options and pipeline steps\n    │── launch_script.slurm ⟵ the slurm config created and used to launch this job (if running on slurm)\n    │── executor.pik ⟵ pickled executor object loaded by the launched job (if running on slurm)\n    │── ranks_to_run.json ⟵ list of tasks that are being run\n    │── logs\u002F\n    │   └──[task_00000.log, task_00001.log, task_00002.log, ...] ⟵ individual logging files for each task\n    │── completions\u002F\n    │   └──[00004, 00007, 00204, ...] ⟵ empty files marking a task as completed. Used when relaunching\u002Fresuming a job (only unfinished tasks will be run)\n    │── stats\u002F\n    │   └──[00000.json, 00001.json, 00002.json, ...] 
⟵ individual stats for each task (number of samples processed, filtered, removed, etc)\n    └── stats.json ⟵ global stats from all tasks\n```\n\u003C\u002Fdetails>\n\n### Colorization\nLog messages support colorization. By default, colorization will be auto detected for console messages and disabled for log files (logs\u002Ftask_XXXXX.log).\nTo explicitly enable or disable colorization, you may set the following environment variables:\n- `DATATROVE_COLORIZE_LOGS` \"1\" to add ANSI colors to console log messages and \"0\" to disable colorization.\n- `DATATROVE_COLORIZE_LOG_FILES` set to \"1\" to add ANSI colors to log messages saved to logs\u002Ftask_XXXXX.log.\n\n## DataFolder \u002F paths\nDatatrove supports a wide variety of input\u002Foutput sources through [fsspec](https:\u002F\u002Ffilesystem-spec.readthedocs.io\u002Fen\u002Flatest\u002F).\n\nThere are a few ways to provide a path to a datatrove block (for `input_folder`, `logging_dir`, `data_folder` and so on arguments):\n- `str`: the simplest way is to pass a single string. Example: `\u002Fhome\u002Fuser\u002Fmydir`, `s3:\u002F\u002Fmybucket\u002Fmyinputdata`, `hf:\u002F\u002Fdatasets\u002Fallenai\u002Fc4\u002Fen\u002F`\n\n- `(str, fsspec filesystem instance)`: a string path and a fully initialized filesystem object. Example: `(\"s3:\u002F\u002Fmybucket\u002Fmyinputdata\", S3FileSystem(client_kwargs={\"endpoint_url\": endpoint_uri}))`\n- `(str, dict)`: a string path and a dictionary with options to initialize a fs. 
Example (equivalent to the previous line): `(\"s3:\u002F\u002Fmybucket\u002Fmyinputdata\", {\"client_kwargs\": {\"endpoint_url\": endpoint_uri}})`\n- `DataFolder`: you can initialize a [DataFolder](src\u002Fdatatrove\u002Fio.py) object directly and pass it as an argument\n\nUnder the hood these argument combinations are parsed by [`get_datafolder`](src\u002Fdatatrove\u002Fio.py#116).\n\n## Practical guides\n\n### Reading data\nUsually, pipelines will start with a [Reader](src\u002Fdatatrove\u002Fpipeline\u002Freaders) block.\nMost readers take a `data_folder` argument — a path to a folder containing the data to be read.\n\nThese files will be distributed across each task. If you have `N` tasks, task with rank `i` (0-based) will process files `i, i+N, i+2N, i+3N,...`.\n\nInternally, each reader reads data and converts it into a dictionary before creating a `Document` object.\n\nSome options common to most readers:\n- `text_key` the dictionary key containing the text content for each sample. Default: `text`\n- `id_key` the dictionary key containing the id for each sample. Default: `id`\n- `default_metadata` a dictionary for any default metadata values you would like to add (such as their source, for example)\n- `recursive` whether to look for files recursively in `data_folder`'s subdirectories\n- `glob_pattern` use this field to match specific files. For instance, `glob_pattern=\"*\u002Fwarc\u002F*.warc.gz\"` will match files with a `.warc.gz` file extension on the `warc\u002F` folder of each of the `data_folder`'s subdirectories\n- `adapter` this function takes the raw dictionary obtained from the reader and returns a dictionary with `Document`'s field names. You may overwrite this function ([_default_adapter](src\u002Fdatatrove\u002Fpipeline\u002Freaders\u002Fbase.py)) if you would like.\n- `limit` read only a certain number of samples. 
Useful for testing\u002Fdebugging\n\n### Synthetic data generation\nInstall the inference extras with `pip install datatrove[inference]` to pull in the lightweight HTTP client, checkpointing dependencies and async sqlite cache.\n\nWe support [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm), [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang), OpenAI-compatible HTTPS endpoints and a local `dummy` server through the [InferenceRunner block](src\u002Fdatatrove\u002Fpipeline\u002Finference\u002Frun_inference.py). Each datatrove task can spin up its own server replica (for `vllm`, `sglang` or `dummy`) or talk directly to an external endpoint while asynchronous batching keeps GPU utilization high.\n\n#### Custom rollouts\n\nThe core abstraction is a **rollout function**—a plain async callable that receives a `Document`, a `generate(payload)` callback, and any extra kwargs from `shared_context`. You can freely orchestrate multiple sequential or parallel `generate` calls inside the rollout. This gives you full control over how prompts are constructed and how generations are combined. See [inference_chunked.py](examples\u002Finference\u002Finference_chunked.py) for examples of:\n- Simple single-request rollouts\n- Chunked rollouts that split long documents and stitch generations together\n- CPU-heavy preprocessing with process pools via `shared_context`\n- Multi-node distributed inference\n\nSet `rollouts_per_document` to automatically run the same rollout multiple times per sample; the runner collects successful outputs under `document.metadata[\"rollout_results\"]`.\n\n#### Ready-to-use generation script\n\nFor a ready-to-use script for synthetic data generation at scale (supporting models from 1B to 1T parameters, local\u002FSLURM execution, and multi-node setups), see [`generate_data.py`](examples\u002Finference\u002Fgenerate_data.py). 
This script handles prompt-based generation with configurable system prompts and templates.\n\n#### Advanced configuration\n\n`shared_context` lets you inject shared state into every rollout invocation. It accepts:\n- a dict (passed through as keyword arguments),\n- a callable returning a dict (handy for lazily creating resources),\n- a context manager or a callable returning one (great for pools, GPU allocators, temp dirs, etc.). Context managers are properly entered\u002Fexited once per task.\n\nRecoverable generation:\n- Setting `checkpoints_local_dir` together with `records_per_chunk` writes every `Document` to local chunk files (remember to include `${chunk_index}` in the output filename template), then uploads them via the configured writer. Failed tasks automatically resume from the last finished chunk.\n- When checkpointing is enabled a sqlite-backed `RequestCache` deduplicates individual rollouts via payload hashes (requires `xxhash` and `aiosqlite`) so completed generations are never re-sent during retries.\n- Set `skip_bad_requests=True` on `InferenceRunner` to skip provider-side `BadRequestError`s (for example, context\u002Fwindow overflows) and keep the remaining documents running.\n\nTune batching with `max_concurrent_generations` and, when pre\u002Fpost-processing is heavy, raise `max_concurrent_documents` to allow more rollout coroutines to build payloads while requests are in flight.\n\n\u003Cdetails>\n  \u003Csummary>Minimal end-to-end example\u003C\u002Fsummary>\n\n  ```\n  from datatrove.data import Document\n  from datatrove.executor.local import LocalPipelineExecutor\n  from datatrove.pipeline.inference.run_inference import InferenceConfig, InferenceRunner\n  from datatrove.pipeline.writers import JsonlWriter\n\n  async def simple_rollout(doc: Document, generate):\n      payload = {\"messages\": [{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": doc.text}]}], \"max_tokens\": 2048}\n      return await generate(payload)\n\n  
documents = [Document(text=\"What's the weather in Tokyo?\", id=str(i)) for i in range(1005)]\n  config = InferenceConfig(server_type=\"vllm\", model_name_or_path=\"google\u002Fgemma-3-27b-it\", rollouts_per_document=1, max_concurrent_generations=500)\n\n  LocalPipelineExecutor(\n      pipeline=[\n          documents,\n          InferenceRunner(\n              rollout_fn=simple_rollout,\n              config=config,\n              skip_bad_requests=True,\n              records_per_chunk=500,\n              checkpoints_local_dir=\"\u002Ffsx\u002F...\u002Ftranslate-checkpoints\",\n              output_writer=JsonlWriter(\"s3:\u002F\u002F...\u002Ffinal_output_data\", output_filename=\"${rank}_chunk_${chunk_index}.jsonl\"),\n          ),\n      ],\n      logging_dir=\"\u002Ffsx\u002F...\u002Finference_logs\",\n      tasks=1,\n  ).run()\n  ```\n\u003C\u002Fdetails>\n\nThe extended [inference_chunked.py](examples\u002Finference\u002Finference_chunked.py) script demonstrates single- and multi-rollout flows, resumable checkpoints and sharing a process pool across rollouts.\n\n#### Progress monitoring\n\nFor long-running inference jobs, you can use `InferenceProgressMonitor` to periodically update a HuggingFace dataset card with a progress bar and ETA. After inference completes, `InferenceDatasetCardGenerator` creates a final dataset card with statistics.\n\n```python\nfrom datatrove.pipeline.inference import InferenceDatasetCardParams, InferenceProgressMonitor, InferenceDatasetCardGenerator\n\nparams = InferenceDatasetCardParams(\n    output_repo_id=\"your-username\u002Foutput-dataset\",\n    input_dataset_name=\"simplescaling\u002Fs1K-1.1\",\n    input_dataset_split=\"train\",\n    model_name=\"Qwen\u002FQwen3-0.6B\",\n    # ... 
other params\n)\n\n# Monitor pipeline (runs in parallel with inference on Slurm)\nmonitor_pipeline = [InferenceProgressMonitor(params=params, update_interval=3600)]\n\n# Final card generation (runs after inference completes)\ndatacard_pipeline = [InferenceDatasetCardGenerator(params=params)]\n```\n\nSee [progress_monitoring.py](examples\u002Finference\u002Fprogress_monitoring.py) for a complete example with Slurm integration.\n\n#### Benchmarking\n\nTo measure vLLM throughput across different models and configurations (TP, PP, speculative decoding), use the [benchmark tools](examples\u002Finference\u002Fbenchmark\u002FREADME.md). The benchmark suite provides:\n- **`launch_experiments.py`**: Launch sweep experiments from YAML config with automatic Slurm job submission\n- **`analyze_results.py`**: Parse server logs and generate CSV summaries with metrics like tokens\u002Fs per GPU and GPU-days to process 1B tokens\n\n### Extracting text\nYou can use [extractors](src\u002Fdatatrove\u002Fpipeline\u002Fextractors) to extract text content from raw html. The most commonly used extractor in datatrove is [Trafilatura](src\u002Fdatatrove\u002Fpipeline\u002Fextractors\u002Ftrafilatura.py), which uses the [trafilatura](https:\u002F\u002Ftrafilatura.readthedocs.io\u002Fen\u002Flatest\u002F) library.\n\n### Filtering data\n[Filters](src\u002Fdatatrove\u002Fpipeline\u002Ffilters) are some of the most important blocks of any data processing pipeline. Datatrove's filter blocks take a `Document` and return a boolean (`True` to keep a document, `False` to remove it). Removed samples do not continue to the next pipeline stage. You can also save the removed samples to disk by passing a [Writer](src\u002Fdatatrove\u002Fpipeline\u002Fwriters) to the `exclusion_writer` parameter.\n\n### Saving data\nOnce you are done processing your data you will probably want to save it somewhere. 
For this you can use a [writer](src\u002Fdatatrove\u002Fpipeline\u002Fwriters\u002Fjsonl.py).\nWriters require an `output_folder` (the path where data should be saved). You can choose the `compression` to use (default: `gzip`) and the filename to save each file as.\nFor the `output_filename`, a template is applied using the following arguments:\n- `${rank}` replaced with the current task's rank. Note that if this tag isn't present, **different tasks may try to write to the same location**\n- `${id}` replaced with the sample id\n- metadata: any other `${tag}` will be replaced with the corresponding `document.metadata['tag']` value\n\nAn example to separate samples by language based on their `language` metadata field:\n```python\nJsonlWriter(\n    f\"{MAIN_OUTPUT_PATH}\u002Fnon_english\u002F\",\n    output_filename=\"${language}\u002F\" + DUMP + \"\u002F${rank}.jsonl.gz\",  # folder structure: language\u002Fdump\u002Ffile\n)\n```\n\n### Deduplicating data\nFor deduplication, see the examples [minhash_deduplication.py](examples\u002Fminhash_deduplication.py), [sentence_deduplication.py](examples\u002Fsentence_deduplication.py) and [exact_substrings.py](examples\u002Fexact_substrings.py).\n\n### Summary Statistics\nFor summary statistics on your data you can use the [Stats](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Ftree\u002Fmain\u002Fsrc\u002Fdatatrove\u002Fpipeline\u002Fstats) blocks. These blocks provide an easy way to collect data profiles on your dataset in a distributed manner. 
It's a two-step process:\n1) For each shard, iterate over documents and collect stats into one of the following groupings: `summary` (all docs counted to \"summary\" key), `fqdn` (fully qualified domain name grouping), `suffix` (the last part of the url path grouping) or `histogram` (value based grouping).\n2) Merge the stats from different shards into a single file.\nSee [summary_stats.py](examples\u002Fsummary_stats.py) for more details.\n\nEach resulting stat is saved in a separate file with the following structure: `output_folder\u002F{fqdn,suffix,summary,histogram}\u002F{stat_name}\u002Fmetric.json`\n\nEach such file is a `MetricStatsDict` object, which you can easily load using:\n```python\nfrom datatrove.pipeline.stats.summary_stats import MetricStatsDict\nimport json\nstats = MetricStatsDict.from_dict(json.load(open(\"fqdn\u002Flength\u002Fmetric.json\")))\n\n# E.g. for the total length of nytimes.com docs\nstats[\"nytimes.com\"].total\n\n# Or for the mean length of cnn.com docs\nstats[\"cnn.com\"].mean\n```\n\nThe following stats are available:\n- `contamination_stats.py`: `word_contamination_{words[0]}`: Frequency of contamination words in the document.\n- `doc_stats.py`: `length`: Length of the document, `white_space_ratio`: Ratio of whitespace characters, `non_alpha_digit_ratio`: Ratio of non-alphabetic and non-digit characters, `digit_ratio`: Ratio of digits, `uppercase_ratio`: Ratio of uppercase letters, `elipsis_ratio`: Ratio of ellipsis characters, `punctuation_ratio`: Punctuation ratio\n- `lang_stats.py`: `fasttext_{language}`: Score of the document being written in `language`. 
The score is computed using a FastText model.\n- `line_stats.py`: `n_lines`: Number of lines per doc, `avg_line_length`: Average length of line per doc, `long_line_ratio_chars_{chars}`: Ratio of lines with more than k chars, `short_line_ratio_chars_{chars}`: Ratio of lines with less than k chars, `bullet_point_lines_ratio`: Ratio of lines starting with a bullet point, `line_duplicates`: Ratio of lines that are duplicates, `line_char_duplicates`: Ratio of chars in duplicated lines to all chars.\n- `paragraph_stats.py`: `n_paragraphs`: Number of paragraphs, `avg_paragraph_length`: Average paragraph length, `short_paragraph_ratio_{chars}`: Ratio of short paragraphs (`\u003C{chars}` chars), `long_paragraph_ratio_{chars}`: Ratio of long paragraphs (`>{chars}` chars)\n- `perplexity_stats.py`: `ccnet_perplexity_{model_dataset}_{language}`: Perplexity of the document using the CCNet model for `{model}` on `{dataset}` in `{language}`\n- `sentence_stats.py`: `n_sentences`: Number of sentences, `avg_sentence_length`: Average sentence length, `short_sentence_ratio_{chars}`: Ratio of short sentences (`\u003C{chars}` chars), `long_sentence_ratio_{chars}`: Ratio of long sentences (`>{chars}` chars)\n- `token_stats.py`: `token_count`: Number of tokens in the document\n- `word_stats.py`: `n_words`: Number of words in the document, `avg_word_length`: Average length of words in the document, `avg_words_per_line`: Average number of words per line in the document, `short_word_ratio_{chars}`: Ratio of words shorter than `{chars}` characters, `stop_word_ratio`: Ratio of stop words, `long_word_ratio_{chars}`: Ratio of words longer than `{chars}` characters, `type_token_ratio`: Number of unique words \u002F Number of tokens, `capitalized_word_ratio`: Ratio of capitalized words, `uppercase_word_ratio`: Ratio of uppercase words\n\n### Custom blocks\n\n#### Simple data\nYou can pass an iterable of [`Document`](src\u002Fdatatrove\u002Fdata.py) directly as a pipeline block like 
so:\n```python\nfrom datatrove.data import Document\nfrom datatrove.pipeline.filters import SamplerFilter\nfrom datatrove.pipeline.writers import JsonlWriter\n\npipeline = [\n    [\n        Document(text=\"some data\", id=\"0\"),\n        Document(text=\"some more data\", id=\"1\"),\n        Document(text=\"even more data\", id=\"2\"),\n    ],\n    SamplerFilter(rate=0.5),\n    JsonlWriter(\n        output_folder=\"\u002Fmy\u002Foutput\u002Fpath\"\n    )\n]\n```\n\nDo note, however, that this iterable will not be sharded (if you launch more than 1 task they will all get the full iterable).\nThis is usually useful for small workloads\u002Ftesting.\n\n#### Custom function\nFor simple processing you can simply pass in a custom function with the following signature:\n```python\nfrom datatrove.data import DocumentsPipeline\n\ndef uppercase_everything(data: DocumentsPipeline, rank: int = 0, world_size: int = 1) -> DocumentsPipeline:\n    \"\"\"\n        `data` is a generator of Document. You must also return a generator of Document (yield)\n        You can optionally use `rank` and `world_size` for sharding\n    \"\"\"\n    for document in data:\n        document.text = document.text.upper()\n        yield document\n\npipeline = [\n    ...,\n    uppercase_everything,\n    ...\n]\n```\n> [!TIP]\n> You might have some pickling issues due to the imports. 
If this happens, simply move whatever imports you need inside the function body.\n\n#### Custom block\nYou can also define a full block inheriting from [`PipelineStep`](src\u002Fdatatrove\u002Fpipeline\u002Fbase.py) or one of its subclasses:\n\n```python\nfrom datatrove.pipeline.base import PipelineStep\nfrom datatrove.data import DocumentsPipeline\nfrom datatrove.io import DataFolderLike, get_datafolder\n\n\nclass UppercaserBlock(PipelineStep):\n    def __init__(self, some_folder: DataFolderLike, some_param: int = 5):\n        super().__init__()\n        # you can take whatever parameters you need and save them here\n        self.some_param = some_param\n        # to load datafolders use get_datafolder()\n        self.some_folder = get_datafolder(some_folder)\n\n    def run(self, data: DocumentsPipeline, rank: int = 0, world_size: int = 1) -> DocumentsPipeline:\n        # you could also load data from the `some_folder`:\n        for filepath in self.some_folder.get_shard(rank, world_size): # it also accepts a glob pattern, among other things\n            with self.some_folder.open(filepath, \"rt\") as f:\n                # do something\n                ...\n                yield doc\n\n        #\n        # OR process data from previous blocks (`data`)\n        #\n\n        for doc in data:\n            with self.track_time():\n                # you can wrap the main processing code in `track_time` to know how much each document took to process\n                nr_uppercase_letters = sum(map(lambda c: c.isupper(), doc.text))\n                # you can also keep track of stats per document using stat_update\n                self.stat_update(\"og_upper_letters\", value=nr_uppercase_letters)\n                doc.text = doc.text.upper()\n            # make sure you keep the yield outside the track_time block, or it will affect the time calculation\n            yield doc\n\n        #\n        # OR save data to disk\n        #\n\n        with 
self.some_folder.open(\"myoutput\", \"wt\") as f:\n            for doc in data:\n                f.write(doc...)\n```\n\n```python\npipeline = [\n    ...,\n    UppercaserBlock(\"somepath\"),\n    ...\n]\n```\n\nYou could also inherit from [`BaseExtractor`](src\u002Fdatatrove\u002Fpipeline\u002Fextractors\u002Fbase.py), [`BaseFilter`](src\u002Fdatatrove\u002Fpipeline\u002Ffilters\u002Fbase_filter.py), [`BaseReader`\u002F`BaseDiskReader`](src\u002Fdatatrove\u002Fpipeline\u002Freaders\u002Fbase.py), or [`DiskWriter`](src\u002Fdatatrove\u002Fpipeline\u002Fwriters\u002Fdisk_base.py).\n\n## Contributing\n\n```bash\ngit clone git@github.com:huggingface\u002Fdatatrove.git && cd datatrove\npip install -e \".[dev]\"\n```\n\nInstall pre-commit code style hooks:\n```bash\npre-commit install\n```\n\nRun code style checks:\n```bash\n# Fast local loop (changed Python files only)\nmake quality\nmake style\n\n# Full repository checks (same scope as CI)\nmake quality-full\nmake style-full\n```\n\nRun the tests:\n```bash\npytest -sv .\u002Ftests\u002F\n```\n\n## Citation\n\n```bibtex\n@misc{penedo2024datatrove,\n  author = {Penedo, Guilherme and Kydlíček, Hynek and Cappelli, Alessandro and Sasko, Mario and Wolf, Thomas},\n  title = {DataTrove: large scale data processing},\n  year = {2024},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  url = {https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove}\n}\n```\n","# DataTrove\n\nDataTrove is a library to process, filter and deduplicate text data at a very large scale. It provides a set of prebuilt, commonly used processing blocks together with a framework that makes it easy to add custom functionality.\n\nDataTrove processing pipelines are platform-agnostic: they run out of the box locally or on a Slurm cluster. Its relatively low memory usage and multi-step design make it well suited to large workloads, such as processing an LLM's training data.\n\nLocal, remote and other file systems are supported through [fsspec](https:\u002F\u002Ffilesystem-spec.readthedocs.io\u002Fen\u002Flatest\u002F).\n\n## Table of contents\n\n\u003C!-- toc -->\n\n- [Installation](#installation)\n- [Quickstart examples](#quickstart-examples)\n- [Terminology](#terminology)\n- [Pipeline](#pipeline)\n  * [DataTrove Document](#datatrove-document)\n  * [Types of pipeline blocks](#types-of-pipeline-blocks)\n  * [Full pipeline](#full-pipeline)\n- [Executors](#executors)\n  * 
[LocalPipelineExecutor](#localpipelineexecutor)\n  * [SlurmPipelineExecutor](#slurmpipelineexecutor)\n  * [RayPipelineExecutor](#raypipelineexecutor)\n- [Logging](#logging)\n- [DataFolder \u002F paths](#datafolder--paths)\n- [Practical guides](#practical-guides)\n  * [Reading data](#reading-data)\n  * [Synthetic data generation](#synthetic-data-generation)\n    + [Custom rollouts](#custom-rollouts)\n    + [Ready-to-use generation script](#ready-to-use-generation-script)\n    + [Advanced configuration](#advanced-configuration)\n    + [Progress monitoring](#progress-monitoring)\n    + [Benchmarking](#benchmarking)\n  * [Extracting text](#extracting-text)\n  * [Filtering data](#filtering-data)\n  * [Saving data](#saving-data)\n  * [Deduplicating data](#deduplicating-data)\n  * [Summary Statistics](#summary-statistics)\n  * [Custom blocks](#custom-blocks)\n    + [Simple data](#simple-data)\n    + [Custom function](#custom-function)\n    + [Custom block](#custom-block)\n- [Contributing](#contributing)\n- [Citation](#citation)\n\n\u003C!-- tocstop -->\n\n## Installation\n\nRequires Python 3.10 or later.\n\n```bash\npip install datatrove[FLAVOUR]\n```\nAvailable flavours (combine them with commas, e.g. `[processing,s3]`):\n- `all`: installs all dependencies: `pip install datatrove[all]`\n- `io`: dependencies to read `warc\u002Farc\u002Fwet` files and arrow\u002Fparquet\u002F[Optimized-parquet](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fhub\u002Fen\u002Fdatasets-libraries#optimized-parquet-files) formats: `pip install datatrove[io]`\n- `processing`: dependencies for text extraction, filtering and tokenization: `pip install datatrove[processing]`\n- `s3`: S3 support: `pip install datatrove[s3]`\n- `cli`: command line tools: `pip install datatrove[cli]`\n- `ray`: distributed compute engine support: `pip install datatrove[ray]`\n- `inference`: for LLM inference pipelines: `pip install datatrove[inference]`\n- `decont`: decontamination support using lighteval: `pip install datatrove[decont]`\n- `multilingual`: for multilingual text processing: `pip install datatrove[multilingual]`\n\n## Quickstart examples\nYou can check the following [examples](examples):\n- [fineweb.py](examples\u002Ffineweb.py): full reproduction of the [FineWeb dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FHuggingFaceFW\u002Ffineweb)\n- [process_common_crawl_dump.py](examples\u002Fprocess_common_crawl_dump.py): full pipeline to read CommonCrawl WARC files, extract their text content, filter it and save the resulting data to S3. Runs on Slurm.\n- 
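[tokenize_c4.py](examples\u002Ftokenize_c4.py): reads data directly from the Hugging Face Hub to tokenize the English portion of the C4 dataset using the `gpt2` tokenizer.\n- [estimate_tokens.py](examples\u002Festimate_tokens.py): estimates the total token count of large HF datasets. This is needed to set the right `SamplerFilter` rate when creating a shuffled subsample (e.g. 100B tokens out of a dataset with hundreds of billions of tokens). The example streams a small sample of each dataset until the average number of tokens per document converges, then multiplies it by the total row count.\n- [smol_data.py](examples\u002Fsmol_data.py): builds subsets of about 100B tokens from several large Hugging Face datasets, including a 50-30-20 mixture and shuffled variants.\n- [minhash_deduplication.py](examples\u002Fminhash_deduplication.py): full MinHash deduplication pipeline.\n- [sentence_deduplication.py](examples\u002Fsentence_deduplication.py): example of sentence-level exact deduplication.\n- [exact_substrings.py](examples\u002Fexact_substrings.py): ExactSubstr example (requires [this repo](https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fdeduplicate-text-datasets)).\n- [finephrase.py](examples\u002Finference\u002Ffinephrase.py): standalone example that generates synthetic datasets at scale using multiple prompt templates.\n\n## Terminology\n- `pipeline`: a list of processing steps to execute (read data, filter, write to disk, etc.).\n- `executor`: runs a specific pipeline on a given execution environment (Slurm, a multi-CPU machine, etc.).\n- `job`: the execution of a pipeline on a given executor.\n- `task`: a `job` is made up of multiple `tasks`, which are used to parallelize execution, usually by having each `task` process one `shard` of data. Datatrove keeps track of which tasks have completed, and when you relaunch, only the incomplete tasks will run.\n- `file`: an individual input file (.json, .csv, etc.).\n> [!TIP]\n> Note that each file will be processed by a single `task`. Datatrove does not automatically split a file into multiple parts, so to fully parallelize you should have many medium-sized files rather than a single large file.\n- `shard`: a group of input data (usually a group of `files`) that is assigned to a specific `task`. Each `task` processes a different, non-overlapping `shard` of data out of all the input files.\n- `worker`: a computing resource that executes a single task at a time. For example, if you have 50 CPU cores, you can run a LocalPipelineExecutor with `workers=50` to execute 50 `tasks` simultaneously (one per CPU). Once a `worker` is done with a `task`, it starts working on the next waiting `task`.\n\n> [!TIP]\n> Your number of `tasks` controls how much you can parallelize and also how long each individual processing unit takes. If you have a small number of tasks (so each of them has to process a large number of files) and they fail, you will have to restart from scratch, whereas with more, smaller tasks each failed task takes much less time to rerun.\n\n> [!CAUTION]\n> If your number of `tasks` is larger than your number of `files`, some tasks will not process any data, so there is usually no point in setting `tasks` higher than the number of `files`.\n\n### Example\nRunning a `job` that processes **10000** `files` on a machine with **100** CPU cores (`workers`). If we choose to use **1000** `tasks`, each `task` will process a `shard` of 10 files. `workers=100` means that we can process **100** `tasks` at a time.\n\n## Pipeline\n### DataTrove Document\nEach pipeline block processes data in datatrove's 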
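[`Document`](src\u002Fdatatrove\u002Fdata.py) format:\n- `text`: the actual text content of each sample.\n- `id`: a unique id (string) for this sample.\n- `metadata`: a dictionary where any additional information may be stored.\n\n### Types of pipeline blocks\nEach pipeline block takes a generator of `Document` as input and returns another generator of `Document`.\n- **[readers](src\u002Fdatatrove\u002Fpipeline\u002Freaders)** read data from different formats and yield `Document`\n- **[writers](src\u002Fdatatrove\u002Fpipeline\u002Fwriters)** save `Document` to disk or cloud storage in different formats\n- **[extractors](src\u002Fdatatrove\u002Fpipeline\u002Fextractors)** extract text content from raw formats (such as webpage HTML)\n- **[filters](src\u002Fdatatrove\u002Fpipeline\u002Ffilters)** filter out (remove) some `Document`s based on specific rules or criteria\n- **[stats](src\u002Fdatatrove\u002Fpipeline\u002Fstats)** blocks to collect statistics on the dataset\n- **[tokens](src\u002Fdatatrove\u002Fpipeline\u002Ftokens)** blocks to tokenize data or count tokens\n- **[dedup](src\u002Fdatatrove\u002Fpipeline\u002Fdedup)** blocks for deduplication\n### Full pipeline\nA pipeline is defined as a list of pipeline blocks. As an example, the following pipeline reads data from disk, randomly filters (removes) some documents and writes them back to disk:\n```python\nfrom datatrove.pipeline.readers import CSVReader\nfrom datatrove.pipeline.filters import SamplerFilter\nfrom datatrove.pipeline.writers import JsonlWriter\n\npipeline = [\n    CSVReader(\n        data_folder=\"\u002Fmy\u002Finput\u002Fpath\"\n    ),\n    SamplerFilter(rate=0.5),\n    JsonlWriter(\n        output_folder=\"\u002Fmy\u002Foutput\u002Fpath\"\n    )\n]\n```\n\n## Executors\nPipelines are platform-agnostic, which means that the same pipeline can run on different execution environments without any changes to its steps. Each environment has its own pipeline executor.\n\nSome options common to all executors:\n- `pipeline`: a list of the pipeline steps that should be run\n- `logging_dir`: a datafolder where log files, statistics and more should be saved. Do not reuse the same folder for different pipelines or jobs, as this will overwrite your stats, logs and completion records.\n- `skip_completed` (bool, `True` by default): datatrove keeps track of completed tasks so that they can be skipped when you relaunch a job. Set this to `False` to disable this behaviour.\n- `randomize_start_duration` (int, `0` by default): the maximum number of seconds to delay the start of each task, to prevent all tasks from starting simultaneously and potentially overloading the system.\n\nCall an executor's `run` method to execute its pipeline.\n\n\n> [!TIP]\n> Datatrove keeps track of which tasks completed successfully by creating a marker (an empty file) in the `${logging_dir}\u002Fcompletions` folder. Once the job finishes, if some of its tasks have failed, you can **simply relaunch the exact same executor** and datatrove will check and only run the tasks that were not previously completed.\n\n> [!CAUTION]\n> If you relaunch a pipeline because some tasks failed, **do not change the total number of tasks**, as this will affect the distribution of input files and the sharding.\n\n\n\n### LocalPipelineExecutor\nThis executor will launch a pipeline on a local machine. Options:\n- `tasks`: total number of tasks to run\n- `workers`: how many tasks to run simultaneously. If `-1`, no limit. Any value greater than `1` will use multiprocessing to execute the tasks.\n- `start_method`: method used to spawn the multiprocessing pool. Ignored if `workers` is 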
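`1`.\n\n\u003Cdetails>\n  \u003Csummary>Example executor\u003C\u002Fsummary>\n\n```python\nfrom datatrove.executor import LocalPipelineExecutor\nexecutor = LocalPipelineExecutor(\n    pipeline=[\n        ...\n    ],\n    logging_dir=\"logs\u002F\",\n    tasks=10,\n    workers=5\n)\nexecutor.run()\n```\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Multi-node parallelism\u003C\u002Fsummary>\n\nYou can have different nodes or machines process different parts of the total tasks by using the `local_tasks` and `local_rank_offset` parameters. For each node, instance or machine, launch with the following options:\n- `tasks`: total number of tasks to be executed (across all machines). **This value must be the same on every machine, or the input file distribution may overlap!** Example: 500\n- `local_tasks`: how many of the total tasks will be executed on this particular machine. Note that you can use a different value for each machine. Example: 100\n- `local_rank_offset`: the rank of the first task to be executed on this machine. If this is the 3rd machine where you are launching a job, and the 2 previous machines ran 250 and 150 tasks respectively, this would be `400` for the current machine.\n\nTo get the final merged stats you will have to invoke the `merge_stats` script manually on a path containing the stats from all machines.\n\u003C\u002Fdetails>\n\n### SlurmPipelineExecutor\nThis executor will launch a pipeline on a Slurm cluster, using Slurm job arrays to group and manage tasks.\nOptions:\n- `tasks` total number of tasks to run. **required**\n- `time` Slurm time limit string. **required**\n- `partition` Slurm partition. **required**\n- `workers` how many tasks to run simultaneously. If `-1`, no limit. Slurm will run `workers` tasks at a time. (default: `-1`)\n- `job_name` Slurm job name (default: `\"data_processing\"`)\n- `depends` another SlurmPipelineExecutor instance, which will be a dependency of this pipeline (the current pipeline will only start executing after the depended-on pipeline completes successfully)\n- `sbatch_args` dictionary with any other arguments you would like to pass to sbatch\n- `slurm_logs_folder` where to save the Slurm log files. If `logging_dir` is a local path, they will be saved in `logging_dir\u002Fslurm_logs`. Otherwise, they will be saved in a subdirectory of the current directory.\n\u003Cdetails>\n  \u003Csummary>Other options\u003C\u002Fsummary>\n\n- `cpus_per_task` how many CPUs to give each task (default: `1`)\n- `qos` Slurm QoS (default: `\"normal\"`)\n- `mem_per_cpu_gb` memory per CPU, in GB (default: `2`)\n- `env_command` custom command to activate a Python environment (if needed)\n- `condaenv` conda environment to activate\n- `venv_path` path to a Python environment to activate\n- `max_array_size` the _MaxArraySize_ value in `$ scontrol show config`. If the number of tasks exceeds this value, it will be split into multiple array jobs (default: `1001`)\n- `max_array_launch_parallel` if multiple jobs are needed because of max_array_size, whether to launch them all in parallel or sequentially (default: `False`)\n- `stagger_max_array_jobs` when max_array_launch_parallel is True, how many seconds to wait between launching each of the parallel jobs (default: `0`)\n- `run_on_dependency_fail` 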
start executing once the job we depend on finishes, even if it has failed (default: `False`)\n- `randomize_start` randomize the start of each task in a job within a window of about 3 minutes. Useful, for example, when heavily hitting an S3 bucket. (default: `False`)\n\u003C\u002Fdetails>\n\u003Cdetails>\n  \u003Csummary>Example executor\u003C\u002Fsummary>\n\n```python\nfrom datatrove.executor import SlurmPipelineExecutor\nexecutor1 = SlurmPipelineExecutor(\n    pipeline=[\n        ...\n    ],\n    job_name=\"my_cool_job1\",\n    logging_dir=\"logs\u002Fjob1\",\n    tasks=500,\n    workers=100,  # omit to run all tasks at once\n    time=\"10:00:00\",  # 10 hours\n    partition=\"hopper-cpu\"\n)\nexecutor2 = SlurmPipelineExecutor(\n    pipeline=[\n        ...\n    ],\n    job_name=\"my_cool_job2\",\n    logging_dir=\"logs\u002Fjob2\",\n    tasks=1,\n    time=\"5:00:00\",  # 5 hours\n    partition=\"hopper-cpu\",\n    depends=executor1  # this pipeline will only be launched after executor1 successfully completes\n)\n# executor1.run()\nexecutor2.run() # this will actually launch executor1, as it is a dependency, so no need to launch it explicitly\n```\n\u003C\u002Fdetails>\n\n### RayPipelineExecutor\nThis executor will launch a pipeline on a Ray cluster, using Ray tasks for parallel execution.\nOptions:\n- `tasks` total number of tasks to run.\n- `workers` how many tasks to run simultaneously. If `-1`, no limit. Ray will run `workers` tasks at a time. (default: `-1`)\n- `depends` another RayPipelineExecutor instance, which will be a dependency of this pipeline (the current pipeline will only start executing after the depended-on pipeline completes successfully)\n\u003Cdetails>\n  \u003Csummary>Other options\u003C\u002Fsummary>\n\n- `cpus_per_task` how many CPUs to give each task (default: `1`)\n- `mem_per_cpu_gb` memory per CPU, in GB (default: `2`)\n- `ray_remote_kwargs` additional keyword arguments to pass to the ray.remote decorator\n\u003C\u002Fdetails>\n\u003Cdetails>\n  \u003Csummary>Example executor\u003C\u002Fsummary>\n\n```python\nimport ray\nfrom datatrove.executor import RayPipelineExecutor\nray.init()\nexecutor = RayPipelineExecutor(\n    pipeline=[\n        ...\n    ],\n    logging_dir=\"logs\u002F\",\n    tasks=500,\n    workers=100,  # omit to run all tasks at once\n)\nexecutor.run()\n```\n\u003C\u002Fdetails>\n\n## Logging\nFor a pipeline with `logging_dir` **mylogspath\u002Fexp1**, the following folder structure would be created:\n\n\u003Cdetails>\n  \u003Csummary>See folder structure\u003C\u002Fsummary>\n\n```\n└── mylogspath\u002Fexp1\n    │── executor.json ⟵ JSON dump of the executor options and pipeline steps\n    │── launch_script.slurm ⟵ the Slurm config created and used to launch this job (if running on 
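Slurm)\n    │── executor.pik ⟵ the pickled executor object loaded by the Slurm tasks (if running on Slurm)\n    │── ranks_to_run.json ⟵ list of tasks that are being run\n    │── logs\u002F\n    │   └──[task_00000.log, task_00001.log, task_00002.log, ...] ⟵ individual log files for each task\n    │── completions\u002F\n    │   └──[00004, 00007, 00204, ...] ⟵ empty files marking a task as completed. Used when relaunching or resuming a job (only unfinished tasks will be run)\n    │── stats\u002F\n    │   └──[00000.json, 00001.json, 00002.json, ...] ⟵ individual stats for each task (number of samples processed, filtered, removed, etc.)\n    └── stats.json ⟵ global stats from all tasks\n```\n\u003C\u002Fdetails>\n\n### Colorization\nLog messages support colorization. By default, colorization is auto-detected for console messages and disabled for log files (logs\u002Ftask_XXXXX.log).\nTo explicitly enable or disable colorization, you may set the following environment variables:\n- `DATATROVE_COLORIZE_LOGS` set to `\"1\"` to add ANSI colors to console log messages, or to `\"0\"` to disable colorization.\n- `DATATROVE_COLORIZE_LOG_FILES` set to `\"1\"` to add ANSI colors to the log messages saved to logs\u002Ftask_XXXXX.log.\n\n## DataFolder \u002F paths\nDatatrove supports a wide variety of input\u002Foutput sources through [fsspec](https:\u002F\u002Ffilesystem-spec.readthedocs.io\u002Fen\u002Flatest\u002F).\n\nThere are a few ways to provide a path to a datatrove block (for arguments such as `input_folder`, `logging_dir` and `data_folder`):\n- `str`: the simplest way is to pass a single string. Example: `\u002Fhome\u002Fuser\u002Fmydir`, `s3:\u002F\u002Fmybucket\u002Fmyinputdata`, `hf:\u002F\u002Fdatasets\u002Fallenai\u002Fc4\u002Fen\u002F`\n\n- `(str, fsspec filesystem instance)`: a string path and a fully initialized filesystem object. Example: `(\"s3:\u002F\u002Fmybucket\u002Fmyinputdata\", S3FileSystem(client_kwargs={\"endpoint_url\": endpoint_uri}))`\n- `(str, dict)`: a string path and a dictionary with options to initialize a filesystem. Example (equivalent to the previous line): `(\"s3:\u002F\u002Fmybucket\u002Fmyinputdata\", {\"client_kwargs\": {\"endpoint_url\": endpoint_uri}})`\n- `DataFolder`: you can initialize a [DataFolder](src\u002Fdatatrove\u002Fio.py) object directly and pass it as an argument.\n\nUnder the hood these argument combinations are parsed by [`get_datafolder`](src\u002Fdatatrove\u002Fio.py#116).\n\n## Practical guides\n\n### Reading data\nUsually, pipelines will start with a [Reader](src\u002Fdatatrove\u002Fpipeline\u002Freaders) block. Most readers take a `data_folder` argument: a path to a folder containing the data to be read.\n\nThese files will be distributed across the tasks. If you have `N` tasks, the task with rank `i` (0-based) will process files `i, i+N, i+2N, i+3N,...`.\n\nInternally, each reader reads data and converts it into a dictionary before creating a `Document` object.\n\nSome options common to most readers:\n- `text_key`: the dictionary key containing the text content of each sample. Default: `text`.\n- 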
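`id_key`: the dictionary key containing the id of each sample. Default: `id`.\n- `default_metadata`: a dictionary for any default metadata values you would like to add (such as the source).\n- `recursive`: whether to look for files recursively in the subdirectories of `data_folder`.\n- `glob_pattern`: use this field to match specific files. For instance, `glob_pattern=\"*\u002Fwarc\u002F*.warc.gz\"` will match all files with a `.warc.gz` extension in the `warc\u002F` folder of each subdirectory of `data_folder`.\n- `adapter`: this function takes the raw dictionary obtained from the reader and returns a dictionary with `Document`'s field names. You may override this function ([_default_adapter](src\u002Fdatatrove\u002Fpipeline\u002Freaders\u002Fbase.py)) if you need to.\n- `limit`: read only a certain number of samples. Useful for testing and debugging.\n\n### Synthetic data generation\nInstall the inference extras with `pip install datatrove[inference]` to pull in the lightweight HTTP client, the checkpointing dependencies and the async SQLite cache.\n\nWe support [vLLM](https:\u002F\u002Fgithub.com\u002Fvllm-project\u002Fvllm), [SGLang](https:\u002F\u002Fgithub.com\u002Fsgl-project\u002Fsglang), OpenAI-compatible HTTPS endpoints and a local `dummy` server through the [InferenceRunner block](src\u002Fdatatrove\u002Fpipeline\u002Finference\u002Frun_inference.py). Each datatrove task can spin up its own server replica (for vllm, sglang or dummy) or talk directly to an external endpoint, while asynchronous batching keeps GPU utilization high.\n\n#### Custom rollouts\n\nThe core abstraction is a **rollout function**: a plain async callable that receives a `Document`, a `generate(payload)` callback and any extra keyword arguments coming from `shared_context`. You can freely orchestrate multiple sequential or parallel `generate` calls inside the rollout. This gives you full control over how prompts are built and how generations are combined. See [inference_chunked.py](examples\u002Finference\u002Finference_chunked.py), which includes examples of:\n- a simple single-request rollout;\n- a chunked rollout that splits long documents and stitches the generations back together;\n- CPU-heavy preprocessing in a process pool, made available through `shared_context`;\n- multi-node distributed inference.\n\nSet `rollouts_per_document` to automatically run the same rollout multiple times per sample; the runner collects the successful outputs under `document.metadata[\"rollout_results\"]`.\n\n#### Ready-to-use generation script\nFor a ready-to-use script for large-scale synthetic data generation (supporting models from 1B to 1T parameters, local\u002FSLURM execution and multi-node setups), see [`generate_data.py`](examples\u002Finference\u002Fgenerate_data.py). The script handles prompt-based generation with configurable system prompts and templates.\n\n#### Advanced configuration\n`shared_context` lets you inject shared state into every rollout invocation. It accepts:\n- a dict (passed as keyword arguments);\n- a callable returning a dict (handy for lazily creating resources);\n- a context manager, or a callable returning one (ideal for pools, GPU allocators, temporary directories, etc.). Context managers are properly entered and exited once per task.\n\nResumable generation:\n- Setting `checkpoints_local_dir` together with `records_per_chunk` writes every `Document` to local chunk files (remember to include 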
`${chunk_index}` in the output filename template), then uploads them via the configured writer. Failed tasks automatically resume from the last finished chunk.\n- When checkpointing is enabled, a SQLite-backed `RequestCache` deduplicates individual rollouts via payload hashes (requires `xxhash` and `aiosqlite`), so completed generations are never re-sent during retries.\n- Setting `skip_bad_requests=True` on `InferenceRunner` skips provider-side `BadRequestError`s (e.g. context\u002Fwindow overflows) and lets the remaining documents keep running.\n\nTune batching with `max_concurrent_generations` and, when pre- or post-processing is heavy, raise `max_concurrent_documents` so that more rollout coroutines can build payloads while requests are in flight.\n\n\u003Cdetails>\n  \u003Csummary>Minimal end-to-end example\u003C\u002Fsummary>\n\n  ```\n  from datatrove.data import Document\n  from datatrove.executor.local import LocalPipelineExecutor\n  from datatrove.pipeline.inference.run_inference import InferenceConfig, InferenceRunner\n  from datatrove.pipeline.writers import JsonlWriter\n\n  async def simple_rollout(doc: Document, generate):\n      payload = {\"messages\": [{\"role\": \"user\", \"content\": [{\"type\": \"text\", \"text\": doc.text}]}], \"max_tokens\": 2048}\n      return await generate(payload)\n\n  documents = [Document(text=\"What's the weather in Tokyo?\", id=str(i)) for i in range(1005)]\n  config = InferenceConfig(server_type=\"vllm\", model_name_or_path=\"google\u002Fgemma-3-27b-it\", rollouts_per_document=1, max_concurrent_generations=500)\n\n  LocalPipelineExecutor(\n      pipeline=[\n          documents,\n          InferenceRunner(\n              rollout_fn=simple_rollout,\n              config=config,\n              skip_bad_requests=True,\n              records_per_chunk=500,\n              checkpoints_local_dir=\"\u002Ffsx\u002F...\u002Ftranslate-checkpoints\",\n              output_writer=JsonlWriter(\"s3:\u002F\u002F...\u002Ffinal_output_data\", output_filename=\"${rank}_chunk_${chunk_index}.jsonl\"),\n          ),\n      ],\n      logging_dir=\"\u002Ffsx\u002F...\u002Finference_logs\",\n      tasks=1,\n  ).run()\n  ```\n\u003C\u002Fdetails>\n\nThe extended [inference_chunked.py](examples\u002Finference\u002Finference_chunked.py) script demonstrates single- and multi-rollout flows, resumable checkpoints and sharing a process pool across rollouts.\n\n#### Progress monitoring\nFor long-running inference jobs, you can use 
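`InferenceProgressMonitor` to periodically update a HuggingFace dataset card with a progress bar and ETA. After inference completes, `InferenceDatasetCardGenerator` creates a final dataset card with statistics.\n\n```python\nfrom datatrove.pipeline.inference import InferenceDatasetCardParams, InferenceProgressMonitor, InferenceDatasetCardGenerator\n\nparams = InferenceDatasetCardParams(\n    output_repo_id=\"your-username\u002Foutput-dataset\",\n    input_dataset_name=\"simplescaling\u002Fs1K-1.1\",\n    input_dataset_split=\"train\",\n    model_name=\"Qwen\u002FQwen3-0.6B\",\n    # ... other params\n)\n\n# Monitor pipeline (runs in parallel with inference on Slurm)\nmonitor_pipeline = [InferenceProgressMonitor(params=params, update_interval=3600)]\n\n# Final card generation (runs after inference completes)\ndatacard_pipeline = [InferenceDatasetCardGenerator(params=params)]\n```\n\nSee [progress_monitoring.py](examples\u002Finference\u002Fprogress_monitoring.py) for a complete example with Slurm integration.\n\n#### Benchmarking\n\nTo measure vLLM throughput across different models and configurations (TP, PP, speculative decoding), use the [benchmark tools](examples\u002Finference\u002Fbenchmark\u002FREADME.md). The benchmark suite provides:\n- **`launch_experiments.py`**: launches sweep experiments from a YAML config with automatic Slurm job submission\n- **`analyze_results.py`**: parses server logs and generates CSV summaries with metrics such as tokens\u002Fs per GPU and GPU-days to process 1B tokens\n\n### Extracting text\nYou can use [extractors](src\u002Fdatatrove\u002Fpipeline\u002Fextractors) to extract text content from raw HTML. The most commonly used extractor in datatrove is [Trafilatura](src\u002Fdatatrove\u002Fpipeline\u002Fextractors\u002Ftrafilatura.py), which uses the [trafilatura](https:\u002F\u002Ftrafilatura.readthedocs.io\u002Fen\u002Flatest\u002F) library.\n\n### Filtering data\n[Filters](src\u002Fdatatrove\u002Fpipeline\u002Ffilters) are some of the most important blocks of any data processing pipeline. Datatrove's filter blocks take a `Document` and return a boolean (`True` to keep a document, `False` to remove it). Removed samples do not continue to the next pipeline stage. You can also save the removed samples to disk by passing a [Writer](src\u002Fdatatrove\u002Fpipeline\u002Fwriters) to the `exclusion_writer` parameter.\n\n### Saving data\nOnce you are done processing your data you will probably want to save it somewhere. For this you can use a [writer](src\u002Fdatatrove\u002Fpipeline\u002Fwriters\u002Fjsonl.py).\nWriters require an `output_folder` (the path where data should be saved). You can choose the `compression` to use (default: `gzip`) and the filename to save each file as.\n\nFor the `output_filename`, a template is applied using the following arguments:\n- `${rank}` replaced with the current task's rank. Note that if this tag isn't present, **different tasks may try to write to the same location**\n- `${id}` replaced with the sample id\n- 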
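metadata: any other `${tag}` will be replaced with the corresponding `document.metadata['tag']` value\n\nAn example to separate samples by language based on their `language` metadata field:\n```python\nJsonlWriter(\n    f\"{MAIN_OUTPUT_PATH}\u002Fnon_english\u002F\",\n    output_filename=\"${language}\u002F\" + DUMP + \"\u002F${rank}.jsonl.gz\",  # folder structure: language\u002Fdump\u002Ffile\n)\n```\n\n### Deduplicating data\nFor deduplication, see the examples [minhash_deduplication.py](examples\u002Fminhash_deduplication.py), [sentence_deduplication.py](examples\u002Fsentence_deduplication.py) and [exact_substrings.py](examples\u002Fexact_substrings.py).\n\n### Summary Statistics\nFor summary statistics on your data you can use the [Stats](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Ftree\u002Fmain\u002Fsrc\u002Fdatatrove\u002Fpipeline\u002Fstats) blocks. These blocks provide an easy way to collect data profiles on your dataset in a distributed manner. It's a two-step process:\n1) For each shard, iterate over documents and collect stats into one of the following groupings: `summary` (all docs counted to the \"summary\" key), `fqdn` (fully qualified domain name grouping), `suffix` (the last part of the url path grouping) or `histogram` (value based grouping).\n2) Merge the stats from different shards into a single file.\nSee [summary_stats.py](examples\u002Fsummary_stats.py) for more details.\n\nEach resulting stat is saved in a separate file with the following structure: `output_folder\u002F{fqdn,suffix,summary,histogram}\u002F{stat_name}\u002Fmetric.json`\n\nEach such file is a `MetricStatsDict` object, which you can easily load using:\n```python\nfrom datatrove.pipeline.stats.summary_stats import MetricStatsDict\nimport json\nstats = MetricStatsDict.from_dict(json.load(open(\"fqdn\u002Flength\u002Fmetric.json\")))\n\n# E.g. for the total length of nytimes.com docs\nstats[\"nytimes.com\"].total\n\n# Or for the mean length of cnn.com docs\nstats[\"cnn.com\"].mean\n```\n\nThe following stats are available:\n- `contamination_stats.py`: `word_contamination_{words[0]}`: Frequency of contamination words in the document.\n- `doc_stats.py`: `length`: Length of the document, `white_space_ratio`: Ratio of whitespace characters, `non_alpha_digit_ratio`: Ratio of non-alphabetic and non-digit characters, `digit_ratio`: Ratio of digits, `uppercase_ratio`: Ratio of uppercase letters, `elipsis_ratio`: Ratio of ellipsis characters, `punctuation_ratio`: Punctuation ratio\n- `lang_stats.py`: `fasttext_{language}`: Score of the document being written in `language`. The score is computed using a FastText model.\n- `line_stats.py`: `n_lines`: Number of lines per doc, `avg_line_length`: Average length of line per doc, `long_line_ratio_chars_{chars}`: Ratio of lines with more than k chars, `short_line_ratio_chars_{chars}`: Ratio of lines with less than k 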
chars, `bullet_point_lines_ratio`: Ratio of lines starting with a bullet point, `line_duplicates`: Ratio of lines that are duplicates, `line_char_duplicates`: Ratio of chars in duplicated lines to all chars.\n- `paragraph_stats.py`: `n_paragraphs`: Number of paragraphs, `avg_paragraph_length`: Average paragraph length, `short_paragraph_ratio_{chars}`: Ratio of short paragraphs (`\u003C{chars}` chars), `long_paragraph_ratio_{chars}`: Ratio of long paragraphs (`>{chars}` chars)\n- `perplexity_stats.py`: `ccnet_perplexity_{model_dataset}_{language}`: Perplexity of the document using the CCNet model on `{dataset}` in `{language}`\n- `sentence_stats.py`: `n_sentences`: Number of sentences, `avg_sentence_length`: Average sentence length, `short_sentence_ratio_{chars}`: Ratio of short sentences (`\u003C{chars}` chars), `long_sentence_ratio_{chars}`: Ratio of long sentences (`>{chars}` chars)\n- `token_stats.py`: `token_count`: Number of tokens in the document\n- `word_stats.py`: `n_words`: Number of words in the document, `avg_word_length`: Average length of words in the document, `avg_words_per_line`: Average number of words per line in the document, `short_word_ratio_{chars}`: Ratio of words shorter than `{chars}` characters, `stop_word_ratio`: Ratio of stop words, `long_word_ratio_{chars}`: Ratio of words longer than `{chars}` characters, `type_token_ratio`: Number of unique words \u002F Number of tokens, `capitalized_word_ratio`: Ratio of capitalized words, `uppercase_word_ratio`: Ratio of uppercase words\n\n### Custom blocks\n\n#### Simple data\nYou can pass an iterable of `Document` directly as a pipeline block like so:\n```python\nfrom datatrove.data import Document\nfrom datatrove.pipeline.filters import SamplerFilter\nfrom datatrove.pipeline.writers import JsonlWriter\n\npipeline = [\n    [\n        Document(text=\"some data\", id=\"0\"),\n        Document(text=\"some more data\", id=\"1\"),\n        Document(text=\"even more data\", id=\"2\"),\n    ],\n    SamplerFilter(rate=0.5),\n    JsonlWriter(\n        output_folder=\"\u002Fmy\u002Foutput\u002Fpath\"\n    )\n]\n```\n\nDo note, however, that this iterable will not be sharded (if you launch more than 1 task they will all get the full iterable). This is usually useful for small workloads or testing.\n\n#### Custom function\nFor simple processing you can pass in a custom function with the following signature:\n```python\nfrom datatrove.data import DocumentsPipeline\n\ndef uppercase_everything(data: DocumentsPipeline, rank: int = 0, world_size: int = 1) -> DocumentsPipeline:\n    \"\"\"\n        `data` is a generator of Document. You must also return a generator of Document (yield)\n        You can optionally use `rank` and `world_size` for sharding\n    \"\"\"\n    for document in data:\n        document.text = 
document.text.upper()\n        yield document\n\npipeline = [\n    ...,\n    uppercase_everything,\n    ...\n]\n```\n> [!TIP]\n> You might run into pickling problems because of imports. If that happens, simply move the imports you need inside the function body.\n\n#### Custom block\nYou can also define a full block that inherits from `PipelineStep` (in src\u002Fdatatrove\u002Fpipeline\u002Fbase.py) or one of its subclasses:\n\n```python\nfrom datatrove.pipeline.base import PipelineStep\nfrom datatrove.data import DocumentsPipeline\nfrom datatrove.io import DataFolderLike, get_datafolder\n\n\nclass UppercaserBlock(PipelineStep):\n    def __init__(self, some_folder: DataFolderLike, some_param: int = 5):\n        super().__init__()\n        # you can take and store whatever parameters you need here\n        self.some_param = some_param\n        # use get_datafolder() to load data folders\n        self.some_folder = get_datafolder(some_folder)\n\n    def run(self, data: DocumentsPipeline, rank: int = 0, world_size: int = 1) -> DocumentsPipeline:\n        # you could also load data from `some_folder`:\n        for filepath in self.some_folder.get_shard(rank, world_size): # it also accepts glob patterns, etc.\n            with self.some_folder.open(filepath, \"rt\") as f:\n                # do something\n                ...\n                yield doc\n\n        #\n        # OR process data coming from previous blocks (`data`)\n        #\n\n        for doc in data:\n            with self.track_time():\n                # you can wrap the main processing code in `track_time` to measure how long each document takes\n                nr_uppercase_letters = sum(map(lambda c: c.isupper(), doc.text))\n                # you can also track per-document stats with stat_update\n                self.stat_update(\"og_upper_letters\", value=nr_uppercase_letters)\n                doc.text = doc.text.upper()\n            # keep the yield outside the track_time block, otherwise it distorts the timing\n            yield doc\n\n        #\n        # OR save data to disk\n        #\n\n        with self.some_folder.open(\"myoutput\", \"wt\") as f:\n            for doc in data:\n                f.write(doc...)\n```\n\n```python\npipeline = [\n    ...,\n    UppercaserBlock(\"somepath\"),\n    ...\n]\n```\n\nYou can also inherit from `BaseExtractor` (in 
src\u002Fdatatrove\u002Fpipeline\u002Fextractors\u002Fbase.py), `BaseFilter` (in src\u002Fdatatrove\u002Fpipeline\u002Ffilters\u002Fbase_filter.py), `BaseReader`\u002F`BaseDiskReader` (in src\u002Fdatatrove\u002Fpipeline\u002Freaders\u002Fbase.py), or `DiskWriter` (in src\u002Fdatatrove\u002Fpipeline\u002Fwriters\u002Fdisk_base.py).\n## Contributing\n\n```bash\ngit clone git@github.com:huggingface\u002Fdatatrove.git && cd datatrove\npip install -e \".[dev]\"\n```\n\nInstall the pre-commit code style hooks:\n```bash\npre-commit install\n```\n\nRun the code style checks:\n```bash\n# Fast local loop (only checks changed Python files)\nmake quality\nmake style\n\n# Whole-repository checks (same scope as CI)\nmake quality-full\nmake style-full\n```\n\nRun the tests:\n```bash\npytest -sv .\u002Ftests\u002F\n```\n\n## Citation\n\n```bibtex\n@misc{penedo2024datatrove,\n  author = {Penedo, Guilherme and Kydlíček, Hynek and Cappelli, Alessandro and Sasko, Mario and Wolf, Thomas},\n  title = {DataTrove: large scale data processing},\n  year = {2024},\n  publisher = {GitHub},\n  journal = {GitHub repository},\n  url = {https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove}\n}\n```","# DataTrove Quick Start Guide\n\nDataTrove is a library for processing, filtering and deduplicating text data at very large scale. It ships a set of prebuilt processing blocks and runs the same pipeline locally or on a Slurm cluster, which makes it well suited to building training-data pipelines for large language models (LLMs).\n\n## Prerequisites\n\n*   **Operating system**: Linux \u002F macOS \u002F Windows\n*   **Python version**: Python 3.10 or later\n*   **Dependencies**:\n    *   Reading specific formats (such as warc or parquet) requires the matching extras.\n    *   Running in a distributed environment (such as Slurm or Ray) requires a configured cluster.\n\n## Installation\n\nInstall with pip. DataTrove uses modular dependencies, so you can install only the feature sets (flavours) you need.\n\n### Basic installation\nCore functionality only:\n```bash\npip install datatrove\n```\n\n### Installing optional flavours\nYou can combine several flavours (comma-separated) or install all dependencies at once:\n\n*   **Install all dependencies** (recommended for newcomers):\n    ```bash\n    pip install datatrove[all]\n    ```\n*   **A common combination** (IO + text processing + S3 support):\n    ```bash\n    pip install datatrove[io,processing,s3]\n    ```\n\n**Available flavours:**\n*   `io`: read warc\u002Farc\u002Fwet and arrow\u002Fparquet formats.\n*   `processing`: text extraction, filtering and tokenization.\n*   `s3`: S3 object storage support.\n*   `ray`: Ray distributed execution engine.\n*   `multilingual`: multilingual text processing.\n\n> 
**Mirror tip for users in mainland China**: if downloads are slow, consider using a domestic mirror:\n> ```bash\n> pip install datatrove[all] -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n## Basic usage\n\nDataTrove is built around two core concepts: the **Pipeline** and the **Executor**.\n*   **Pipeline**: a list of processing blocks (such as readers, filters and writers).\n*   **Executor**: runs the pipeline in a specific environment (locally or on a cluster).\n\nBelow is a minimal local example: read data from CSV, randomly keep 50% of the documents, and save the result as JSONL.\n\n### Code example\n\nCreate a file named `quickstart.py`:\n\n```python\nfrom datatrove.pipeline.readers import CSVReader\nfrom datatrove.pipeline.filters import SamplerFilter\nfrom datatrove.pipeline.writers import JsonlWriter\nfrom datatrove.executor import LocalPipelineExecutor\n\n# 1. Define the processing pipeline\npipeline = [\n    # Read step: read CSV files from the given folder\n    CSVReader(\n        data_folder=\"\u002Fpath\u002Fto\u002Finput\u002Fdata\"\n    ),\n    # Filter step: random sampling, keeps 50% of the data\n    SamplerFilter(rate=0.5),\n    # Write step: save the results as JSONL\n    JsonlWriter(\n        output_folder=\"\u002Fpath\u002Fto\u002Foutput\u002Fdata\"\n    )\n]\n\n# 2. Configure the executor\nexecutor = LocalPipelineExecutor(\n    pipeline=pipeline,\n    logging_dir=\"logs\u002F\",  # where logs and stats are saved\n    tasks=10,             # total number of tasks (tune to the number of files to get parallelism)\n    workers=4             # number of tasks run concurrently (for example, the number of CPU cores)\n)\n\n# 3. 
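Run the pipeline\nexecutor.run()\n```\n\n### Key concepts\n\n*   **Document format**: the unit of data passed through the pipeline holds `text` (the text content), `id` (a unique identifier) and `metadata` (a dictionary of metadata).\n*   **Tasks and workers**:\n    *   `tasks`: how many tasks the input files are split into. Each task processes one group of files (a shard). More tasks improve fault tolerance (only a few tasks need rerunning after a failure).\n    *   `workers`: how many tasks run at the same time. Setting it to the number of CPU cores maximizes local utilization.\n*   **Resuming**: DataTrove records completed tasks automatically. If a run fails midway, rerunning the same code executes only the unfinished tasks instead of starting over.\n\n### Run the script\n\n```bash\npython quickstart.py\n```\n\nWhen the run finishes, the processed data is in `\u002Fpath\u002Fto\u002Foutput\u002Fdata` and the logs and stats are in `logs\u002F`.","An AI lab team is building a trillion-token training dataset for a large language model and needs to process hundreds of terabytes of raw web data crawled from Common Crawl.\n\n### Without datatrove\n- **Script maintenance nightmare**: engineers have to write thousands of lines of complex custom Python scripts to chain decompression, text extraction, filtering and deduplication; the code is tightly coupled and hard to reuse or debug.\n- **Scaling bottleneck**: single-machine processing is extremely slow, and moving the job to a Slurm cluster means rewriting a large amount of distribution and fault-tolerance logic, which takes weeks and is error-prone.\n- **Uncontrolled resource usage**: without optimized memory management, large files frequently trigger out-of-memory (OOM) errors that interrupt tasks and lose intermediate state.\n- **Format friction**: handling WARC, Parquet, Arrow and other input and output formats means integrating each underlying library by hand, which slows development.\n\n### With datatrove\n- **Modular pipelines**: datatrove's prebuilt, standardized processing blocks (such as text extractors and filters) can be assembled into complex flows with simple configuration, replacing patchwork scripts.\n- **Cluster deployment without rewrites**: because the library is platform-agnostic, switching the executor is enough to move the same code from a local run to a Slurm or Ray cluster and scale out.\n- **Efficient, stable runs**: datatrove's multi-stage design and low memory footprint keep large-scale processing stable, which lowers hardware cost and raises throughput.\n- **Unified data interface**: built-in fsspec support lets the team read and write many file formats on local disk, S3 or the Hugging Face Hub without dealing with the underlying details.\n\ndatatrove turns a messy data-cleaning effort into a maintainable, scalable, standardized pipeline, so the team can focus on data strategy rather than infrastructure.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_datatrove_5c82fac8.png","huggingface","Hugging Face","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fhuggingface_90da21a4.png","The AI community building the future.",null,"https:\u002F\u002Fhuggingface.co\u002F","https:\u002F\u002Fgithub.com\u002Fhuggingface",[83,87,91],{"name":84,"color":85,"percentage":86},"Python","#3572A5",97.5,{"name":88,"color":89,"percentage":90},"Rust","#dea584",2.4,{"name":92,"color":93,"percentage":94},"Makefile","#427819",0.1,2975,252,"2026-04-02T07:21:48","Apache-2.0","Linux, macOS, Windows","Not specified","Not specified (the documentation mentions a relatively low memory footprint, suitable for large-scale workloads)",{"notes":103,"python":104,"dependencies":105},"The tool is platform-agnostic and runs locally, on a Slurm cluster, or 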
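distributed on Ray. Core functionality relies on fsspec to support many file systems. Optional flavours (such as io, processing, s3, ray, inference) can be installed as needed. Task parallelism is bounded by the number of files, so split the data into multiple medium-sized files to make full use of multi-core\u002Fmulti-node parallelism. When restarting failed tasks, never change the total number of tasks, as that breaks the data sharding logic.","3.10+",[106,107,108,109,110,111],"fsspec","torch (optional, for inference)","transformers (optional, for multilingual\u002Finference)","ray (optional, for distributed computing)","lighteval (optional, for decont)","boto3\u002Fs3fs (optional, for s3 support)",[15,34],"2026-03-27T02:49:30.150509","2026-04-06T05:35:27.397617",[116,121,126,131,136,141],{"id":117,"question_zh":118,"answer_zh":119,"source_url":120},13658,"When running large-scale deduplication with RayPipeline I hit Hashorder errors or run out of memory. What can I do?","If adding memory does not solve the problem, try switching to the multi-node mode of LocalPipeline. In later versions the maintainers also added extra checks to MinHash stage 1 and improved the Ray executor. For large-scale deduplication, tuning is mainly done through `num_buckets` and `hashes_per_bucket`, but the best values have to be found heuristically for your data distribution.","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fissues\u002F397",{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},13659,"Why does Datatrove's Minhash deduplication rate differ from other implementations (such as Spark-based ones)?","The difference usually comes from how n-grams are computed. For example, some implementations use a `min_length` parameter, while Datatrove's logic may differ. Rather than comparing file sizes, count the documents actually kept to assess the difference. If the gap is very large, the other implementation (for example Spark's GraphFrame) may have a bug; Datatrove's implementation is usually correct.","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fissues\u002F107",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},13660,"The URL filter fails with tarfile extraction errors or race conditions under high multi-node concurrency. How can this be fixed?","This is a download\u002Fextraction race condition caused by a large number of concurrent tasks. As a workaround, reduce concurrency (for example, test on a single node) or retry the pipeline. Root-cause fixes include: 1) pre-downloading the data when the instance is initialized; 2) adding a lockfile mechanism that checks whether `\u002Ftmp\u002F{id}\u002F{id_of_download}` exists before downloading and, if it does, waits until the write has finished, so that several processes never operate on the same file at once.","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fissues\u002F140",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},13661,"The MinhashDedupCluster stage is too slow. Is there a way to speed it up?","By default the clustering stage may use a single CPU to build the union, which makes it slow. The main bottleneck is currently algorithmic complexity, but it is confirmed that index documents usually sit on the left because of the priority queue and are treated as parent nodes, so they are not removed. If you need cluster size information, you can use the `save_cluster_sizes` option in Python (Rust 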
implementation: enabled by default).","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fissues\u002F106",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},13662,"The data processing pipeline terminates with a `std::length_error: basic_string::_S_create` error. What causes it?","This error is usually triggered by spam texts of more than 26,000 characters with no spaces, which make a memory allocation fail. It is a problem in an underlying dependency (such as kiwipiepy). Either wait for that library to release a fix, or filter out such extremely long space-free texts during preprocessing.","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fissues\u002F279",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},13663,"Minhash deduplication fails with an unknown multiprocess pool error. What could be the cause?","Errors of this kind are usually related to pickling problems or resource contention in the multiprocessing pool. The stack trace may be truncated, but common causes include inconsistent input data formats, failures reading signature files, or environment variable misconfiguration. Check the integrity of the input data and make sure every node has the same environment. If the problem persists, try reducing the number of parallel workers to determine whether it is a concurrency-related resource conflict.","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fissues\u002F157",[147,152,157,162,167,172,177,182,187],{"id":148,"version":149,"summary_zh":150,"released_at":151},72521,"v0.9.0","## What's Changed\n* Fix CI test hangs in the inference pipeline by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F420\n* Remove the 'file_path' field from document metadata when saving checkpoints by @lewtun in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F412\n* Speed up benchmark submission by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F422\n* Unify path conventions by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F423\n* Add benchmark mode by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F424\n* Truncate context by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F425\n* Mention optimized-parquet in the README by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F421\n* Fix wrong dataset name in tests by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F427\n* Update README and dependencies by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F428\n* By @JoelNiklaus in 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F426: fix vLLM cache corruption on shared file systems\n* Miscellaneous adjustments to the inference benchmarks by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F429\n* Further speed up benchmark submission by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F430\n* Improve benchmark analysis by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F431\n* Fix ModuleNotFoundError when unpickling the pipeline in SLURM jobs by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F432\n* Extend the benchmark framework by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F433\n* Simplify the analysis flow by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F434\n* Add vLLM server metrics to benchmark analysis by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F435\n* Improve benchmark reliability and add new features by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F436\n* Optimize the benchmark infrastructure by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F437\n* Fix memory unit mismatch in the distributed Ray helper by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F439\n* Clean up dependencies by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F440\n* Improve the benchmark user experience by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F441\n* Track inference time by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F442\n* Add a token estimation script for large Hugging Face datasets by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F444\n* Final benchmark adjustments by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F445\n* By @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F449: add smol_data 
example for the 100B dataset workflow\n* Add skip_bad_requests option to InferenceRunner by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F450\n* By @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fh","2026-03-04T13:50:46",{"id":153,"version":154,"summary_zh":155,"released_at":156},72522,"v0.8.0","## What's Changed\n* Introduce Finepdfs by @hynky1999 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F402\n* Optimize the parquet format by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F414\n* Add inference dataset card generator by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F415\n* Enable boolean keyword arguments by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F417\n* Add progress monitoring and example scripts by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F416\n* Upgrade GitHub Actions for Node 24 compatibility by @salmanmkc in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F411\n* Add inference benchmarking tool by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F418\n* Fix CI\u002FCD issues by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F419\n* Add `slots=True` to dataclasses by @Rolv-Arild in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F405\n\n## New Contributors\n* @lhoestq made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F414\n* @salmanmkc made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F411\n* @Rolv-Arild made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F405\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fcompare\u002Fv0.7.0...v0.8.0","2026-01-19T16:03:50",{"id":158,"version":159,"summary_zh":160,"released_at":161},72523,"v0.7.0","## What's Changed\n* By @omahs in 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F380: fix typos\n* filters: fix language variable shadowing in _get_badwords of C4BadWordsFilter (fixes #377) by @dipampaul17 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F379\n* chore: fix a broken link in the summary_stats docs by @Olexandr88 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F382\n* [BUG fix] Launching a dependent `LocalPipelineExecutor` with `skip_completed=False` made the program wait forever, by @silverriver in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F300\n* Allow postprocess_fn to take self as an argument by @JoelNiklaus in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F391\n* Fix typos by @DeVikingMark in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F386\n* docs: fix a broken link in the stats docs by @Olexandr88 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F383\n* Add support for loading HF datasets from disk by @iamgroot42 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F385\n* Ensure consistent use of folder_path by @hynky1999 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F366\n* Fix #388 by @zinccat in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F389\n* Fix sentinel condition in Rust mh3 by @jordane95 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F394\n* Inference runner bug fixes, warnings and callback option by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F395\n* General bug fixes by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F396\n* Add extra validation after minhash step 1 by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F404\n* Inference runner refactor: rollouts, generation parameters and more by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F398\n* Fix missing parameter by @shallyan in 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F407\n* Multi-node distributed inference support by @hynky1999 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F406\n* Small Ray improvements by @hynky1999 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F403\n* Fix Parquet last-batch issue and Slurm srun options by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F409\n\n## New Contributors\n* @omahs made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F380\n* @dipampaul17 made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F379\n* @Olexandr88 made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F382\n* @DeVikingMark made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F386\n* @iamgroot42 made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F385\n* @zinccat made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F389\n* @shallyan made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F407\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fcompare\u002Fv0.6.0...v0.7.0","2026-01-19T14:19:27",{"id":163,"version":164,"summary_zh":165,"released_at":166},72524,"v0.6.0","## What's Changed\n* Fix the type annotations of `stop_chars` and `exclusion_writer` in `FineWebQualityFilter` by @LeMoussel in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F369\n* Update the Ray documentation by @Tavish9 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F371\n* Add inference with vLLM and SGLang by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F378\n\n## New Contributors\n* @LeMoussel made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F369\n* @Tavish9 in 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F371 (first contribution)\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fcompare\u002Fv0.5.0...v0.6.0","2025-08-07T19:03:58",{"id":168,"version":169,"summary_zh":170,"released_at":171},72525,"v0.5.0","## What's Changed\n* Fix by @kylematoba in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F314\n* Change the FTFY defaults by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F319\n* Add Megatron tokenization pipeline by @TJ-Solergibert in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F304\n* Add `job_id_position` parameter to the `launch_slurm_job` method by @StephenRebel in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F282\n* `load_tokenizer` can now load a local Hugging Face folder, by @ceferisbarov in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F306\n* Add glob pattern for hash indexes by @jordane95 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F313\n* fix(utils): enhance the dependency check to include pip distribution packages by @aiqwe in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F317\n* Update README.md by @saforem2 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F323\n* Fix URL deduplication when using an index by @muzzynine in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F327\n* Add a custom way to get the SLURM job ID by @BramVanroy in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F320\n* Fix the stop words implementation by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F329\n* Allow a custom Parquet schema by @BramVanroy in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F330\n* [Draft] Add chunking option to DocumentTokenizer by @craffel in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F342\n* By @guipenedo in 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F343: revert '[Draft] Add chunking option to DocumentTokenizer'\n* fix: root condition of SENTINEL by @jordane95 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F349\n* Fix finemath metadata parsing by @VivienCabannes in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F355\n* Increase the OOM score and shorten the polling interval by @hynky1999 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F361\n* Resolve issue 308 by @habanoz in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F309\n* [Draft] Add chunking option to DocumentTokenizer, again, by @craffel in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F344\n* Add RayPipelineExecutor by @nelson-liu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F331\n* Bump ring from 0.17.8 to 0.17.14 in \u002Fsrc\u002Fdatatrove\u002Ftools\u002Ffast_mh3 by @dependabot in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F363\n* Bump tokio from 1.41.1 to 1.43.1 in \u002Fsrc\u002Fdatatrove\u002Ftools\u002Ffast_mh3 by @dependabot in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F362\n* Fix initialization of the signature priority queue in MinhashBuildIndex by @nelson-liu in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F334\n* Chunk-wise shuffling in DocumentTokenizerMerger by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F364\n* Return positions based on .index when return_positions=True by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F356\n\n## New Contributors\n* @kylematoba made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F314\n* @StephenRebel made their…","2025-05-01T14:52:23",{"id":173,"version":174,"summary_zh":175,"released_at":176},72526,"v0.4.0","## What's Changed\n* Fix small issues in the README by @hynky1999 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F280\n* 
Fix a bug where the document count in the reader pipeline was always lower than the actual number of documents by the number of files, by @lyuwen in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F286\n* Fix language listification bug by @BramVanroy in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F294\n* [Bug fix] Ensure each srun command launches only one task by @silverriver in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F296\n* [Bug fix] Fix the get_datafolder method in MinhashBuildIndex… by @Youggls in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F307\n* FineWeb-2: multilingual support, NumPy 2.0, MinHash optimizations by @guipenedo and @hynky1999 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F285:\n  - Upgrade to support NumPy 2.0\n  - Add extra word tokenizers and refactor word tokenizer assignment\n  - MinHash optimizations + new Rust tool to speed up step 3\n  - Add MinHash cluster size feature\n  - Fix memory leaks in some word tokenizers\n  - Update the URL blocklist\n  - Add caching to some word tokenization calls\n  - Glotlid support\n  - Other general bug fixes\n\n\n\n## New Contributors\n* @lyuwen made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F286\n* @BramVanroy made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F294\n* @silverriver made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F296\n* @Youggls made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F307\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fcompare\u002Fv0.3.0...v0.4.0","2024-12-06T18:43:59",{"id":178,"version":179,"summary_zh":180,"released_at":181},72527,"v0.3.0","## What's Changed\n* Add c4 bad words filter and batched tokenization for tokenscounter by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F160\n* Add skip parameter (default zero) to all readers by @rantav in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F167\n* Add n-gram based decontamination by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F172\n* fix: by @justHungryMan in 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F139: handle non-dict objects in to_dict without errors\n* Add `tasks_per_job` parameter to the Slurm executor by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F153\n* Unsigned integer optimization for the tokenizer and srun arguments by @marianna13 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F154\n* Enhance BaseReader to let custom adapters access instance variables by @justHungryMan in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F169\n* Remove ListFilter from the process_common_crawl_dump example by @QasidSaleem in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F181\n* Update Hf dataset by @hynky1999 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F170\n* Optimize URLFilter and add an option to disable the built-in word lists by @its5Q in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F174\n* Add progress display for files by @hynky1999 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F176\n* Make colorization of file and console output configurable by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F185\n* Migrate deduplication to xxhash by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F179\n* [WIP] Multilingual tokenization by @beme248 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F147\n* Add more word tokenizers by @vsabolcec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F187\n* Speed up CI with uv by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F188\n* URL index and inference of the missing hash_config structure by @hynky1999 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F191\n* Migrate pipeline blocks to the new word tokenizers by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F189\n* Fix snapshot representation and numeric conversion in the example code (fineweb) by @justHungryMan in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F192\n* By @justHungryMan in 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F193: extend randomize_start to the local executor\n* Add a description for randomize_start by @justHungryMan in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F194\n* Allow an integer argument for 'randomize_start' in executor\u002Fbase.py by @justHungryMan in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F199\n* DatatroveFolderDataset issue by @TJ-Solergibert in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F203\n* Code consistency for radomize_start_duration by @justHungryMan in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F207\n* feat(ci): add trufflehog secrets detection by @McPatate in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F211\n* fix(ci): remove unnecessary permissions by @McPatate in h","2024-08-28T15:47:00",{"id":183,"version":184,"summary_zh":185,"released_at":186},72528,"v0.2.0","## What's Changed\n* Add multi-node parallelism to the local executor by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F85\n* Change the default file path for FSX log output to the user's home directory by @Anacheron51 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F86\n* [`Docs`] Fix typos by @standardAI in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F91\n* Fix stats files not being saved to S3 by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F92\n* Fix URL stats issues by @thomwolf in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F89\n* Efficiency: replace `np.array` with `np.fromiter` by @giorgioangel in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F88\n* Add language option for NLTK by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F94\n* Fix compression type issues by @jordane95 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F95\n* By @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F98: decouple the reading logic from 
DedupReader\n* Support arbitrary FastText models by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F99\n* Add citation by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F101\n* Add Parquet writer by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F103\n* Tool for efficient parallel upload of dataset files to the Hugging Face Hub by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F105\n* Add docstrings and a faster tokenized document merger by @thomwolf in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F90\n* Add email notifications for Slurm and extend the FastText filter by @thomwolf in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F111\n* Add `jobs_status` command by @lvwerra in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F113\n* Re-enable `datasets` tests by @mariosasko in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F114\n* Update `warc.py` by @jordane95 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F115\n* Fix bug when a file is empty by @jordane95 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F126\n* Load the tokenizer with `from_file` by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F122\n* Add `depends=` parameter to LocalPipelineExecutor by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F100\n* Improve C4 filtering and deduplication by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F124\n* Add option for readers to shuffle input files by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F128\n* Update the Trafilatura version by @adbar in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F130\n* Changes to text normalization, FTFY and line symbol formatting by @guipenedo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F133\n* By @justHungryMan in 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F134: small updates to the terminology and documentation for loading local tokenizers\n* Add requeue and QOS Slurm options by @marianna13 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatatrove\u002Fpull\u002F144\n* By @jordane95 in https:\u002F\u002Fgithub","2024-04-22T17:18:51",{"id":188,"version":189,"summary_zh":190,"released_at":191},72529,"v0.0.1","Initial release","2024-02-07T15:10:12"]