[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-datajuicer--data-juicer":3,"tool-datajuicer--data-juicer":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",160784,2,"2026-04-19T11:32:54",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":77,"owner_twitter":76,"owner_website":76,"owner_url":78,"languages":79,"stars":98,"forks":99,"last_commit_at":100,"license":101,"difficulty_score":32,"env_os":102,"env_gpu":103,"env_ram":104,"env_deps":105,"category_tags":112,"github_topics":114,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":129,"updated_at":130,"faqs":131,"releases":162},9738,"datajuicer\u002Fdata-juicer","data-juicer","Data processing for and with foundation models!  
🍎 🍋 🌽 ➡️ ➡️🍸 🍹 🍷","Data-Juicer 是专为大模型时代打造的数据操作系统，旨在将杂乱无章的原始数据转化为高质量的 AI 就绪智能资产。它通过模块化的构建块，帮助用户高效完成数据的清洗、合成与分析，覆盖从预训练语料去重、智能体交互轨迹整理，到领域专用检索增强生成（RAG）索引构建的全流程。\n\n面对海量多模态数据处理难、流程复现成本高以及从小规模实验到千节点集群扩展复杂等痛点，Data-Juicer 提供了无缝衔接的解决方案。无论是个人开发者在笔记本上调试，还是企业在大规模集群上运行任务，都能无需编写繁琐的“胶水代码”即可轻松应对。\n\n这款工具特别适合 AI 研究人员、数据工程师及大模型开发者使用。其核心亮点在于拥有超过 200 个涵盖文本、图像、音频及视频的操作算子，支持通过 YAML 配置文件像管理代码一样版本化、分享和复用数据处理流水线（Recipe）。此外，它还具备云原生架构特性，支持热重载技术，让用户能在不重启流程的情况下快速迭代算子，极大地提升了数据处理的灵活性与效率。","#  Data-Juicer: The Data Operating System for the Foundation Model Era\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fpy-data-juicer\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fpy-data-juicer?logo=pypi&color=026cad\" alt=\"PyPI\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpepy.tech\u002Fprojects\u002Fpy-data-juicer\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fdatajuicer_data-juicer_readme_c41078afd39f.png\" alt=\"Downloads\">\u003C\u002Fa>\n   \u003Ca href=\"https:\u002F\u002Fhub.docker.com\u002Fr\u002Fdatajuicer\u002Fdata-juicer\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fdocker\u002Fv\u002Fdatajuicer\u002Fdata-juicer?logo=docker&label=Docker&color=498bdf\" alt=\"Docker\">\u003C\u002Fa>\n  \u003Cbr>\n  \u003Ca href=\"https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📖_Docs-Website-026cad\" alt=\"Docs\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002FOperators.html\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🧩_Operators-200+-blue\" alt=\"Operators\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-hub\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🍳_Recipes-50+-brightgreen\" alt=\"Recipes\">\u003C\u002Fa>\n  \u003Cbr>\n  \u003Ca 
href=\"https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fzh_CN\u002Fmain\u002Findex_ZH.html\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🇨🇳_文档-主页-red\" alt=\"Chinese\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.14755\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS'25_Spotlight-2.0-B31B1B?logo=arxiv\" alt=\"Paper\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fendpoint?style=flat&url=https%3A%2F%2Fgist.githubusercontent.com%2FHYLcool%2Ff856b14416f08f73d05d32fd992a9c29%2Fraw%2Ftotal_cov.json&label=coverage&logo=codecov&color=4c1\" alt=\"Coverage\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cb>Multimodal | Cloud-Native | AI-Ready | Large-Scale \u003C\u002Fb>\n\u003C\u002Fp>\n\nData-Juicer (DJ) transforms raw data chaos into AI-ready intelligence. It treats data processing as *composable infrastructure*—providing modular building blocks to clean, synthesize, and analyze data across the entire AI lifecycle, unlocking latent value in every byte.\n\nWhether you're deduplicating web-scale pre-training corpora, curating agent interaction traces, or preparing domain-specific RAG indices, DJ scales seamlessly from your laptop to thousand-node clusters—no glue code required.\n\n> **Alibaba Cloud PAI** has deeply integrated Data-Juicer into its data processing products.  
See **[Quickly submit a DataJuicer job](https:\u002F\u002Fwww.alibabacloud.com\u002Fhelp\u002Fen\u002Fpai\u002Fuser-guide\u002Fquickly-submit-a-datajuicer-task)**.\n\n---\n\n## 🚀 Quick Start\n\n**Zero-install exploration**: \n- [JupyterLab Playground with Tutorials](http:\u002F\u002F8.138.149.181\u002F) \n- [Ask DJ Copilot](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs_index.html)\n\n**Install & run**:\n```bash\nuv pip install py-data-juicer\ndj-process --config demos\u002Fprocess_simple\u002Fprocess.yaml\n```\n\n**Or compose in Python**:\n```python\nfrom data_juicer.core.data import NestedDataset\nfrom data_juicer.ops.filter import TextLengthFilter\nfrom data_juicer.ops.mapper import WhitespaceNormalizationMapper\n\nds = NestedDataset.from_dict({\n    \"text\": [\"Short\", \"This passes the filter.\", \"Text   with   spaces\"]\n})\nres_ds = ds.process([\n    TextLengthFilter(min_len=10),\n    WhitespaceNormalizationMapper()\n])\n\nfor s in res_ds:\n    print(s)\n```\n\n\n---\n\n## ✨ Why Data-Juicer?\n\n### 1. Modular & Extensible Architecture\n- **200+ operators** spanning text, image, audio, video, and multimodal data\n- **Recipe-first**: Reproducible YAML pipelines you can version, share, and fork like code\n- **Composable**: Drop in a single operator, chain complex workflows, or orchestrate full pipelines\n- **Hot-reload**: Iterate on operators without pipeline restarts\n\n### 2. Full-Spectrum Data Intelligence\n- **Foundation Models**: Pre-training, fine-tuning, RL, and evaluation-grade curation\n- **Agent Systems**: Clean tool traces, structure context, de-identification, and quality gating\n- **RAG & Analytics**: Extraction, normalization, semantic chunking, deduplication, and data profiling\n\n\n### 3. 
Production-Ready Performance\n- **Scale**: Process 70B samples in 2h on 50 Ray nodes (6400 cores)\n- **Efficiency**: Deduplicate 5TB in 2.8h using 1280 cores\n- **Optimization**: Automatic OP fusion (2-10x speedup), adaptive parallelism, CUDA acceleration, robustness\n- **Observability**: Built-in tracing for debugging, auditing, and iterative improvement\n\n> *⭐ If Data-Juicer saved you time or improved your data work, please consider starring the repo.* It helps more people discover the project and keeps you notified of new releases and features.\n\n---\n\n## 📰 News\n\n\u003Cdetails open>\n\u003Csummary>[2026-03-17] Release v1.5.1: \u003Cb>LaTeX OPs; Compressed Format Support; Operator Robustness Fixes\u003C\u002Fb>\u003C\u002Fsummary>\n\n* 📄 Two new LaTeX-focused mapper OPs shipped, extending data-juicer's document processing capabilities to handle `.tex` archives and figure contexts.\n* 🗜️ Compressed dataset format support: `json[l].gz` files can now be loaded directly, and Ray datasets gain proper support for reading compressed JSON files.\n* 📚 New documentation added covering cache, export, and tracing workflows to help users better understand and debug data processing pipelines.\n* 🤖 Major refactor and upgrade of data-juicer-agents completed: The project architecture and CLI\u002Fsession capabilities were comprehensively redesigned for better maintainability and extensibility. 
See [data-juicer-agents](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-agents) for more details.\n\u003C\u002Fdetails>\n\n\u003Cdetails open>\n\u003Csummary>[2026-02-12] Release v1.5.0: \u003Cb>Partitioned Ray Executor, OP-level Env Management, and More Embodied-AI OPs\u003C\u002Fb>\u003C\u002Fsummary>\n\n- 🚀 *Enhanced Distributed Execution Framework* -- Introduced partitioned Ray executor and OP-level isolated environments to improve fault tolerance, scalability, and dependency conflict resolution.\n- 🤖 *Expanded Embodied AI Video Processing* -- Added specialized operators for camera calibration, video undistortion, hand reconstruction, and pose estimation to strengthen multi-view video handling.\n- 💪🏻 *System Performance & Developer Experience Optimizations* -- Enabled batch inference, memory\u002Flog reduction, core logic refactoring, and updated documentation\u002Ftemplates.\n- 🐳 *Critical Bug Fixes & Stability Improvements* -- Resolved duplicate tracking, parameter conflicts, homepage rendering issues, and outdated docs for higher reliability.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>[2026-02-02] Release v1.4.6: \u003Cb>Copilot, Video Bytes I\u002FO & Ray Tracing \u003C\u002Fb>\u003C\u002Fsummary>\n\n- 🤖 *Q&A Copilot* — Now live on our [Doc Site](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Findex.html) | [DingTalk](https:\u002F\u002Fqr.dingtalk.com\u002Faction\u002Fjoingroup?code=v1,k1,N78tgW54U447gJP5aMC95B6qgQhlkVQS4+dp7qQq6MpuRVJIwrSsXmL8oFqU5ajJ&_dt_no_comment=1&origin=11?) | [Discord](https:\u002F\u002Fdiscord.gg\u002FngQbB9hEVK). Feel free to ask anything related to the Data-Juicer ecosystem!  
\n    - Check 🤖 [Data-Juicer Agents](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-agents\u002Fblob\u002Fmain) | 📃 [Deploy-ready codes](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-agents\u002Fblob\u002Fmain\u002Fqa-copilot) | 🎬[ More demos](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-agents\u002Fblob\u002Fmain\u002Fqa-copilot\u002FDEMO.md) for more details.\n- 🎬 *Video Bytes I\u002FO* — Direct bytes processing for video pipelines  \n- 🫆 *Ray Mode Tracer* — Track changed samples in distributed processing  \n- 🐳 *Enhancements & fixes* — refreshed Docker image, small perf boosts, GitHub Insights traffic workflow, Ray compatibility updates, and bug\u002Fdoc fixes.\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>[2026-01-15] Release v1.4.5: \u003Cb>20+ New OPs, Ray vLLM Pipelines & Sphinx Docs Upgrade\u003C\u002Fb> \u003C\u002Fsummary>\n\n- *Embodied-AI OPs*: added\u002Fenhanced mappers for video captioning (VLM), video object segmentation (YOLOE+SAM2), video depth estimation (viz + point cloud), human pose (MMPose), image tagging (VLM), single-image 3D body mesh recovery (SAM 3D Body), plus *S3 upload\u002Fdownload*.\n- *New Pipeline OP*: compose multiple OPs into one pipeline; introduced *Ray + vLLM* pipelines for LLM\u002FVLM inference.\n- *Docs upgrade*: moved to a unified *Sphinx-based* documentation build\u002Fdeploy workflow with isolated theme\u002Farchitecture repo.\n- *Enhancements & fixes*: dependency updates, improved Ray deduplication and S3 loading, OpenAI Responses API support, tracer consistency, Docker base updated to CUDA 12.6.3 + Ubuntu 24.04 + Py3.11, and multiple bug fixes. 
\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>[2025-12-01] Release v1.4.4: \u003Cb>NeurIPS’25 Spotlight, 6 New Video\u002FMM OPs & S3 I\u002FO\u003C\u002Fb> \u003C\u002Fsummary>\n\n- NeurIPS'25 **Spotlight** for Data-Juicer 2.0\n- *Repo split*: sandbox\u002Frecipes\u002Fagents moved to standalone repos\n- *S3 I\u002FO* added to loader\u002Fexporter\n- *6 new video & multimodal OPs* (character detection, VGGT, whole-body pose, hand reconstruction) + docs\u002FRay\u002Fvideo I\u002FO improvements and bug fixes\n\u003C\u002Fdetails>\n\nView [All Releases](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Freleases) and [News Archive](docs\u002Fnews.md)\n\n---\n\n## 🔌 Users & Ecosystems\n> The list below focuses on *developer-facing integrations and usage*, in *alphabetical order*.  \n> Missing your project \u002F name? Feel free to [open a PR](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Fpulls) or [reach out](#contributing--community).\n\nData-Juicer plugs into your existing stack and evolves with community contributions:\n\n### Extensions\n- **[data-juicer-agents](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-agents)** — DJ Copilot and agentic workflows  \n- **[data-juicer-hub](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-hub)** — Community recipes and best practices  \n- **[data-juicer-sandbox](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-sandbox)** — Data-model co-development with feedback loops  \n\n\n### Frameworks & Platforms\n[AgentScope](https:\u002F\u002Fgithub.com\u002Fagentscope-ai\u002Fagentscope) · [Apache Arrow](https:\u002F\u002Fgithub.com\u002Fapache\u002Farrow) · [Apache HDFS](https:\u002F\u002Fhadoop.apache.org\u002Fdocs\u002Fstable\u002Fhadoop-project-dist\u002Fhadoop-hdfs\u002FHdfsUserGuide.html) · [Apache Hudi](https:\u002F\u002Fhudi.apache.org\u002F) · [Apache Iceberg](https:\u002F\u002Ficeberg.apache.org\u002F) · [Apache 
Paimon](https:\u002F\u002Fpaimon.apache.org\u002F) · [Alibaba PAI](https:\u002F\u002Fwww.alibabacloud.com\u002Fen\u002Fproduct\u002Fmachine-learning?_p_lc=1) · [Delta Lake](https:\u002F\u002Fdelta.io\u002F) · [DiffSynth-Studio](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FDiffSynth-Studio) · [EasyAnimate](https:\u002F\u002Fgithub.com\u002Faigc-apps\u002FEasyAnimate) · [Eval-Scope](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fevalscope) · [Huawei Ascend](https:\u002F\u002Fwww.huawei.com\u002Fen\u002Fproducts\u002Fcloud-computing-dc\u002Fatlas\u002Fascend) · [Hugging Face](https:\u002F\u002Fhuggingface.co\u002F) · [LanceDB](https:\u002F\u002Flancedb.github.io\u002Flance\u002F) · [LLaMA-Factory](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory) · [ModelScope](https:\u002F\u002Fmodelscope.cn\u002F) · [ModelScope Swift](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fms-swift) · [NVIDIA NeMo](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo) · [Ray](https:\u002F\u002Fdocs.ray.io\u002F) · [RM-Gallery](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FRM-Gallery) · [Trinity-RFT](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FTrinity-RFT) · [Volcano Engine](https:\u002F\u002Fwww.volcengine.com\u002F)\n\n### Industry\nAlibaba Group, Ant Group, BYD Auto, ByteDance, DTSTACK, JD.com, NVIDIA, OPPO, Xiaohongshu, Xiaomi, Ximalaya, and more.\n\n### Academia\nCAS, Nanjing University, Peking University, RUC, Tsinghua University, UCAS, Zhejiang University, and more.\n\n\n###  Contributing & Community\nWe believe in *building together*. 
Whether you're fixing a typo, crafting a new operator, or sharing a breakthrough recipe, every contribution shapes the future of data processing.\n\nWe welcome contributions at all levels: \n- **[Good First Issues](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Flabels\u002Fgood%20first%20issue)** — Add operators, improve docs, report issues, or fix bugs\n- **[Developer Guide](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002FDeveloperGuide.html)** — Optimize engines, add features, or enhance core infrastructure\n- **[DJ-Hub](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-hub)** — Share knowledge: recipes, papers, and best practices\n- **Connect**: [Slack](https:\u002F\u002Fjoin.slack.com\u002Ft\u002Fdata-juicer\u002Fshared_invite\u002Fzt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg) · [DingTalk](https:\u002F\u002Fqr.dingtalk.com\u002Faction\u002Fjoingroup?code=v1,k1,N78tgW54U447gJP5aMC95B6qgQhlkVQS4+dp7qQq6MpuRVJIwrSsXmL8oFqU5ajJ&_dt_no_comment=1&origin=11?) 
· [Discord](https:\u002F\u002Fdiscord.gg\u002FngQbB9hEVK)\n\n| Discord | DingTalk |\n|:---:|:---:|\n| \u003Cimg src=\"https:\u002F\u002Fgw.alicdn.com\u002Fimgextra\u002Fi1\u002FO1CN011Oj8CB1f8Bw5JpgJA_!!6000000003961-0-tps-762-769.jpg\" width=\"100\"> | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fdatajuicer_data-juicer_readme_59a41761720f.png\" width=\"100\"> |\n\n\nData-Juicer is made possible by the users and community:\n- **Initiated by**: Alibaba Tongyi Lab  \n- **Co-developed with**: Alibaba Cloud PAI, Anyscale (Ray team), Sun Yat-sen University, NVIDIA (NeMo team), and [contributors worldwide](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Fgraphs\u002Fcontributors)\n- **Inspired by**: Apache Arrow, Ray, Hugging Face Datasets, BLOOM, RedPajama-Data, ...\n\n---\n\n\n## Documentation\n\nFor detailed documentation, please see [here](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs_index.html).\n\n**Quick Links:**\n- **[operator zoo](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002FOperators.html)** — Browse 200+ operators with examples\n- **[Agent interaction quality & bad-case](demos\u002Fagent\u002FREADME.md)** — In-repo recipe, JSONL pipeline, HTML report (`demos\u002Fagent\u002F`; operators such as `agent_bad_case_signal_mapper` are also listed in [docs\u002FOperators.md](docs\u002FOperators.md))\n- **[data-juicer-hub](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-hub)** — Community-driven recipes and best practices\n- **[developer guide](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002FDeveloperGuide.html)** — Build your own code and contribute to DJ \n- **[data-juicer-cookbook](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002Ftutorial\u002FDJ-Cookbook.html)** — resource archive\n- 
**[awesome_llm_data](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002Fawesome_llm_data)** —  “Awesome List” for data-model co-development\n\n\n---\n\n## 📄 License & Attribution\n\nData-Juicer is released under the [Apache License 2.0](LICENSE).  \nAttribution is appreciated: please use our [badge](https:\u002F\u002Fdail-wlcb.oss-cn-wulanchabu.aliyuncs.com\u002Fdata_juicer\u002Fassets\u002FDJ-Org-Logo.jpeg), or text as \"This project uses Data-Juicer: https:\u002F\u002Fgithub.com\u002Fdatajuicer\".\n\n---\n\n## 📖 Citation\nIf you find Data-Juicer useful in your work, please cite:\n\n```bibtex\n@inproceedings{djv1,\n  title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},\n  author={Chen, Daoyuan and Huang, Yilun and Ma, Zhijian and Chen, Hesen and Pan, Xuchen and Ge, Ce and Gao, Dawei and Xie, Yuexiang and Liu, Zhaoyang and Gao, Jinyang and Li, Yaliang and Ding, Bolin and Zhou, Jingren},\n  booktitle={SIGMOD},\n  year={2024}\n}\n\n@article{djv2,\n  title={Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models},\n  author={Chen, Daoyuan and Huang, Yilun and Pan, Xuchen and Jiang, Nana and Wang, Haibin and Zhang, Yilei and Ge, Ce and Chen, Yushuo and Zhang, Wenhao and Ma, Zhijian and Huang, Jun and Lin, Wei and Li, Yaliang and Ding, Bolin and Zhou, Jingren},\n  journal={NeurIPS},\n  year={2025}\n}\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>More Publications\u003C\u002Fb> (Click to expand)\u003C\u002Fsummary>\n\n- (ICML'25 Spotlight) [Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.11784)\n\n- (CVPR'25) [ImgDiff: Contrastive Data Synthesis for Vision Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.04594)\n \n- (TPAMI'25) [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development 
Perspective](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.08583)\n\n- (NeurIPS'25) [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04380)\n\n- (NeurIPS'25) [MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09499)\n\n- (Benchmark Data) [HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17574)\n \n- (Benchmark Data) [DetailMaster: Can Your Text-to-Image Model Handle Long Prompts?](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2505.16915)\n\n- (Data Scaling) [BiMix: A Bivariate Data Mixing Law for Language Model Pretraining](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.14908)\n\n\u003C\u002Fdetails>\n\n","# Data-Juicer：面向基础模型时代的数据操作系统\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fpy-data-juicer\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fpy-data-juicer?logo=pypi&color=026cad\" alt=\"PyPI\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpepy.tech\u002Fprojects\u002Fpy-data-juicer\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fdatajuicer_data-juicer_readme_c41078afd39f.png\" alt=\"下载量\">\u003C\u002Fa>\n   \u003Ca href=\"https:\u002F\u002Fhub.docker.com\u002Fr\u002Fdatajuicer\u002Fdata-juicer\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fdocker\u002Fv\u002Fdatajuicer\u002Fdata-juicer?logo=docker&label=Docker&color=498bdf\" alt=\"Docker\">\u003C\u002Fa>\n  \u003Cbr>\n  \u003Ca href=\"https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002F\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F📖_Docs-Website-026cad\" alt=\"文档\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002FOperators.html\">\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🧩_Operators-200+-blue\" alt=\"算子\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-hub\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🍳_Recipes-50+-brightgreen\" alt=\"配方\">\u003C\u002Fa>\n  \u003Cbr>\n  \u003Ca href=\"https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fzh_CN\u002Fmain\u002Findex_ZH.html\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F🇨🇳_文档-主页-red\" alt=\"中文\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2501.14755\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FNeurIPS'25_Spotlight-2.0-B31B1B?logo=arxiv\" alt=\"论文\">\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fendpoint?style=flat&url=https%3A%2F%2Fgist.githubusercontent.com%2FHYLcool%2Ff856b14416f08f73d05d32fd992a9c29%2Fraw%2Ftotal_cov.json&label=coverage&logo=codecov&color=4c1\" alt=\"覆盖率\">\n  \u003C\u002Fa>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Cb>多模态 | 云原生 | AI就绪 | 大规模\u003C\u002Fb>\n\u003C\u002Fp>\n\nData-Juicer（DJ）将原始数据的混乱转化为AI就绪的智能。它将数据处理视为*可组合的基础设施*——提供模块化的构建块，用于在整个AI生命周期中清洗、合成和分析数据，从而释放每一字节中的潜在价值。\n\n无论您是在去重网络规模的预训练语料库、策划智能体交互轨迹，还是准备领域特定的RAG索引，DJ都能无缝扩展，从您的笔记本电脑到上千节点的集群——无需任何胶水代码。\n\n> **阿里云PAI**已将Data-Juicer深度集成到其数据处理产品中。请参阅**[快速提交DataJuicer任务](https:\u002F\u002Fwww.alibabacloud.com\u002Fhelp\u002Fen\u002Fpai\u002Fuser-guide\u002Fquickly-submit-a-datajuicer-task)**。\n\n---\n\n## 🚀 快速入门\n\n**零安装探索**：\n- [带有教程的JupyterLab Playground](http:\u002F\u002F8.138.149.181\u002F)\n- [向DJ Copilot提问](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs_index.html)\n\n**安装并运行**：\n```bash\nuv pip install py-data-juicer\ndj-process --config demos\u002Fprocess_simple\u002Fprocess.yaml\n```\n\n**或在Python中组合**：\n```python\nfrom data_juicer.core.data import 
NestedDataset\nfrom data_juicer.ops.filter import TextLengthFilter\nfrom data_juicer.ops.mapper import WhitespaceNormalizationMapper\n\nds = NestedDataset.from_dict({\n    \"text\": [\"Short\", \"This passes the filter.\", \"Text   with   spaces\"]\n})\nres_ds = ds.process([\n    TextLengthFilter(min_len=10),\n    WhitespaceNormalizationMapper()\n])\n\nfor s in res_ds:\n    print(s)\n```\n\n\n---\n\n## ✨ 为什么选择Data-Juicer？\n\n### 1. 模块化与可扩展架构\n- **200+种算子**，覆盖文本、图像、音频、视频及多模态数据\n- **配方优先**：可复现的YAML流水线，像代码一样可以版本控制、共享和分叉\n- **可组合性**：可单独插入一个算子，也可串联复杂的工作流，或编排完整的流水线\n- **热加载**：无需重启流水线即可迭代算子\n\n### 2. 全方位数据智能\n- **基础模型**：预训练、微调、强化学习及评估级别的数据精选\n- **智能体系统**：清理工具痕迹、结构化上下文、去标识化和质量门控\n- **RAG与分析**：提取、归一化、语义分块、去重以及数据概览\n\n### 3. 生产级性能\n- **规模**：在50个Ray节点（6400核）上2小时内处理700亿条样本\n- **效率**：使用1280核在2.8小时内去重5TB数据\n- **优化**：自动算子融合（提速2–10倍）、适应性并行、CUDA加速、高鲁棒性\n- **可观测性**：内置追踪功能，便于调试、审计和迭代改进\n\n> *⭐ 如果Data-Juicer为您节省了时间或提升了数据工作质量，请考虑给本仓库点个星。这有助于更多人发现该项目，并让您及时了解新版本和新功能。*\n\n---\n\n## 📰 新闻\n\n\u003Cdetails open>\n\u003Csummary>[2026-03-17] 发布 v1.5.1：\u003Cb>LaTeX 操作符；压缩格式支持；操作符鲁棒性修复\u003C\u002Fb>\u003C\u002Fsummary>\n\n* 📄 推出了两个新的以 LaTeX 为核心的映射器操作符，扩展了 data-juicer 的文档处理能力，可处理 `.tex` 归档文件和图表上下文。\n* 🗜️ 增加了对压缩数据集格式的支持：现在可以直接加载 `json[l].gz` 文件，Ray 数据集也正式支持读取压缩的 JSON 文件。\n* 📚 新增了关于缓存、导出和追踪工作流的文档，帮助用户更好地理解并调试数据处理管道。\n* 🤖 完成了 data-juicer-agents 的重大重构与升级：项目架构及 CLI\u002F会话功能得到了全面重新设计，以提升可维护性和扩展性。更多详情请参阅 [data-juicer-agents](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-agents)。\n\u003C\u002Fdetails>\n\n\u003Cdetails open>\n\u003Csummary>[2026-02-12] 发布 v1.5.0：\u003Cb>分区式 Ray 执行器、OP 级别环境管理以及更多具身智能 OP\u003C\u002Fb>\u003C\u002Fsummary>\n\n- 🚀 *增强的分布式执行框架* — 引入了分区式 Ray 执行器和 OP 级别的隔离环境，以提高容错性、可扩展性以及依赖冲突的解决能力。\n- 🤖 *扩展的具身智能视频处理* — 新增了用于相机标定、视频去畸变、手部重建和姿态估计的专用操作符，从而强化了多视角视频的处理能力。\n- 💪🏻 *系统性能与开发者体验优化* — 支持批量推理、减少内存和日志输出、重构核心逻辑，并更新了文档和模板。\n- 🐳 *关键 bug 修复与稳定性提升* — 解决了重复跟踪、参数冲突、首页渲染问题以及文档过时等问题，进一步提升了可靠性。\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>[2026-02-02] 发布 
v1.4.6：\u003Cb>Copilot、视频字节 I\u002FO 和 Ray 追踪\u003C\u002Fb>\u003C\u002Fsummary>\n\n- 🤖 *问答 Copilot* — 已在我们的 [文档网站](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Findex.html) | [钉钉](https:\u002F\u002Fqr.dingtalk.com\u002Faction\u002Fjoingroup?code=v1,k1,N78tgW54U447gJP5aMC95B6qgQhlkVQS4+dp7qQq6MpuRVJIwrSsXmL8oFqU5ajJ&_dt_no_comment=1&origin=11?) | [Discord](https:\u002F\u002Fdiscord.gg\u002FngQbB9hEVK) 上线。欢迎随时提问有关 Data-Juicer 生态系统的任何问题！  \n    - 更多详情请查看 🤖 [Data-Juicer Agents](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-agents\u002Fblob\u002Fmain) | 📃 [部署就绪代码](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-agents\u002Fblob\u002Fmain\u002Fqa-copilot) | 🎬[ 更多演示](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-agents\u002Fblob\u002Fmain\u002Fqa-copilot\u002FDEMO.md)。\n- 🎬 *视频字节 I\u002FO* — 直接处理视频管道中的字节数据。\n- 🫆 *Ray 模式追踪器* — 在分布式处理中跟踪发生变化的样本。\n- 🐳 *增强与修复* — 更新了 Docker 镜像，小幅提升了性能，优化了 GitHub Insights 流量工作流程，更新了 Ray 兼容性，并修复了一些 bug 和文档问题。\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>[2026-01-15] 发布 v1.4.5：\u003Cb>20 多个新 OP、Ray vLLM 管道和 Sphinx 文档升级\u003C\u002Fb>\u003C\u002Fsummary>\n\n- *具身智能 OP*：新增或增强了用于视频字幕生成 (VLM)、视频目标分割 (YOLOE+SAM2)、视频深度估计 (viz + 点云)、人体姿态估计 (MMPose)、图像标签 (VLM)、单张图像 3D 身体网格恢复 (SAM 3D Body) 的映射器，同时还增加了 *S3 上传\u002F下载* 功能。\n- *新型管道 OP*：可将多个 OP 组合成一条管道；引入了用于 LLM\u002FVLM 推理的 *Ray + vLLM* 管道。\n- *文档升级*：迁移到统一的基于 Sphinx 的文档构建\u002F部署工作流，并采用了独立的主题和架构仓库。\n- *增强与修复*：更新了依赖项，改进了 Ray 的去重和 S3 加载功能，支持 OpenAI Responses API，提升了追踪器的一致性，Docker 基础镜像更新为 CUDA 12.6.3 + Ubuntu 24.04 + Py3.11，并修复了多个 bug。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>[2025-12-01] 发布 v1.4.4：\u003Cb>NeurIPS’25 特别关注、6 个新的视频\u002F多模态 OP 以及 S3 I\u002FO\u003C\u002Fb>\u003C\u002Fsummary>\n\n- Data-Juicer 2.0 荣获 NeurIPS'25 **特别关注**\n- *代码库拆分*：sandbox\u002Frecipes\u002Fagents 被移至独立的代码库。\n- *S3 I\u002FO* 被添加到加载器和导出器中。\n- *6 个新的视频和多模态 OP*（角色检测、VGGT、全身姿态、手部重建）+ 文档\u002FRay\u002F视频 I\u002FO 的改进以及 bug 
修复。\n\u003C\u002Fdetails>\n\n查看 [所有发布](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Freleases) 和 [新闻存档](docs\u002Fnews.md)\n\n---\n\n## 🔌 用户与生态系统\n> 下列列表按字母顺序排列，重点介绍面向开发者的集成与使用情况。  \n> 如果您的项目或名字未在此列出，请随时 [提交 PR](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Fpulls) 或 [联系我们](#contributing--community)。\n\nData-Juicer 可无缝接入您现有的技术栈，并随着社区贡献不断演进：\n\n### 扩展\n- **[data-juicer-agents](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-agents)** — DJ Copilot 和代理式工作流。\n- **[data-juicer-hub](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-hub)** — 社区配方与最佳实践。\n- **[data-juicer-sandbox](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-sandbox)** — 与反馈循环结合的数据模型协同开发。\n\n### 框架与平台\n[AgentScope](https:\u002F\u002Fgithub.com\u002Fagentscope-ai\u002Fagentscope) · [Apache Arrow](https:\u002F\u002Fgithub.com\u002Fapache\u002Farrow) · [Apache HDFS](https:\u002F\u002Fhadoop.apache.org\u002Fdocs\u002Fstable\u002Fhadoop-project-dist\u002Fhadoop-hdfs\u002FHdfsUserGuide.html) · [Apache Hudi](https:\u002F\u002Fhudi.apache.org\u002F) · [Apache Iceberg](https:\u002F\u002Ficeberg.apache.org\u002F) · [Apache Paimon](https:\u002F\u002Fpaimon.apache.org\u002F) · [阿里巴巴 PAI](https:\u002F\u002Fwww.alibabacloud.com\u002Fen\u002Fproduct\u002Fmachine-learning?_p_lc=1) · [Delta Lake](https:\u002F\u002Fdelta.io\u002F) · [DiffSynth-Studio](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FDiffSynth-Studio) · [EasyAnimate](https:\u002F\u002Fgithub.com\u002Faigc-apps\u002FEasyAnimate) · [Eval-Scope](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fevalscope) · [华为 Ascend](https:\u002F\u002Fwww.huawei.com\u002Fen\u002Fproducts\u002Fcloud-computing-dc\u002Fatlas\u002Fascend) · [Hugging Face](https:\u002F\u002Fhuggingface.co\u002F) · [LanceDB](https:\u002F\u002Flancedb.github.io\u002Flance\u002F) · [LLaMA-Factory](https:\u002F\u002Fgithub.com\u002Fhiyouga\u002FLLaMA-Factory) · [ModelScope](https:\u002F\u002Fmodelscope.cn\u002F) · 
[ModelScope Swift](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fms-swift) · [NVIDIA NeMo](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo) · [Ray](https:\u002F\u002Fdocs.ray.io\u002F) · [RM-Gallery](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FRM-Gallery) · [Trinity-RFT](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FTrinity-RFT) · [火山引擎](https:\u002F\u002Fwww.volcengine.com\u002F)\n\n### 行业\n阿里巴巴集团、蚂蚁集团、比亚迪汽车、字节跳动、DTSTACK、京东、英伟达、OPPO、小红书、小米、喜马拉雅等。\n\n### 学术界\n中国科学院、南京大学、北京大学、中国人民大学、清华大学、中国科学院大学、浙江大学等。\n\n### 贡献与社区\n我们坚信*共同构建*。无论您是在修正拼写错误、开发新的算子，还是分享突破性的数据处理方案，每一次贡献都在塑造数据处理的未来。\n\n我们欢迎各层次的贡献：\n- **[适合新手的议题](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Flabels\u002Fgood%20first%20issue)** — 添加算子、改进文档、报告问题或修复 bug\n- **[开发者指南](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002FDeveloperGuide.html)** — 优化引擎、添加功能或增强核心基础设施\n- **[DJ-Hub](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-hub)** — 分享知识：数据处理配方、论文和最佳实践\n- **交流**：[Slack](https:\u002F\u002Fjoin.slack.com\u002Ft\u002Fdata-juicer\u002Fshared_invite\u002Fzt-23zxltg9d-Z4d3EJuhZbCLGwtnLWWUDg) · [钉钉](https:\u002F\u002Fqr.dingtalk.com\u002Faction\u002Fjoingroup?code=v1,k1,N78tgW54U447gJP5aMC95B6qgQhlkVQS4+dp7qQq6MpuRVJIwrSsXmL8oFqU5ajJ&_dt_no_comment=1&origin=11?) 
· [Discord](https:\u002F\u002Fdiscord.gg\u002FngQbB9hEVK)\n\n| Discord | 钉钉 |\n|:---:|:---:|\n| \u003Cimg src=\"https:\u002F\u002Fgw.alicdn.com\u002Fimgextra\u002Fi1\u002FO1CN011Oj8CB1f8Bw5JpgJA_!!6000000003961-0-tps-762-769.jpg\" width=\"100\"> | \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fdatajuicer_data-juicer_readme_59a41761720f.png\" width=\"100\"> |\n\n\nData-Juicer 的发展离不开用户和社区的支持：\n- **发起方**：阿里巴巴通义实验室  \n- **联合开发**：阿里云 PAI、Anyscale（Ray 团队）、中山大学、NVIDIA（NeMo 团队）以及[全球各地的贡献者](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Fgraphs\u002Fcontributors)\n- **灵感来源**：Apache Arrow、Ray、Hugging Face Datasets、BLOOM、RedPajama-Data 等\n\n---\n\n\n## 文档\n\n如需详细文档，请参阅[此处](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs_index.html)。\n\n**快速链接：**\n- **[算子大全](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002FOperators.html)** — 浏览 200 多个算子及示例\n- **[Agent 交互质量与不良案例](demos\u002Fagent\u002FREADME.md)** — 仓库内配方、JSONL 流水线、HTML 报告（`demos\u002Fagent\u002F`；诸如 `agent_bad_case_signal_mapper` 等算子也列在 [docs\u002FOperators.md](docs\u002FOperators.md) 中）\n- **[data-juicer-hub](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-hub)** — 社区驱动的数据处理配方与最佳实践\n- **[开发者指南](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002FDeveloperGuide.html)** — 搭建您自己的开发环境并为 DJ 做出贡献\n- **[data-juicer-cookbook](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002Ftutorial\u002FDJ-Cookbook.html)** — 资源档案\n- **[awesome_llm_data](https:\u002F\u002Fdatajuicer.github.io\u002Fdata-juicer\u002Fen\u002Fmain\u002Fdocs\u002Fawesome_llm_data)** — 数据与模型协同开发的“Awesome 列表”\n\n\n---\n\n## 📄 许可与署名\n\nData-Juicer 采用 [Apache License 2.0](LICENSE) 开源。  \n如您使用本项目，请注明署名：可使用[此处](https:\u002F\u002Fdail-wlcb.oss-cn-wulanchabu.aliyuncs.com\u002Fdata_juicer\u002Fassets\u002FDJ-Org-Logo.jpeg)提供的徽标，或以文字形式标注：“本项目使用 
Data-Juicer：https:\u002F\u002Fgithub.com\u002Fdatajuicer”。\n\n---\n\n## 📖 引用\n若您在工作中使用了 Data-Juicer，请引用以下文献：\n\n```bibtex\n@inproceedings{djv1,\n  title={Data-Juicer: A One-Stop Data Processing System for Large Language Models},\n  author={Chen, Daoyuan and Huang, Yilun and Ma, Zhijian and Chen, Hesen and Pan, Xuchen and Ge, Ce and Gao, Dawei and Xie, Yuexiang and Liu, Zhaoyang and Gao, Jinyang and Li, Yaliang and Ding, Bolin and Zhou, Jingren},\n  booktitle={SIGMOD},\n  year={2024}\n}\n\n@article{djv2,\n  title={Data-Juicer 2.0: Cloud-Scale Adaptive Data Processing for and with Foundation Models},\n  author={Chen, Daoyuan and Huang, Yilun and Pan, Xuchen and Jiang, Nana and Wang, Haibin and Zhang, Yilei and Ge, Ce and Chen, Yushuo and Zhang, Wenhao and Ma, Zhijian and Huang, Jun and Lin, Wei and Li, Yaliang and Ding, Bolin and Zhou, Jingren},\n  journal={NeurIPS},\n  year={2025}\n}\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Cb>更多出版物\u003C\u002Fb>（点击展开）\u003C\u002Fsummary>\n\n- (ICML'25 Spotlight) [Data-Juicer Sandbox: A Feedback-Driven Suite for Multimodal Data-Model Co-development](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.11784)\n\n- (CVPR'25) [ImgDiff: Contrastive Data Synthesis for Vision Large Language Models](https:\u002F\u002Farxiv.org\u002Fabs\u002F2408.04594)\n \n- (TPAMI'25) [The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.08583)\n\n- (NeurIPS'25) [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https:\u002F\u002Farxiv.org\u002Fabs\u002F2502.04380)\n\n- (NeurIPS'25) [MindGYM: What Matters in Question Synthesis for Thinking-Centric Fine-Tuning?](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09499)\n\n- (基准数据) [HumanVBench: Exploring Human-Centric Video Understanding Capabilities of MLLMs with Synthetic Benchmark Data](https:\u002F\u002Farxiv.org\u002Fabs\u002F2412.17574)\n \n- (基准数据) [DetailMaster: Can Your 
Text-to-Image Model Handle Long Prompts?](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2505.16915)\n\n- (数据扩展) [BiMix: A Bivariate Data Mixing Law for Language Model Pretraining](https:\u002F\u002Farxiv.org\u002Fabs\u002F2405.14908)\n\n\u003C\u002Fdetails>","# Data-Juicer 快速上手指南\n\nData-Juicer (DJ) 是面向基础模型时代的数据操作系统，旨在将原始数据转化为 AI 就绪的智能数据。它提供模块化的算子（Operators），支持文本、图像、音频、视频及多模态数据的清洗、合成与分析，可无缝扩展从单机到千节点集群的处理任务。\n\n## 环境准备\n\n*   **操作系统**：Linux, macOS, Windows (推荐 Linux 以获得最佳性能)\n*   **Python 版本**：Python 3.8 - 3.11 (官方 Docker 镜像基于 Python 3.11)\n*   **前置依赖**：\n    *   `pip` 或 `uv` (推荐用于快速安装)\n    *   若需分布式处理，需安装 Ray (通常随包自动安装或按需配置)\n*   **硬件建议**：\n    *   小规模测试：普通笔记本电脑即可。\n    *   大规模处理：支持多核 CPU 集群；部分算子支持 CUDA 加速（需 NVIDIA GPU）。\n\n> **提示**：如果您希望零安装体验，可直接访问官方提供的 [JupyterLab 在线游乐场](http:\u002F\u002F8.138.149.181\u002F) 进行尝试。\n\n## 安装步骤\n\n推荐使用 `uv` 进行极速安装，也可使用标准的 `pip`。\n\n### 方式一：使用 uv 安装（推荐）\n\n```bash\nuv pip install py-data-juicer\n```\n\n### 方式二：使用 pip 安装\n\n```bash\npip install py-data-juicer\n```\n\n### 方式三：使用 Docker（适合生产环境）\n\n```bash\ndocker pull datajuicer\u002Fdata-juicer\n```\n\n## 基本使用\n\nData-Juicer 支持两种主要使用模式：**命令行配置驱动**（适合复杂流水线）和 **Python 代码驱动**（适合灵活开发）。\n\n### 模式一：命令行运行 (YAML 配置)\n\n这是最推荐的工程化用法。您可以定义一个 YAML 配置文件来描述数据处理流水线。\n\n1.  **执行示例命令**：\n    运行以下命令即可处理官方提供的演示数据：\n\n    ```bash\n    dj-process --config demos\u002Fprocess_simple\u002Fprocess.yaml\n    ```\n\n2.  **自定义配置**：\n    您可以复制 `demos` 目录下的 yaml 文件，修改其中的 `operators` 列表以组合不同的清洗或增强算子，然后指向您的本地数据路径运行。\n\n### 模式二：Python 代码调用\n\n适合在 Jupyter Notebook 或现有 Python 项目中直接嵌入数据处理逻辑。\n\n```python\nfrom data_juicer.core.data import NestedDataset\nfrom data_juicer.ops.filter import TextLengthFilter\nfrom data_juicer.ops.mapper import WhitespaceNormalizationMapper\n\n# 1. 加载数据 (此处示例为从字典加载，实际可从文件加载)\nds = NestedDataset.from_dict({\n    \"text\": [\"Short\", \"This passes the filter.\", \"Text   with   spaces\"]\n})\n\n# 2. 
定义处理流水线\n# 组合多个算子：先过滤短文本，再标准化空白字符\nres_ds = ds.process([\n    TextLengthFilter(min_len=10),\n    WhitespaceNormalizationMapper()\n])\n\n# 3. 查看结果\nfor s in res_ds:\n    print(s)\n```\n\n### 核心概念简述\n\n*   **Operators (算子)**：DJ 内置了 200+ 算子，分为 `Filter` (过滤)、`Mapper` (映射\u002F转换)、`Deduplicator` (去重) 等类型。\n*   **Recipes (食谱)**：指代完整的 YAML 处理流程配置，可在 [data-juicer-hub](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-hub) 社区仓库中找到大量最佳实践模板。\n*   **NestedDataset**：统一的数据结构，支持嵌套字段，兼容多种文件格式（JSON, Parquet, CSV, 压缩文件等）。","某大型电商团队正试图构建一个垂直领域的多模态客服大模型，需要处理海量且杂乱的用户咨询日志（包含文本对话、商品截图及语音录音）。\n\n### 没有 data-juicer 时\n- **脚本碎片化严重**：工程师需分别编写 Python 脚本清洗文本、调用独立工具处理图像分辨率、再用 FFmpeg 转换音频格式，维护成本极高且容易出错。\n- **数据质量不可控**：缺乏统一的去重和质量过滤机制，导致训练集中混入大量重复投诉、模糊图片或含敏感信息的录音，模型收敛缓慢且存在合规风险。\n- **扩展性瓶颈**：本地单机处理 GB 级数据尚可，一旦数据量升至 TB 级，原有“胶水代码”无法平滑迁移至集群，重构耗时数周。\n- **流程难以复现**：数据处理逻辑硬编码在脚本中，不同成员修改参数后无法追溯版本，导致模型效果波动时难以定位是数据还是算法问题。\n\n### 使用 data-juicer 后\n- **一站式流水线编排**：通过 YAML 配置文件即可串联起文本长度过滤、图片清晰度检测、语音静音切除等 200+ 算子，将原本分散的工具整合为统一的多模态处理流。\n- **智能化质量门禁**：利用内置的语义去重和敏感信息识别算子，自动剔除低质样本并脱敏，显著提升了训练数据的纯净度与安全性。\n- **云原生弹性伸缩**：data-juicer 支持从笔记本无缝切换至千节点集群，无需修改代码即可并行处理 PB 级数据，任务交付时间从数天缩短至数小时。\n- **可复用的配方管理**：团队将最佳实践固化为\"Recipe\"进行版本控制，任何成员均可一键复现相同的数据处理逻辑，确保实验结果稳定可靠。\n\ndata-juicer 将杂乱无章的原始数据转化为高质量智能燃料，让团队能专注于模型创新而非繁琐的数据清洗工程。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fdatajuicer_data-juicer_c8f09249.png","datajuicer","DataJuicer","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fdatajuicer_9285a16b.jpg","Data processing for and with large models.",null,"datajuicer@outlook.com","https:\u002F\u002Fgithub.com\u002Fdatajuicer",[80,84,88,91,95],{"name":81,"color":82,"percentage":83},"Python","#3572A5",99.7,{"name":85,"color":86,"percentage":87},"C++","#f34b7d",0.1,{"name":89,"color":90,"percentage":87},"Shell","#89e051",{"name":92,"color":93,"percentage":94},"Dockerfile","#384d54",0,{"name":96,"color":97,"percentage":94},"Cython","#fedf5b",6303,359,"2026-04-18T07:18:33","Apache-2.0","Linux","非必需（支持 CPU 
运行），可选 NVIDIA GPU 用于加速（提及 CUDA 加速及 Docker 镜像基于 CUDA 12.6.3）","未说明（大规模处理依赖集群节点数，示例提及 6400 核处理 70B 样本）",{"notes":106,"python":107,"dependencies":108},"该工具支持从单机扩展到千节点集群（基于 Ray 分布式框架）。官方提供 Docker 镜像（基于 Ubuntu 24.04 + CUDA 12.6.3 + Python 3.11）。支持多种数据格式（包括压缩的 jsonl.gz）及 S3 存储。拥有 200+ 算子覆盖文本、图像、音频、视频及多模态数据。可通过 YAML 配置文件或 Python 代码构建处理流水线。","3.11+",[109,110,111],"ray","py-data-juicer","vLLM (可选)",[14,113,35,16],"其他",[115,116,117,118,119,120,121,122,123,124,125,126,127,128],"data-analysis","data-science","large-language-models","llm","data-visualization","llms","instruction-tuning","pre-training","multi-modal","synthetic-data","data","data-pipeline","data-processing","foundation-models","2026-03-27T02:49:30.150509","2026-04-20T04:04:06.526484",[132,137,142,147,152,157],{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},43734,"在 Data-Juicer 1.0 论文中，复现 RedPajama Book 数据集时使用的是哪个配置文件？为什么文档中的运行时间与论文图 8 不一致？","论文中的实验结果基于特定的软硬件环境。GitHub 仓库中 `configs\u002Freproduced_redpajama\u002F` 目录下的文档结果是代码库开源前的初步实验，已过时且未同步更新，因此与论文图 8 的结论（如运行时间）存在差异。这类实验的绝对数值对硬件非常敏感（例如 RedPajama 团队在 112 核下需约 24 小时）。建议用户在相同的机器和硬件条件下，以自己重跑的结果为准，不要直接对比文档中的旧数据。","https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Fissues\u002F729",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},43735,"如何下载 YoukuMPLUG 视频数据集？modelscope 新版库无法下载怎么办？","最新的安装和使用方法已更新，请参考 Youku-AliceMind 的数据集页面：https:\u002F\u002Fmodelscope.cn\u002Fdatasets\u002Fmodelscope\u002FYouku-AliceMind 。请依据该页面的最新指引进行下载，旧版的下载方式可能已失效。","https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Fissues\u002F341",{"id":143,"question_zh":144,"answer_zh":145,"source_url":146},43736,"停用词过滤器（stopwords_filter）所需的 JSON 文件格式是什么？Key 应该是什么？","虽然 Issue 中主要确认了 `flagged_words`（标记字过滤）默认用于过滤脏话、色情等词汇，但根据参数描述，停用词文件应为 JSON 格式，文件名需包含\"stopwords\"。通常此类文件的 Key 为文本内容或特定标识，Value 为标记状态。若需具体格式示例，建议查看源码中 `stopwords_filter` 的实现或直接参考官方提供的示例停用词文件结构。对于 
`flagged_words`，其作用确实是过滤不良词汇。","https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Fissues\u002F124",{"id":148,"question_zh":149,"answer_zh":150,"source_url":151},43737,"运行大规模数据去重时，进程因内存不足被系统 Kill（OOM），如何解决？","如果是云服务器环境，最直接的解决方案是重新扩充机器资源（增加内存）。如果是在本地或非弹性环境中，需要尝试扩大系统的内存限制（limit），或者优化配置以减少单次处理的内存占用（例如调整分片大小或减少并行进程数 np）。","https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Fissues\u002F90",{"id":153,"question_zh":154,"answer_zh":155,"source_url":156},43738,"在 Kubernetes Ray Cluster 上使用 Ray Executor 处理数据时，报错\"Global Node is not initialized\"怎么办？","该问题通常发生在本地 Docker 容器连接外部 Ray 集群时。确保正确转发了 Ray Head 节点的端口（如 10001 用于客户端连接，8265 用于 Dashboard）。在配置中将 `ray_address` 从 `'auto'` 修改为具体的地址，例如 `ray:\u002F\u002Fhost.docker.internal:10001`。如果问题依旧，可能需要检查 Docker 网络配置是否允许容器访问宿主机网络，或确认 Ray 集群版本兼容性（曾尝试通过 PR #348 修复类似问题）。","https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Fissues\u002F345",{"id":158,"question_zh":159,"answer_zh":160,"source_url":161},43739,"使用 formatter 调用 datasets.load_dataset 加载本地 JSON 文件时，为什么 HF_DATASETS_CACHE 环境变量不生效，缓存仍写入默认路径？","这是一个已知问题，`LocalFormatter.load_dataset` 在调用 HuggingFace `datasets.load_dataset` 时，可能未正确继承或传递系统指定的 `HF_DATASETS_CACHE` 环境变量，导致缓存仍然写入默认的 `~\u002F.cache\u002Fhuggingface\u002Fdatasets`。解决方法包括：1. 在运行脚本前显式 export 该环境变量；2. 检查代码中是否有硬编码路径；3. 
升级到最新版本查看是否已修复此 Bug。","https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Fissues\u002F41",[163,168,173,178,183,188,193,198,203,208,213,218,223,228,233,238,243,248,253,258],{"id":164,"version":165,"summary_zh":166,"released_at":167},351161,"v1.5.1","# 重大更新\n\n* 📊 统计：合并了 13 个 PR，来自 7 位贡献者\n* 📄 推出了两个新的 LaTeX 相关 Mapper OP，扩展了 data-juicer 的文档处理能力，支持 `.tex` 压缩包和图表上下文的提取。\n* 🗜️ 支持压缩数据集格式：现在可以直接加载 `json[l].gz` 文件，Ray 数据集也正式支持读取压缩的 JSON 文件。\n* 📚 新增了关于缓存、导出和追踪工作流的文档，帮助用户更好地理解并调试数据处理流水线。\n* 🤖 data-juicer-agents 完成了大规模重构与升级：项目架构以及 CLI\u002F会话功能得到了全面重新设计，以提升可维护性和扩展性。更多详情请参阅 [data-juicer-agents](https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer-agents)。\n\n# 新增 OP\n\n* `latex_merge_tex_mapper`：如果你有一堆 `.tex` 文件被打包在一个压缩包里，这个 OP 可以自动解压并合并成一个统一的 LaTeX 文档，从而更轻松地处理多文件的 LaTeX 项目。#932\n* `latex_figure_context_extractor_mapper`：从 LaTeX 源文件中提取与图表相关的上下文信息（如标题、周围段落等），方便你基于学术论文构建更丰富的多模态数据集。#923\n\n# 功能增强\n\n* **通过额外参数加载数据集**：现在可以通过新的 `load_dataset_kwargs` 配置字段向 `datasets.load_dataset()` 传递任意额外参数，这对于需要非标准加载选项的数据集非常实用。#922\n* **`RemoveRepeatSentencesMapper` 中支持自定义分词器**：该 Mapper 现在可以接受自定义分词器，不再局限于默认的句子拆分方式，非常适合非英语文本或特定领域的分词需求。#925\n* **压缩 JSON 支持**：新增了直接读取 `json[l].gz` 文件的功能，并修复了 Ray 数据集对压缩 JSON 的正确处理问题——无需再手动解压即可导入数据。#919\n* **`TokenNumFilter` 批量分词加速**：`TokenNumFilter` 现在不再逐条样本进行分词，而是整批一次性处理，显著提升了基于 token 数量的过滤效率。#929\n* **重复检测过滤器中缓存冗余的 `sum()` 调用**：此前重复检测过滤器会对同一份数据多次调用 `sum()`，现在这些结果会被缓存起来，从而避免在大型批次上进行不必要的重复计算。#924\n* **新增文档：缓存、导出与追踪**：添加了专门的文档页面，详细介绍了 data-juicer 如何处理缓存、结果导出以及执行追踪，这对调试复杂流水线来说是非常必要的补充。#935\n* **`op_search` 功能增强：支持 BM25、正则表达式搜索，并升级 MCP 服务器**：为 op_search 增加了 BM25 和正则表达式搜索模式（不再依赖 dj-agents），同时扩展了 MCP 服务器，新增了四个工具，分别用于 OP 搜索、数据集分析、配置 Schema 获取以及数据集加载策略发现。#937\n\n# 问题修复\n\n* **`ImageFaceCountFilter` 中缓存键错误**：该…","2026-03-17T09:12:26",{"id":169,"version":170,"summary_zh":171,"released_at":172},351162,"v1.5.0","# 主要更新\n- 📊 统计：共有244个文件被修改，新增22,394行、删除2,053行，由12位贡献者完成。\n- 🗂️ 新的分区化Ray执行器：#748\n  - 支持数据分区、检查点保存和事件日志记录等功能，适用于Ray模式。\n  - 
提升了容错性、可扩展性、可观测性、灵活性以及处理性能。\n- 🤖 面向具身AI的新OP：增强了对相机视角视频的处理能力。\n- 🧩 在Ray模式下支持OP级别的隔离环境管理，以解决不同OP之间的依赖冲突问题。#892\n  - 允许以不同策略合并具有共同依赖的不同OP的运行环境，并复用已创建的环境。\n  - 基于Ray运行时环境实现。\n\n# 新增OP\n- `video_camera_calibration_static_deepcalib_mapper`：使用DeepCalib计算静态相机的内参和视场角（FOV)。#871\n- `video_camera_calibration_static_moge_mapper`：使用Moge-2计算静态相机的内参和视场角（FOV）。#871\n- `video_undistort_mapper`：根据对应的相机内参和畸变系数对原始视频进行去畸变处理。#871\n- `video_hand_reconstruction_hawor_mapper`：利用HaWoR和MoGe-2进行手部重建。#893\n- `video_camera_pose_mapper`：通过MegaSaM和MoGe-2提取相机位姿。#894\n\n# 功能增强\n- 允许对`image_captioning_mapper`进行批量推理，以提升处理性能。#901\n- 优化分支逻辑，避免不必要的函数调用。#903\n- 重构“算子搜索”和“元数据提取”模块，以提高准确性。#889\n- 允许`extract_keyframes`函数仅返回元信息，并在错误日志中移除样本信息，从而减小日志体积。#904\n- 通过仅遍历指定列的方式，降低`convert_to_absolute_paths`函数的内存占用。#907\n- 重新组织主README文件，并将游乐场中的教程更新至最新版本。#908\n- 优化问题模板：强调英文使用，并增加Q&A Copilot检查功能。#912\n- 将对象存储中的数据集路径转换为绝对路径。#913\n\n# 修复的Bug\n- 修复了MinHash去重器无法追踪所有重复项的问题。#906\n- 修复了TextFormatter中“num_proc参数存在多个值”的Bug。#905\n- 修复了首页渲染问题，并移除了过时的OP文档。#910\n- 修复了测试稳定性和鲁棒性方面的若干Bug。#918\n\n# 致谢\n- @dubin555帮助提升了部分OP和函数的处理性能。#901 #903\n- @HunterLine协助修复了MinHash去重器中无法追踪所有重复项的Bug。#906\n\n## 新贡献者\n* @HunterLine 在 https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Fpull\u002F906 中完成了首次贡献。\n* @Dludora 在 https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Fpull\u002F907 中完成了首次贡献。\n\n## 全体贡献者\n@HYLcool @dubin555 @claude @Qirui-jiao @cmgzn @Cathy0908 @Dludora @yxdyc @gemini-code-assist @H","2026-02-26T05:06:38",{"id":174,"version":175,"summary_zh":176,"released_at":177},351163,"v1.4.6","## 主要更新\n- 🤖 推出问答助手，用于解答用户问题。目前该机器人已在文档、钉钉群、Discord 等渠道上线。#891\n- 🎬 视频字节流的 I\u002FO：支持视频的字节读取与存储。#882\n- 🫆 Ray 模式下的追踪器：现在追踪器支持在 Ray 模式下追踪已更改的样本。#885\n\n## 功能增强\n- 为具身智能用例准备新的 Dockerfile，并更新基础 Docker 镜像中的 CUDA、系统等版本。#887\n- 在文档中添加 Copilot 新闻、优化后的钉钉链接\u002F二维码以及 Discord 链接\u002F二维码。#891\n- 将单词检索从列表改为集合，以加快两项操作的速度。#890\n- 添加新工作流，自动从 GitHub Insights 获取流量报告。#899 #900\n\n## 问题修复\n- 修复在 YAML 配置中使用 `RequiredFieldsValidator` 的 `field_types` 
时出现的 `TypeError`。#886\n- 将 `ray.data.Dataset.map_batches()` 调用中已弃用的 `compute` 参数替换为 `concurrency` 参数。#888\n- 防止在 Ray 集群未就绪时，`calculate_ray_np` 函数中出现除零错误。#864\n- 为多进程工作负载添加线程限制，以防止过度订阅。#877\n- 修复独立模式单元测试可能卡死的问题。#896\n- 更新文档中若干过时的链接。#898\n\n## 致谢\n- @dubin555 帮助修复了多个 bug，并提升了部分操作的处理性能。#886 #890\n- @xyuzh 帮助将部分操作中的 Ray 使用更新至最新版本，修复了一些 bug，并优化了并行策略。#888 #864\n- @XinyuLiu1999 帮助修复了多进程工作负载中因过度订阅导致的 bug。#877\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fdatajuicer\u002Fdata-juicer\u002Fcompare\u002Fv1.4.5...v1.4.6","2026-02-02T12:53:38",{"id":179,"version":180,"summary_zh":181,"released_at":182},351164,"v1.4.5","# 重大更新\n- 新增多项适用于具身人工智能的 OP。\n- 文档系统升级：#842\n  - 将文档生成与部署迁移至统一的基于 Sphinx 的框架。\n  - 架构与样式以独立仓库形式维护，在构建各子仓库文档前会先拉取该仓库。\n\n# 新增 OP\n## Mapper\n- `video_captioning_from_vlm_mapper`：利用最新 VLM（如 Qwen3-VL）生成视频字幕。#820\n- `video_object_segmenting_mapper`：使用 YOLOE 和 SAM2 对视频中的有效目标进行文本引导的语义分割，并支持保存分割可视化结果。#801\n- `video_depth_estimation_mapper`：对视频进行深度估计，支持保存可视化结果及点云数据。#801\n- `image_mmpose_mapper`：使用 MMPose 模型进行人体关键点检测推理。#800\n- `image_tagging_vlm_mapper`：利用 VLM 生成图像标签。#800\n- `image_sam_3d_body_mapper`：借助可提示模型 SAM 3D Body (3DB)，实现单张图像中的人体全身三维网格重建（HMR）。#843\n- `s3_download_file_mapper`：从 S3 下载文件至本地或加载至内存。#839\n- `s3_upload_file_mapper`：将本地文件上传至 S3，并更新为 S3 URL 路径。#839\n\n## Filter\n- `text_tagging_by_prompt_mapper`：使用 LLM 结合提示词生成文本标签。#408\n- `image_subplot_filter`：检测并移除包含子图的样本。#840 #822\n- `video_motion_score_ptlflow_filter`：一种新的运动分数过滤器，其光流计算由 ptlflow 库完成。#824\n\n## Deduplicator\n- `document_minhash_deduplicator_with_uid`：针对每个样本具有唯一 ID 的数据集，提供更鲁棒的 `document_minhash_deduplicator` 版本。#832 #677\n- `ray_bts_minhash_deduplicator_with_uid`：同样适用于具有唯一 ID 样本的数据集，提供更鲁棒的 `ray_bts_minhash_deduplicator` 版本。#832 #677\n- `ray_bts_minhash_cpp_deduplicator`：通过将计算密集型部分迁移到 C++ 和 Cython，提升基础 BTS MinHash 去重器的性能。#851\n\n## Pipeline\n一种新型 OP，允许将多个 OP 组合成一个管道，或将难以拆分为多个原子 OP 的完整流程整合在一起。#835\n- `ray_vllm_engine_pipeline`：用于利用 Ray 的 vLLM 引擎的基础 OP。\n- `llm_ray_vllm_engine_pipeline`：在 Ray 上使用 vLLM 
引擎生成 LLM 回答。\n- `vlm_ray_vllm_engine_pipeline`：在 Ray 上使用 vLLM 引擎生成 VLM 回答。\n\n# 功能增强\n- 更新 data-juicer 的若干主要依赖库至（接近）最新版本。#820\n- 重命名并统一基础 OP 中关于资源分配的核心参数，使其与 Ray 的参数保持一致。#837\n- 优化首页徽章，提升用户体验。#841\n- 针对部分视频 OP，支持指定 ffmpeg 的额外参数。#847\n- 更新“被使用情况 & 有价值的反馈”部分。","2026-01-13T06:36:37",{"id":184,"version":185,"summary_zh":186,"released_at":187},351165,"v1.4.4","## 重大更新\n- 🎉 NeurIPS 2025 新闻更新：我们的 Data-Juicer 2.0 论文被接受为 **NeurIPS'25 Spotlight**（所有投稿中排名前 3.1%）！此外，我们另外两篇工作也同时被 **NeurIPS'25** 接受。#788\n- 🧩 沙盒组件、data-juicer recipes 和 data-juicer agents 已正式从主仓库中分离，分别成为 data-juicer-sandbox、hub 和 agents，以支持独立开发和更快的迭代。#817 #827 #830\n- 🤝 S3 I\u002FO 支持：在数据加载器和导出器中新增 S3 支持，实现与云存储的无缝集成。#806\n\n## 新增 OP\n- `detect_main_character_mapper`：根据给定图像及其标题，提取所有主要角色名称。#795\n- `detect_character_locations_mapper`：给定一张图像和主要角色名称列表，提取画面中每位角色的边界框。（YOLOE + MLLM）#795\n- `detect_character_attributes_mapper`：以图像、标题及主要角色名称作为输入，提取角色的各项属性。#795\n- `vggt_mapper`：输入一段单场景视频，利用 VGGT 提取相机姿态、深度图、点云图以及 3D 点轨迹等信息。#804\n- `video_whole_body_pose_estimation_mapper`：输入包含人物的视频，使用 DWPose 模型提取视频中人物的身体、手部、脚部和面部关键点，即进行 2D 全身姿态估计。#812\n- `video_hand_reconstruction_mapper`：使用 WiLoR 模型进行手部定位与重建。#818\n\n## 功能增强\n- 完善了算子详情文档，大幅扩展了效果演示和使用示例的覆盖范围，并优化了首页样式，提升可读性。#778 #819\n- 在日志记录器设置中新增 Notebook 检测与自动重定向功能，以改善 Jupyter 环境下的用户体验。#790\n- 优化了 build_op_doc 钩子，使文档生成更加可靠。#794\n- 改进了 Ray 模式下 num_proc 的自动计算逻辑，从而提升各算子的资源利用率。#789 #825\n- 在 WebDataset I\u002FO 中新增对视频和音频的支持，扩展了多模态数据处理能力。#803\n- 更新了项目中的仓库 URL 和链接，确保一致性和正确性。#805\n- 在视频数据处理中新增 FFmpeg 和 Decord 后端支持，提升了灵活性和性能。#826 #829\n- 增加了 MCP 服务的 CLI 入口，便于模块化部署，并更新了 MCP 相关文档。#798\n\n## 问题修复\n- 修复了沙盒中的 Auto Prompt 流水线，恢复了正确的提示词生成行为。#791\n- 通过在资源工具函数中正确传递配置参数，修复了 Ray 连接错误。#808\n- 修复了多个基于 CUDA 的算子，使其能够使用内部资源监控机制。#809\n- 修复了自定义算子模块加载问题，并优化了 video_extract_frames_mapper，以更好地保存提取的帧。#803\n- 重置了 vLLM 的 num_proc 参数，并将 CUDA 算子的默认 batch_size 设置为 10，以改善性能。","2025-12-01T04:06:15",{"id":189,"version":190,"summary_zh":191,"released_at":192},351166,"v1.4.3","## 主要更新\n- 🤝 OP 文档更新：优化了多版本文档；文档字符串由通义千问Max重写并增强。#755 
#765 #768 #769 #787 \n- 💪🏻 自动并行化优化：支持为每个 OP 指定 CPU\u002FGPU\u002F内存需求；优化了 ray 模式下的 `calculate_np`。#679 #774 #782 #786 \n- 🛠️ 沙盒优化：支持迭代式流水线和提前停止目标；重构了上下文信息；新增了自动提示优化示例及若干相关钩子。#757 \n- 📈 因旧版本被撤销，将 spacy 从 3.8.0 升级至 3.8.7。#763 \n\n## 新增 OP\n- `image_detection_yolo_mapper`: 对图像进行目标检测（使用 YOLO），并返回边界框值和类别标签。#764 \n- `optimize_prompt_mapper`: 基于现有提示词进行优化。#757 \n\n## 功能增强\n* 支持在 RayExporter 的 `export_extra_args` 中为写入方法指定 shard_size 和额外参数。#739 \n* 支持 min\u002Fmax_closed_interval 参数，以控制使用开闭区间进行过滤；同时新增 reversed_range 参数，允许保留指定范围之外的样本。#741 \n* 现有 `optimize_qa_mapper` 支持 API 模型。#771 \n\n## 修复的 Bug\n- 修复并重新启用已禁用的 op_list_to_trace 参数。#766 \n- 为分叉仓库中的多个基于 API 的测试用例添加缺失的 `skip` 标签。#767 \n- 将 `transformers` 版本限制为 \"\u003C4.55.0\"，以避免对 None 值进行计算。#781 \n- 修复若干工具中过时的调用方式。#785（源自 issue #750）\n- 修复 API 服务中的 500 错误。#785（源自 issue #777）\n- 从 NON_STATS_FILTER 中移除 `specified_xxx_filter`。#785（源自 issue #783）\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fcompare\u002Fv1.4.2...v1.4.3","2025-09-11T09:11:03",{"id":194,"version":195,"summary_zh":196,"released_at":197},351167,"v1.4.2","## 主要更新\n- 💪🏻 Data-Juicer 现已兼容 Python 3.11 和 3.12。#749\n- 🧩 新增了 5 个用于 **数据归因** 的 OP。#735\n- 🤝 现在，Data-Juicer 支持通过 `custom_operator_paths` 参数，在外部路径中注册和应用自定义 OP。#758\n- 🔧 由于 `uv` 能够有效解决依赖冲突问题，现在安装 Data-Juicer 时优先推荐使用 `uv`。#760\n\n## 新增算子\n### 过滤类\n- 无需验证\n  - `llm_perplexity_filter`: 根据指定 LLM 计算的困惑度分数，过滤出该分数落在特定范围内的样本。#735\n  - `instruction_following_difficulty_filter`: 根据指令遵循难度（IFD，https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.12032），过滤出 IFD 值落在特定范围内的文本。#735\n- 需要验证\n  - `in_context_influence_filter`: 根据上下文对验证集的影响程度，过滤出该影响程度落在特定范围内的文本。#735\n  - `llm_task_relevance_filter`: 根据 LLM 评估的与验证任务的相关性得分，过滤出相关性较高的样本。#735\n  - `text_embd_similarity_filter`: 根据文本嵌入向量与一组给定验证文本的平均相似度，过滤出该相似度落在特定范围内的文本。#735\n\n## 功能增强\n- 新增环境变量 `DATA_JUICER_EXTERNAL_MODELS_HOME`，用于指定存储外部或只读模型的私有路径。#740\n- 优化文档中的视频链接转换及多版本维护，并将演示视频更新为更高分辨率的版本。#746\n- 支持为生成额外多模态数据的 OP 自定义保存目录。#751\n- 添加关于 Data-Juicer Agent 
的官方详细文档。#759\n- 完善单元测试：显示当前测试用例的名称；针对 Ray 模式，在每个测试用例结束后回收资源。#749\n- 优化开发者指南，以更好地指导新 OP 的开发实践。#760\n\n## 修复的 Bug\n- 将多模态数据特殊标记的更新逻辑移至 `base_op` 初始化阶段，解决了并行处理数据时特殊标记可能无法与主进程同步的问题。#752\n- 修复了一些测试用例。#754\n\n## 致谢\n* @ShenQianli 首次贡献了 5 个新的 OP。#735\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fcompare\u002Fv1.4.1...v1.4.2","2025-08-18T03:22:56",{"id":199,"version":200,"summary_zh":201,"released_at":202},351168,"v1.4.1","## 主要更新\n- 🔧 引入 Data-Juicer MCP 服务器。用户可以便捷地以 MCP 方式使用数据处理功能。#690 #737\n- 💪🏻 单元测试覆盖率提升至 85% 以上，并修复了多个测试用例中的问题（如 OOM、编码错误等），使 Data-Juicer 更加可靠。#698 #717 #720 #727\n- 🤝 与 NVIDIA 开发者合作，支持基于 GPU 的 Minhash 去重功能。#694 #644\n- 🧩 RayExporter 不仅支持导出 JSON\u002FJSONL 格式的 Ray 数据集，还新增了更多格式的支持。#687\n- 🎥 新增两段演示视频，介绍 Data-Juicer 的核心功能、代理式用法以及沙盒环境。#738\n\n## 新算子\n- `download_file_mapper` 可从 URL 下载数据到本地文件或指定字段中。#709\n\n## 功能增强\n- 新增分析方法：统计指标之间的相关性分析。#663\n- 更新并锁定多个核心依赖到较新版本，解决了依赖冲突问题。#715 #717 #723\n- 沙盒中的 EasyAnimate 流水线已更新，以适配沙盒的重构工作。#710\n- 应用更可靠的 pre-commit 工具，进一步优化 Data-Juicer 的代码风格。#714\n- 支持在数据集中存储和处理图像的字节数据。#725\n\n## 问题修复\n- 修复了 wheel 包和 Docker 镜像构建时的 bug。#706\n- 修复了 log_summarization 中的 bug。#710\n- 修复了通过 wheel 文件安装后出现的“no module named data_juicer”错误。#727\n\n## 致谢\n- @fanronghai 帮助修复了 dataset_splitting_by_language 工具中的参数错误。#713\n- @ayushdg 帮助实现了 GPU 版本的 Minhash 去重器。#644\n- @ricksun2023 帮助修复了配置中存在多个同名算子时的 bug。#730\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fcompare\u002Fv1.4.0...v1.4.1","2025-07-16T13:05:20",{"id":204,"version":205,"summary_zh":206,"released_at":207},351169,"v1.4.0","摘要：共修改了200多个文件，新增18,535行代码，删除3,720行代码。\n\n---\n\n### 🔧 重大重构与改进\n\n- **🔄 沙盒可用性** (#686)：  \n  - 支持**多管道**、**上下文信息**以及**环境管理器**，可在不同环境中运行各类命令。  \n  - 包含**InternVL 示例**作为展示。\n\n- **📘 DJ-Doc 重新设计** (https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F675)：  \n  - 现已支持**多语言（英语\u002F中文）**，并采用**现代化风格**。\n\n- **📦 依赖管理更新** (#660, #680)：  \n  - 迁移到**`uv`**工具，以提升依赖解析速度。  \n  - 
新增**子分组**，便于更好地组织依赖。\n\n---\n\n### 🌍 新功能与集成 (#683, #688, #692)\n\n- **🆕 新增支持的仓库**：  \n  - [Trinity-RFT](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002FTrinity-RFT) 现已纳入 Data-Juicer 的支持范围。\n\n- **📜 DJ-Awesome-List**：  \n  - 一篇综述论文已被**TPAMI'25**接收！\n\n- **🧪 新增合成基准测试**：  \n  - [DetailMaster](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2505.16915)——一项用于评估合成数据的新基准测试。\n\n- **🛠️ 新增算子** (#673, #701)：  \n  - `llm_analysis_filter`  \n  - `general_field_filter`  \n\n---\n\n### 🚀 核心优化与问题修复\n\n- ✅ **Ray 执行器增强** (#697)：  \n  - 增加了文件扩展名检测功能。  \n  - 支持更多数据格式。\n\n- ⏱️ **启动时间优化**：  \n  - 提升了启动性能。（[#684](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F684)）\n\n- 🧠 **文本嵌入支持**：  \n  - 新增通过**API**和**本地模型**进行文本嵌入的支持。（[#681](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F681)）\n\n- 🐳 **Docker 构建改进**：  \n  - 在构建 Docker 镜像时忽略已安装的 `distutils` 库。（[#668](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F668)）\n\n- 🛠️ **Mapper 模块修复**：  \n  - 修复了模块初始化时的问题。（#700）\n\n- 🗑️ **警告抑制**：  \n  - 抑制了来自 **fasttext** 的不必要的警告。（[#696](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F696)）\n\n---\n\n### 📚 完整变更日志\n\n[查看自 v1.3.3 以来的所有变更 →](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fcompare\u002Fv1.3.3...v1.4.0)","2025-06-13T11:43:56",{"id":209,"version":210,"summary_zh":211,"released_at":212},351170,"v1.3.3","## 重大更新\n* 🎉 我们的工作 [Data-Juicer Sandbox](https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.11784) 已被 _ICML 2025_ 接受为 **Spotlight** 论文（在所有提交中排名前 2.6%）！\n* 为 [Img-Diff](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.04594) 新增 OP 和配方。#658 \n\n## 功能增强\n* 为两个 llm_xxx_score_filter OP 添加对 HF LLM 的支持。#655 \n* 将 Docker 镜像同步到阿里云 OSS，以便在无法访问 Docker Hub 时进行下载。#657 \n* 将独立测试和分布式测试拆分，以节省重新运行失败测试的时间。#666 \n\n## 问题修复\n* 修复 `unify_format` 中可能缺失的 cfg 配置。#653 \n* 提升部分文档的清晰度并修复错误链接。#659 \n\n## 致谢\n* @co63oc 帮助修正了一些错别字。#654 \n\n\n**完整变更日志**: 
https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fcompare\u002Fv1.3.2...v1.3.3","2025-05-09T10:20:47",{"id":214,"version":215,"summary_zh":216,"released_at":217},351171,"v1.3.2","## What's Changed\r\n* Human OP enhancements, in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F642 https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F645\r\n    * update label-studio version\r\n    * make service script more robust\r\n    * add documentation \r\n    * optimize fields mapping  \r\n* OP efficiency optimization of `document_minhash_deduplicator`, in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F639\r\n* set temp_parser.usage to argparse.SUPPRESS to skip overly verbose help logs in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F643\r\n* fix date typo in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F648\r\n* Fix docker building failure in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F650\r\n* Fix StreamToLoguru compatibility issue with torch._dynamo in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F651\r\n* add init file for annotation module, fix dj-process command error in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F652\r\n\r\n## New Contributor\r\n* @cmgzn made their first contribution in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F651","2025-04-25T11:17:23",{"id":219,"version":220,"summary_zh":221,"released_at":222},351172,"v1.3.1","## Major Updates\r\n* 💥 Prototype implementation for HumanOps (annotation). 
#617  Included features:\r\n  * boilerplate code for supporting label studio powered human annotation ops\r\n  * a human preference annotation reference implementation is provided\r\n  * label studio service script; can start up local instance using docker or pip, whichever is available\r\n  * reference configs and data\r\n  * event driven and notification mixins framework for ops\r\n\r\n## New OPs\r\n- `extract_tables_from_html_mapper`: extract tables from html texts. #634 \r\n- `general_fused_op`: an explicitly fused operator designed to execute multiple sequential operations (OPs) on the same batch, enabling fine-grained control over data processing. #626 \r\n\r\n## Bug Fixed\r\n* fix dataset builder initialization failure #630 \r\n* update Executor references from Executor to DefaultExecutor #632 #633 \r\n* switch the backend of `plt` to avoid sub-process\u002Fthread error #633 \r\n* fix some boundary condition bugs in several deduplicators #635 #637 \r\n\r\n## Others\r\n- check dataset when loading to support to pass dataset in the `DefaultExecutor.run` method. #633 \r\n- update docs to highlight light env installation part. #636 \r\n\r\n## Acknowledgement\r\n- @liuyuhanalex helps to add a new OP and fix some of the boundary condition bugs. 
#634 #635 \r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fcompare\u002Fv1.3.0...v1.3.1","2025-04-11T09:48:47",{"id":224,"version":225,"summary_zh":226,"released_at":227},351173,"v1.3.0","## The Big Change 🚀\r\nRefactor of dataset builder and executor, see https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F537, @cyruszhang\r\n📜 YAML explicitly defines different sources of datasets; local and remote are defined separately.\r\n🔧 More flexible parameterized control; supports source-specific parameters, validations, and extensible configurations.\r\n🔌 Unbind Executor's hardcode support: No longer restricted to local JSON formats; input format is determined dynamically via formatters\u002Fdownloaders.\r\n🚀 Enhanced Executor extensibility to natively support engines like Nemo, Dask, Spark, etc.\r\n🔍 Add data format validation to ensure consistency and correctness.\r\n🌐 Expanded data source support:\r\na. 📦 ModelScope integration.\r\nb. 📚 ArXiv dataset import (download, decompress, ingest).\r\nc. 📚 Wikipedia dataset support (download, decompress, ingest).\r\nd. 🌐 Common Crawl integration (download, decompress, ingest).\r\n🔗 Backward compatibility with existing dataset_path command-line syntax.\r\n🔀 Support for data mixtures to combine multiple datasets dynamically.\r\n🔧 Support for empty formatters\u002Fgenerated datasets without pre-defined config files.\r\n\r\n## Others 💡\r\n🔊 New audio processing operator: audio_add_gaussian_noise ([PR #622](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F622)), @liuyuhanalex \r\n📊 Added dynamic coverage rate badge to the README for transparency ([PR #625](https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F625)) \r\n","2025-03-28T12:08:36",{"id":229,"version":230,"summary_zh":231,"released_at":232},351174,"v1.2.2","# Major Updates\r\n- 🧪 Add document for API service. 
Add parameter transmission using `json.dumps` to support API calls for arbitrary registration functions and classes. #613\r\n- 🚀 Add unit tests for the analysis module and utils module to increase test coverage. #604 #616\r\n- ![new](https:\u002F\u002Fimg.alicdn.com\u002Fimgextra\u002Fi4\u002FO1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) A new data synthesis method is proposed, which encourages LLMs to self-generate challenging cognitive questions, achieving superior data efficiency, cross-modality generalization, and SFT effects over SOTA baselines (e.g., *16%* gain on [MathVision](https:\u002F\u002Fmathllm.github.io\u002Fmathvision\u002F#leaderboard) using only *400 samples*). See more details in [MindGym: Enhancing Vision-Language Models via Synthetic Self-Challenging Questions](https:\u002F\u002Farxiv.org\u002Fabs\u002F2503.09499).\r\n\r\n# New OPs\r\n- `llm_quality_score_filter`: Filter to keep samples with a high quality score estimated by an LLM; supports both API calling and local vLLM calling. #606 #614 #620\r\n- `llm_difficulty_score_filter`: Filter to keep samples with a high difficulty score estimated by an LLM; supports both API calling and local vLLM calling. #606 #614 #620\r\n\r\n# Others\r\n- Fix config in LLaVa pretrain recipe. #610\r\n- Update news for MindGYM and fix doc. #615\r\n- Fix decode error through UTF-8 decoding. #618\r\n","2025-03-14T09:58:23",{"id":234,"version":235,"summary_zh":236,"released_at":237},351175,"v1.2.1","## Major Updates\r\n- ![new](https:\u002F\u002Fimg.alicdn.com\u002Fimgextra\u002Fi4\u002FO1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) DJ has been integrated into [Ray's official Ecosystem](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Flatest\u002Fray-overview\u002Fray-libraries.html) and [Example Gallery](https:\u002F\u002Fdocs.ray.io\u002Fen\u002Flatest\u002Fdata\u002Fexamples\u002Fdata_juicer_distributed_data_processing.html). 
Besides, our patch in DJ2.0 for the streaming JSON reader has been officially merged into [Apache Arrow](https:\u002F\u002Fgithub.com\u002Fapache\u002Farrow\u002Fpull\u002F45084). \r\n- ![new](https:\u002F\u002Fimg.alicdn.com\u002Fimgextra\u002Fi4\u002FO1CN01kUiDtl1HVxN6G56vN_!!6000000000764-2-tps-43-19.png) Our work on contrastive data synthesis, [ImgDiff](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2408.04594), has been accepted by *CVPR 2025*! \r\n- Unit test optimization:\r\n  - split unit tests into partial and regression suites: the partial suite is triggered by each PR and only runs the test cases of changed files; the regression suite runs all cases and is triggered at 7:00 every Friday, Beijing time. #598 \r\n  - use the primitive `@unittest.skip` and remove `SKIPPED_TESTS`. #586 \r\n  - upload test coverage reports to GitHub artifacts. #586 \r\n\r\n## New OPs\r\n\r\n- `image_remove_background_mapper`: remove the background of images. #589 \r\n\r\n## Others\r\n- add the missing LOADED_AUDIOS to ALL_INTER_VARS to enable OP fusion and context sharing. #585 \r\n- only build docs for py3.10. #586 \r\n- move the dependency on `ray` to minimal requirements. #586 #594 #595 \r\n- allow the executor and other tool functions to consume a loaded dataset in addition to the config file. #596 #597 \r\n- fix the undefined `fileno` bug of the logger. #594 \r\n\r\n## Acknowledgement\r\n- @liuyuhanalex helps simplify the code logic of OP fusion, add a new OP `image_remove_background_mapper`, and fix some minor bugs. #581 #585 #589 \r\n- @co63oc helps to fix typos in code and documents. #582 #583 #588 #591 #593 \r\n- @danielhjz helps to fix the implicit memory leak problem in `image_nsfw_filter`. 
#590 \r\n","2025-02-28T07:50:44",{"id":239,"version":240,"summary_zh":241,"released_at":242},351176,"v1.2.0","## What's New\r\n* 📚 The DJ doc is refactored and improved, e.g., *RecipeGallery, DeveloperGuide, DistributedProcess, DJ-related Competitions, typos and bad links*\r\n* 🔎 More unit-tests added.\r\n* 🎛 The data pre-split and export are improved.\r\n* 🔮 A new data selection method, DaaR, is proposed. See [Diversity as a Reward: Fine-Tuning LLMs on a Mixture of Domain-Undetermined Data](https:\u002F\u002Fwww.arxiv.org\u002Fabs\u002F2502.04380).\r\n\r\n## Detailed PRs\r\n* fix export error when export_stats columns is null in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F557\r\n* Resplit input dataset in ray mode in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F549\r\n* Refactor and improve docs for RecipeGallery, DeveloperGuide, DistributedProcess and DJ-related Competitions in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F561\r\n* Resolve most skipped unit-tests in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F559\r\n* fix translation error in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F562\r\n* Add unittest for ray text dedup in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F540\r\n* [Typo] correct a small typo in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F563\r\n* update the 2.0 paper link & the DaaR news in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F566\r\n* Fix typos in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F571\r\n* Optimization for sdxl_prompt2prompt_mapper dependency importing in https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F570\r\n* Fix typos in 
https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F572\r\n\r\n## Acknowledgment \r\n* @liuyuhanalex @co63oc made their first PRs\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fcompare\u002Fv1.1.0...v1.2.0","2025-02-14T09:40:37",{"id":244,"version":245,"summary_zh":246,"released_at":247},351177,"v1.1.0","# Major Updates\r\n- 🧪 Users can now run Ray-based distributed data processing following the newly added docs. #523\r\n- 🧪 The DJ-Cookbook has gathered numerous high-quality data processing recipes from various vertical fields, and the related documents have been updated on the homepage. #542\r\n- 💥 Change Task mode to Actor mode for Ray deduplication, allowing users to use these operators without installing Redis. #526\r\n- 🚀 Append a log summary of warnings and errors at the end of each run to make them recognizable under the sample fault-tolerance mechanism. #534\r\n- 🚀 Automatically update relevant documents when adding OPs to reduce the development burden. #527\r\n- 🛝 Add usability tags for OPs: \r\n    - `alpha` tag for OPs in which only the basic OP implementations are finished;\r\n    - `beta` tag for OPs in which unittests are added based on the `alpha` version;\r\n    - `stable` tag for OPs in which OP optimizations related to DJ (e.g. model management, batched processing, OP fusion, ...) are added based on the `beta` version.\r\n\r\n\r\n# New OPs\r\n- `image_segment_mapper`: Perform segment-anything on images and return the bounding boxes. #550\r\n- `mllm_mapper`: Mapper to use MLLMs to generate texts for images. #550\r\n- `sdxl_prompt2prompt_mapper`: Use the generative model SDXL and the image editing technique Prompt-to-Prompt to generate pairs of similar images. #550\r\n- `sentence_augmentation_mapper`: Augment sentences using LLMs. #550\r\n- `text_pair_similarity_filter`: Filter samples according to the similarity score between the text pair. 
#550\r\n\r\n# Bugs Fixed\r\n- Add a global `skip_op_error` param to enable fault tolerance when executing the Data-Juicer analyzer and executor, while disabling it for unit tests. #528\r\n- Fix the model force-download bug. #529\r\n- Fix an `IndexError` when the number of samples in the result dataset is less than the number of workers while saving the dataset to disk. #536\r\n- Fix the missing field meta tag in ray mode. #538\r\n- Update `max_tokens` or `max_new_tokens` for vllm-based OPs to avoid overly short generations. #544\r\n- Fix a bug in the role-playing data generation demo. #545\r\n\r\n# Others\r\n- Enhance unit tests for API-calling OPs. #528\r\n- Remove sandbox requirements installation from the Dockerfile. #530\r\n- Update the `datasource`-related APIs to be compatible with the latest version of Ray. #532\r\n- Limit the number of generated QA pairs for each text in `generate_qa_from_text_mapper`. #541\r\n- Update docs in preparation for the DJ2.0 release. #542\r\n- Update to a quicker CDN link for the architecture figure. #543\r\n- Add a video demo for role-playing data generation. #545\r\n- Optimize OP docs for global textual search. #552\r\n- Use a more stable and faster translator than Google Translate for automatic OP doc building. #554\r\n\r\n## Acknowledgement\r\n* @Qirui-jiao made great contributions to enrich the Data-Juicer OP pool. #550\r\n\r\n","2025-01-17T09:46:54",{"id":249,"version":250,"summary_zh":251,"released_at":252},351178,"v1.0.3","# Major Updates\r\n- 💥 Support a **Ray-based MinHashLSH deduplicator**, which implements a **multi-process union-find set** based on Ray Actors and the [BTS algorithm](https:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F10598116) to complete equivalence-class merging. #502 \r\n- 💥 Support **post-tuning dataset formats** from LLaMA-Factory and ModelScope-Swift.\r\n  - Data-Juicer chooses the Query-Response format as the intermediate format for post-tuning datasets. #514\r\n  - Refine the overall intermediate format of Data-Juicer to support various dataset formats better. 
(`meta`, `stats`) #514 #518 \r\n  - Provide several format conversion tools for converting to the Data-Juicer format and vice versa. #514 \r\n- 🚀 Add **10 more post-tuning OPs** to better process post-tuning datasets. They are listed in detail in the New OPs section below. #513\r\n- 🚀 Support **Ray Actor mode** for **GPU-based OPs**. #511 \r\n\r\n# New OPs\r\n## Post-tuning OPs for fine-grained analysis of dialog data. #513 \r\n### Mapper\r\n- `dialog_intent_detection_mapper`: Mapper to generate the user's intent labels in feedback dialog data.\r\n- `dialog_sentiment_detection_mapper`: Mapper to generate the user's sentiment labels in feedback dialog data.\r\n- `dialog_sentiment_intensity_mapper`: Mapper to predict the user's sentiment intensity (from -5 to 5 in the default prompt) in feedback dialog data.\r\n- `dialog_topic_detection_mapper`: Mapper to generate the user's topic labels in feedback dialog data.\r\n- `query_intent_detection_mapper`: Mapper to predict the user's intent label in a query.\r\n- `query_sentiment_detection_mapper`: Mapper to predict the user's sentiment label ('negative', 'neutral' or 'positive') in a query.\r\n- `query_topic_detection_mapper`: Mapper to predict the user's topic label in a query.\r\n### Aggregator\r\n- `meta_tags_aggregator`: Merge similar meta tags into one tag.\r\n### Selector\r\n- `tags_specified_field_selector`: Select samples based on the tags of a specified field.\r\n### Grouper\r\n- `naive_reverse_grouper`: Split a batched sample into individual samples.\r\n\r\n# Bugs Fixed\r\n- Fix wrong argument passing in `generate_qa_from_example_mapper`. #517 \r\n- Update the out-of-date DingDing QR code on the main page. #513 \r\n\r\n## Acknowledgement\r\n* @jackylee-ch made their first contribution to help fix several invalid links in the documentation. 
#521\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fcompare\u002Fv1.0.2...v1.0.3","2025-01-03T10:59:11",{"id":254,"version":255,"summary_zh":256,"released_at":257},351179,"v1.0.2","## Major Updates\r\n- Added more mapper\u002Fgrouper\u002Faggregator OPs for post-tuning scenarios.\r\n- Optimized distributed-mode performance and usability with more automatic features.\r\n\r\n## DJ-Operators\r\n* `extract_support_text_mapper`, `relation_identity_mapper`, `python_file_mapper`, https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F500\r\n* `naive_grouper`, `key_value_grouper`, https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F500\r\n* `nested_aggregator`, `entity_attribute_aggregator`, `most_relavant_entities_aggregator`, https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F500\r\n* `video_extract_frames_mapper`, https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F507\r\n\r\n## Performance\r\n* Optimize ray mode performance, https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F442\r\n* Patch for the Performance Benchmark in CI\u002FCD workflows, https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F506\r\n* DJ Ray mode supports streaming loading of `jsonl` files, https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F515\r\n\r\n## Usability and Analysis\r\n* support dj-install at the recipe level, https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F508\r\n* support dj-analyze with --auto mode, https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F512\r\n* support OP-wise automatic insight mining, https:\u002F\u002Fgithub.com\u002Fmodelscope\u002Fdata-juicer\u002Fpull\u002F516\r\n\r\n\r\n## Acknowledgment\r\nThanks to Data-Juicer users and contributors for their helpful feedback, issues and 
PRs!","2024-12-20T12:15:17",{"id":259,"version":260,"summary_zh":261,"released_at":262},351180,"v1.0.1","# Major Updates\r\n+ 🚀 Supports automatically ordering operators from fastest to slowest based on their measured execution speed, and also supports automatically tuning the operator batch size according to the execution speed. #464\r\n+ 🚀 **[UnitTest]** Performance benchmark for efficiency tests of 4 modalities. Reports will be uploaded to an internal wandb server. #483\r\n+ 💥 Added some useful OPs, including ones for constructing DPO training data, and a lightweight user-customizable OP interface. See more details below~ #491 #492 #493\r\n\r\n# OPs\r\n## Text OPs\r\n+ `pair_preference_mapper`: Mapper to construct preference answers for QA pairs. #491\r\n## Script OPs\r\n+ `python_lambda_mapper`: Mapper for executing customized Python lambda functions on data samples. #492\r\n+ `python_file_mapper`: Mapper for executing customized Python functions on data samples. #493\r\n\r\n# Bugs Fixed\r\n- Add an argument to control whether to enable the `Monitor` for data processing. It is True by default. #483\r\n- For the mp start method of the monitor, set it to \"spawn\" for Windows systems and \"fork\" for others. #483\r\n- Update the transformers version to >=4.47.0 to avoid the \"shape mismatch\" bug in the older version 4.46.3. #483\r\n- Fix the logic errors in Turbo acceleration and batch processing, and ensure that map and filter are consistent in this part of the logic. #504\r\n\r\n# Others\r\n- Pin the PyAV version to prevent inconsistent updates. #504\r\n- Skip some unit tests for audio OPs to avoid lazy_loader failure during multiprocessing. #503\r\n- Remove unnecessary UNFORKABLE marks for some OPs. #491\r\n- Refine the Docker image building: add a new self-hosted runner for Docker image building, optimize the logic for automatic image building on release, and change the default full image to a GPU-version image. 
#494 #501\r\n\r\n# Acknowledgment\r\nHere we thank public contributors for their PRs and issues to make Data-Juicer better!","2024-12-06T09:09:47"]