[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-NVIDIA-NeMo--Curator":3,"tool-NVIDIA-NeMo--Curator":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":75,"owner_avatar_url":76,"owner_bio":77,"owner_company":78,"owner_location":78,"owner_email":78,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":81,"stars":108,"forks":109,"last_commit_at":110,"license":111,"difficulty_score":10,"env_os":112,"env_gpu":113,"env_ram":114,"env_deps":115,"category_tags":127,"github_topics":128,"view_count":23,"oss_zip_url":78,"oss_zip_packed_at":78,"status":16,"created_at":148,"updated_at":149,"faqs":150,"releases":179},3110,"NVIDIA-NeMo\u002FCurator","Curator","Scalable data pre processing and curation toolkit for LLMs","Curator 是 NVIDIA NeMo 套件中一款专为大模型训练打造的可扩展数据预处理与精选工具。它核心解决了 AI 开发中“垃圾进，垃圾出”的痛点，帮助开发者高效清洗海量原始数据，确保用于训练文本、图像、视频及音频的数据具备高质量、多样性和安全性，从而加速构建更优秀的 AI 模型。\n\n这款工具特别适合从事大语言模型（LLM）或多模态模型研发的研究人员与工程师使用。无论是需要在笔记本电脑上验证想法，还是要在多节点集群上处理 PB 级数据，Curator 都能通过模块化流水线轻松应对。其独特的技术亮点在于全面的 GPU 加速能力：在文本处理上，支持基于 MinHash 的模糊去重和语义去重，并提供 30 多种启发式过滤规则；在视觉与听觉领域，它能自动执行美学评分、不适宜内容（NSFW）检测、场景切割以及语音转录质量评估。借助 Curator，团队可以将原本耗时数周的数据清洗工作大幅缩短，让精力更专注于模型架构与创新。","\u003Cdiv align=\"center\">\n\n  \u003Ca 
href=\"https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FCurator\u002Fblob\u002Fmain\u002FLICENSE\">![https:\u002F\u002Fpypi.org\u002Fproject\u002Fnemo-curator](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002FNVIDIA-NeMo\u002FCurator)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fcodecov.io\u002Fgithub\u002FNVIDIA-NeMo\u002FCurator\">![codecov](https:\u002F\u002Fcodecov.io\u002Fgithub\u002FNVIDIA-NeMo\u002FCurator\u002Fgraph\u002Fbadge.svg)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fnemo-curator\u002F\">![https:\u002F\u002Fpypi.org\u002Fproject\u002Fnemo-curator\u002F](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fnemo-curator.svg)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FCurator\u002Fgraphs\u002Fcontributors\">![NVIDIA-NeMo\u002FCurator](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002FNVIDIA-NeMo\u002FCurator)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FCurator\u002Freleases\">![https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FCurator\u002Freleases](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002FNVIDIA-NeMo\u002FCurator)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fnemo-curator\u002F\">![https:\u002F\u002Fgithub.com\u002FNaereen\u002Fbadges\u002F](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA-NeMo_Curator_readme_3a8caf6da1d4.png)\u003C\u002Fa>\n\n\u003C\u002Fdiv>\n\n# NVIDIA NeMo Curator\n\n**GPU-accelerated data curation for training better AI models, faster.** Scale from laptop to multi-node clusters with modular pipelines for text, images, video, and audio.\n\n> *Part of the [NVIDIA NeMo](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fai-data-science\u002Fproducts\u002Fnemo\u002F) software suite for managing the AI agent lifecycle.*\n\n## What You Can Do\n\n| Modality | Key Capabilities | Get Started 
|\n|----------|-----------------|-------------|\n| **Text** | Deduplication • Classification • Quality Filtering • Language Detection | [Text Guide](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fget-started\u002Ftext.html) |\n| **Image** | Aesthetic Filtering • NSFW Detection • Embedding Generation • Deduplication | [Image Guide](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fget-started\u002Fimage.html) |\n| **Video** | Scene Detection • Clip Extraction • Motion Filtering • Deduplication | [Video Guide](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fget-started\u002Fvideo.html) |\n| **Audio** | ASR Transcription • Quality Assessment • WER Filtering | [Audio Guide](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fget-started\u002Faudio.html) |\n\n## Quick Start\n\n```bash\n# Install for your modality\nuv pip install \"nemo-curator[text_cuda12]\"\n\n# Run the quickstart example\npython tutorials\u002Fquickstart.py\n```\n\n**Full setup:** [Installation Guide](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fadmin\u002Finstallation.html) • [Docker](https:\u002F\u002Fcatalog.ngc.nvidia.com\u002Forgs\u002Fnvidia\u002Fcontainers\u002Fnemo-curator) • [Tutorials](tutorials\u002F)\n\n---\n\n## Features by Modality\n\n### Text Curation\n\nProcess and curate high-quality text datasets for large language model (LLM) training with multilingual support.\n\n| Category | Features | Documentation |\n|----------|----------|---------------|\n| **Data Sources** | Common Crawl • Wikipedia • ArXiv • Custom datasets | [Load Data](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-text\u002Fload-data\u002Findex.html) |\n| **Quality Filtering** | 30+ heuristic filters • fastText classification • GPU-accelerated classifiers for domain, quality, safety, and content type | [Quality 
Assessment](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-text\u002Fprocess-data\u002Fquality-assessment\u002Fheuristic.html) |\n| **Deduplication** | Exact • Fuzzy (MinHash LSH) • Semantic (GPU-accelerated) | [Deduplication](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-text\u002Fprocess-data\u002Fdeduplication\u002Findex.html) |\n| **Processing** | Text cleaning • Language identification | [Content Processing](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-text\u002Fprocess-data\u002Fcontent-processing\u002Ftext-cleaning.html) |\n\n---\n\n### Image Curation\n\nCurate large-scale image datasets for vision language models (VLMs) and generative AI training.\n\n| Category | Features | Documentation |\n|----------|----------|---------------|\n| **Data Loading** | WebDataset format • Large-scale image-text pairs | [Load Data](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-images\u002Fload-data\u002Findex.html) |\n| **Embeddings** | CLIP embeddings for semantic analysis | [Embeddings](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-images\u002Fprocess-data\u002Fembeddings\u002Findex.html) |\n| **Filtering** | Aesthetic quality scoring • NSFW detection | [Filters](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-images\u002Fprocess-data\u002Ffilters\u002Findex.html) |\n\n---\n\n### Video Curation\n\nProcess large-scale video corpora with distributed, GPU-accelerated pipelines for world foundation models (WFMs).\n\n| Category | Features | Documentation |\n|----------|----------|---------------|\n| **Data Loading** | Local paths • S3-compatible storage • HTTP(S) URLs | [Load Data](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-video\u002Fload-data\u002Findex.html) |\n| **Clipping** | Fixed-stride splitting • Scene-change detection 
(TransNetV2) | [Clipping](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-video\u002Fprocess-data\u002Fclipping.html) |\n| **Processing** | GPU H.264 encoding • Frame extraction • Motion filtering • Aesthetic filtering | [Processing](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-video\u002Fprocess-data\u002Ffiltering.html) |\n| **Embeddings** | Cosmos-Embed1 for clip-level embeddings | [Embeddings](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-video\u002Fprocess-data\u002Fembeddings.html) |\n| **Deduplication** | K-means clustering • Pairwise similarity for near-duplicates | [Deduplication](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-video\u002Fprocess-data\u002Fdedup.html) |\n\n---\n\n### Audio Curation\n\nPrepare high-quality speech datasets for automatic speech recognition (ASR) and multimodal AI training.\n\n| Category | Features | Documentation |\n|----------|----------|---------------|\n| **Data Loading** | Local files • Custom manifests • Public datasets (FLEURS) | [Load Data](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-audio\u002Fload-data\u002Findex.html) |\n| **ASR Processing** | NeMo Framework pretrained models • Automatic transcription | [ASR Inference](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-audio\u002Fprocess-data\u002Fasr-inference\u002Findex.html) |\n| **Quality Assessment** | Word Error Rate (WER) calculation • Duration analysis • Quality-based filtering | [Quality Assessment](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-audio\u002Fprocess-data\u002Fquality-assessment\u002Findex.html) |\n| **Integration** | Text curation workflow integration for multimodal pipelines | [Text 
Integration](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-audio\u002Fprocess-data\u002Ftext-integration\u002Findex.html) |\n\n---\n\n## Why NeMo Curator?\n\n### Performance at Scale\n\nNeMo Curator leverages NVIDIA RAPIDS™ libraries such as cuDF, cuML, and cuGraph along with Ray to scale workloads across multi-node, multi-GPU environments.\n\n**Proven Results:**\n- **16× faster** fuzzy deduplication on 8 TB RedPajama v2 (1.78 trillion tokens)\n- **40% lower** total cost of ownership (TCO) compared to CPU-based alternatives\n- **Near-linear scaling** from one to four H100 80 GB nodes (2.05 hrs → 0.50 hrs)\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA-NeMo_Curator_readme_7805c72e5e46.png\" alt=\"Performance benchmarks showing 16x speed improvement, 40% cost savings, and near-linear scaling\" width=\"700\"\u002F>\n\u003C\u002Fp>\n\n### Quality Improvements\n\nData curation modules measurably improve model performance. 
In ablation studies using a 357M-parameter GPT model trained on curated Common Crawl data:\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA-NeMo_Curator_readme_6bfca0491820.png\" alt=\"Model accuracy improvements across curation pipeline stages\" width=\"700\"\u002F>\n\u003C\u002Fp>\n\n**Results:** Progressive improvements in zero-shot downstream task performance through text cleaning, deduplication, and quality filtering stages.\n\n---\n\n## Learn More\n\n| Resource | Links |\n|----------|-------|\n| **Documentation** | [Main Docs](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002F) • [API Reference](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fapidocs\u002Findex.html) • [Concepts](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fabout\u002Fconcepts\u002Findex.html) |\n| **Tutorials** | [Text](tutorials\u002Ftext\u002F) • [Image](tutorials\u002Fimage\u002F) • [Video](tutorials\u002Fvideo\u002F) • [Audio](tutorials\u002Faudio\u002F) |\n| **Deployment** | [Installation](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fadmin\u002Finstallation.html) • [Infrastructure](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Freference\u002Finfrastructure\u002Findex.html) |\n| **Community** | [GitHub Discussions](https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FCurator\u002Fdiscussions) • [Issues](https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FCurator\u002Fissues) |\n\n---\n\n## Contribute\n\nWe welcome community contributions! 
Please refer to [CONTRIBUTING.md](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo\u002Fblob\u002Fstable\u002FCONTRIBUTING.md) for guidelines.\n","\u003Cdiv align=\"center\">\n\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FCurator\u002Fblob\u002Fmain\u002FLICENSE\">![https:\u002F\u002Fpypi.org\u002Fproject\u002Fnemo-curator](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002FNVIDIA-NeMo\u002FCurator)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fcodecov.io\u002Fgithub\u002FNVIDIA-NeMo\u002FCurator\">![codecov](https:\u002F\u002Fcodecov.io\u002Fgithub\u002FNVIDIA-NeMo\u002FCurator\u002Fgraph\u002Fbadge.svg)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fnemo-curator\u002F\">![https:\u002F\u002Fpypi.org\u002Fproject\u002Fnemo-curator\u002F](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fnemo-curator.svg)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FCurator\u002Fgraphs\u002Fcontributors\">![NVIDIA-NeMo\u002FCurator](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002FNVIDIA-NeMo\u002FCurator)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FCurator\u002Freleases\">![https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FCurator\u002Freleases](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002FNVIDIA-NeMo\u002FCurator)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fnemo-curator\u002F\">![https:\u002F\u002Fgithub.com\u002FNaereen\u002Fbadges\u002F](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA-NeMo_Curator_readme_3a8caf6da1d4.png)\u003C\u002Fa>\n\n\u003C\u002Fdiv>\n\n# NVIDIA NeMo Curator\n\n**GPU加速的数据整理工具，助您更快、更高效地训练优质AI模型。** 通过模块化的文本、图像、视频和音频处理流水线，可从笔记本电脑扩展至多节点集群。\n\n> *作为[NVIDIA NeMo](https:\u002F\u002Fwww.nvidia.com\u002Fen-us\u002Fai-data-science\u002Fproducts\u002Fnemo\u002F)软件套件的一部分，用于管理AI智能体的生命周期。*\n\n## 您可以实现的功能\n\n| 模态 | 核心能力 | 入门指南 
|\n|----------|-----------------|-------------|\n| **文本** | 去重 • 分类 • 质量过滤 • 语言检测 | [文本指南](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fget-started\u002Ftext.html) |\n| **图像** | 审美筛选 • 不适宜内容检测 • 嵌入生成 • 去重 | [图像指南](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fget-started\u002Fimage.html) |\n| **视频** | 场景检测 • 片段提取 • 运动过滤 • 去重 | [视频指南](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fget-started\u002Fvideo.html) |\n| **音频** | ASR转录 • 质量评估 • WER过滤 | [音频指南](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fget-started\u002Faudio.html) |\n\n## 快速入门\n\n```bash\n# 根据您的模态安装\nuv pip install \"nemo-curator[text_cuda12]\"\n\n# 运行快速入门示例\npython tutorials\u002Fquickstart.py\n```\n\n**完整设置：** [安装指南](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fadmin\u002Finstallation.html) • [Docker](https:\u002F\u002Fcatalog.ngc.nvidia.com\u002Forgs\u002Fnvidia\u002Fcontainers\u002Fnemo-curator) • [教程](tutorials\u002F)\n\n---\n\n## 各模态功能概览\n\n### 文本数据整理\n\n利用多语言支持，处理并整理高质量文本数据集，用于大型语言模型（LLM）的训练。\n\n| 类别 | 功能 | 文档 |\n|----------|----------|---------------|\n| **数据源** | Common Crawl • Wikipedia • ArXiv • 自定义数据集 | [加载数据](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-text\u002Fload-data\u002Findex.html) |\n| **质量过滤** | 30余种启发式过滤器 • fastText分类 • GPU加速的领域、质量、安全及内容类型分类器 | [质量评估](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-text\u002Fprocess-data\u002Fquality-assessment\u002Fheuristic.html) |\n| **去重** | 精确去重 • 模糊去重（MinHash LSH）• 语义去重（GPU加速） | [去重](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-text\u002Fprocess-data\u002Fdeduplication\u002Findex.html) |\n| **预处理** | 文本清洗 • 语言识别 | [内容处理](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-text\u002Fprocess-data\u002Fcontent-processing\u002Ftext-cleaning.html) |\n\n---\n\n### 
图像数据整理\n\n为视觉语言模型（VLM）及生成式AI训练整理大规模图像数据集。\n\n| 类别 | 功能 | 文档 |\n|----------|----------|---------------|\n| **数据加载** | WebDataset格式 • 大规模图文对 | [加载数据](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-images\u002Fload-data\u002Findex.html) |\n| **嵌入** | CLIP嵌入用于语义分析 | [嵌入](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-images\u002Fprocess-data\u002Fembeddings\u002Findex.html) |\n| **筛选** | 审美质量评分 • NSFW检测 | [筛选](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-images\u002Fprocess-data\u002Ffilters\u002Findex.html) |\n\n---\n\n### 视频数据整理\n\n使用分布式、GPU加速的流水线处理大规模视频语料库，服务于世界基础模型（WFM）。\n\n| 类别 | 功能 | 文档 |\n|----------|----------|---------------|\n| **数据加载** | 本地路径 • S3兼容存储 • HTTP(S) URL | [加载数据](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-video\u002Fload-data\u002Findex.html) |\n| **片段切割** | 固定步长分割 • 场景变化检测（TransNetV2） | [片段切割](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-video\u002Fprocess-data\u002Fclipping.html) |\n| **处理** | GPU H.264编码 • 帧提取 • 运动过滤 • 审美筛选 | [处理](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-video\u002Fprocess-data\u002Ffiltering.html) |\n| **嵌入** | Cosmos-Embed1用于片段级嵌入 | [嵌入](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-video\u002Fprocess-data\u002Fembeddings.html) |\n| **去重** | K-means聚类 • 成对相似度用于近似重复检测 | [去重](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-video\u002Fprocess-data\u002Fdedup.html) |\n\n---\n\n### 音频数据整理\n\n为自动语音识别（ASR）及多模态AI训练准备高质量语音数据集。\n\n| 类别 | 功能 | 文档 |\n|----------|----------|---------------|\n| **数据加载** | 本地文件 • 自定义清单 • 公共数据集（FLEURS） | [加载数据](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-audio\u002Fload-data\u002Findex.html) |\n| **ASR处理** | NeMo框架预训练模型 • 自动转录 | 
[ASR推理](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-audio\u002Fprocess-data\u002Fasr-inference\u002Findex.html) |\n| **质量评估** | 计算词错误率（WER）• 时长分析 • 基于质量的过滤 | [质量评估](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-audio\u002Fprocess-data\u002Fquality-assessment\u002Findex.html) |\n| **集成** | 与文本处理流程集成，构建多模态管道 | [文本集成](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-audio\u002Fprocess-data\u002Ftext-integration\u002Findex.html) |\n\n---\n\n## 为什么选择NeMo Curator？\n\n### 大规模性能\n\nNeMo Curator 利用 NVIDIA RAPIDS™ 库（如 cuDF、cuML 和 cuGraph）以及 Ray，在多节点、多 GPU 环境中实现工作负载的横向扩展。\n\n**经过验证的结果：**\n- 在 8 TB 的 RedPajama v2 数据集（1.78 万亿个标记）上，模糊去重速度提升 **16 倍**\n- 总拥有成本（TCO）相比基于 CPU 的方案降低 **40%**\n- 从单节点到四节点 H100 80 GB 配置时，性能呈现 **近线性扩展**（2.05 小时 → 0.50 小时）\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA-NeMo_Curator_readme_7805c72e5e46.png\" alt=\"性能基准测试显示 16 倍加速、40% 成本节约及近线性扩展\" width=\"700\"\u002F>\n\u003C\u002Fp>\n\n### 质量提升\n\n数据编排模块能够显著提升模型性能。在使用经编排后的 Common Crawl 数据训练的 3.57 亿参数 GPT 模型进行的消融实验中：\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA-NeMo_Curator_readme_6bfca0491820.png\" alt=\"不同编排阶段对模型准确率的逐步提升\" width=\"700\"\u002F>\n\u003C\u002Fp>\n\n**结果：** 通过文本清洗、去重和质量过滤等环节，零样本下游任务的性能得到持续提升。\n\n---\n\n## 了解更多\n\n| 资源 | 链接 |\n|----------|-------|\n| **文档** | [主文档](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002F) • [API 参考](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fapidocs\u002Findex.html) • [概念](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fabout\u002Fconcepts\u002Findex.html) |\n| **教程** | [文本](tutorials\u002Ftext\u002F) • [图像](tutorials\u002Fimage\u002F) • [视频](tutorials\u002Fvideo\u002F) • [音频](tutorials\u002Faudio\u002F) |\n| **部署** | 
[安装](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fadmin\u002Finstallation.html) • [基础设施](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Freference\u002Finfrastructure\u002Findex.html) |\n| **社区** | [GitHub 讨论区](https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FCurator\u002Fdiscussions) • [问题](https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FCurator\u002Fissues) |\n\n---\n\n## 参与贡献\n\n我们欢迎社区贡献！请参阅 [CONTRIBUTING.md](https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo\u002Fblob\u002Fstable\u002FCONTRIBUTING.md) 获取相关指南。","# NVIDIA NeMo Curator 快速上手指南\n\nNVIDIA NeMo Curator 是一款 GPU 加速的数据清洗工具，旨在帮助用户快速构建高质量的文本、图像、视频和音频数据集，以训练更优秀的 AI 模型。它支持从单机笔记本到多节点集群的弹性扩展。\n\n## 1. 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04\u002F22.04)。\n*   **Python 版本**: 3.9 - 3.11。\n*   **GPU 硬件**: 支持 CUDA 的 NVIDIA GPU（推荐使用 Ampere 架构或更新版本，如 A100\u002FH100，以获得最佳性能）。\n*   **CUDA 驱动**: 已安装与 GPU 匹配的 NVIDIA 驱动及 CUDA Toolkit（本指南示例基于 CUDA 12.x）。\n*   **依赖管理**: 推荐使用 `uv` 或 `pip` 进行包管理。\n\n> **国内加速建议**：\n> 由于 PyPI 源在国外访问较慢，建议配置国内镜像源（如清华源或阿里源）以加速安装。\n> ```bash\n> export PIP_INDEX_URL=https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> # 如果使用 uv\n> export UV_INDEX_URL=https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n## 2. 安装步骤\n\nNeMo Curator 采用模块化安装方式，您可以根据需要处理的数据类型（文本、图像、视频、音频）安装对应的依赖。\n\n### 使用 uv 安装（推荐）\n\n`uv` 是一个极速 Python 包安装器。以下命令以安装 **文本处理（Text）** 模块且适配 **CUDA 12** 为例：\n\n```bash\nuv pip install \"nemo-curator[text_cuda12]\"\n```\n\n如需其他模态，可替换括号内的标签：\n*   图像：`image_cuda12`\n*   视频：`video_cuda12`\n*   音频：`audio_cuda12`\n*   全部功能：`all_cuda12`\n\n### 使用 pip 安装\n\n如果您未安装 `uv`，可使用标准 pip 命令：\n\n```bash\npip install \"nemo-curator[text_cuda12]\"\n```\n\n> **注意**：完整的生产环境部署（如多节点集群或 Docker 容器）请参考官方 [安装指南](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fadmin\u002Finstallation.html)。\n\n## 3. 
基本使用\n\n安装完成后，您可以立即运行官方提供的快速入门示例，以验证环境并了解基本工作流程。\n\n### 运行快速入门示例\n\n在项目目录中执行以下命令，该脚本将演示如何加载数据并应用基础的清洗流程：\n\n```bash\npython tutorials\u002Fquickstart.py\n```\n\n### 核心功能概览\n\n根据您的数据类型，Curator 提供了针对性的处理能力：\n\n| 数据类型 | 核心能力 | 典型应用场景 |\n| :--- | :--- | :--- |\n| **文本 (Text)** | 去重 (精确\u002F模糊\u002F语义)、质量过滤、语言检测、内容分类 | LLM 预训练数据清洗 |\n| **图像 (Image)** | 美学评分、NSFW 检测、CLIP 嵌入生成、去重 | 视觉语言模型 (VLM) 数据集构建 |\n| **视频 (Video)** | 场景检测、片段提取、运动过滤、Cosmos-Embed1 嵌入 | 世界基础模型 (WFM) 视频语料处理 |\n| **音频 (Audio)** | ASR 自动转录、WER 质量评估、时长分析 | 语音识别 (ASR) 数据集准备 |\n\n### 下一步\n\n*   **查看文档**: [完整文档](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002F)\n*   **更多教程**: 访问 `tutorials\u002F` 目录获取针对文本、图像、视频和音频的详细代码示例。\n*   **Docker 部署**: 可通过 [NGC Catalog](https:\u002F\u002Fcatalog.ngc.nvidia.com\u002Forgs\u002Fnvidia\u002Fcontainers\u002Fnemo-curator) 拉取预构建的 Docker 镜像。","某自动驾驶初创公司的算法团队正急需构建一个包含百万级路标与交通场景的高质量图像数据集，以训练下一代视觉语言模型（VLM）。\n\n### 没有 Curator 时\n- 数据清洗完全依赖 CPU 串行处理，筛选千万张图像需耗时数周，严重拖慢模型迭代节奏。\n- 缺乏高效的美学评分与 NSFW（不适宜内容）检测机制，导致大量模糊、无关或不合规图片混入训练集。\n- 难以识别并剔除语义重复的图像，造成模型过拟合，且在特定场景下的泛化能力显著下降。\n- 手动编写脚本整合多源数据极其繁琐，一旦数据规模扩大，内存溢出频发，工程维护成本高昂。\n\n### 使用 Curator 后\n- 利用 GPU 加速的并行流水线，将原本数周的数据预处理时间压缩至数小时，实现从笔记本到多节点集群的无缝扩展。\n- 一键调用内置的美学过滤与 NSFW 检测模块，自动剔除低质与违规样本，确保输入数据的纯净度与安全性。\n- 通过基于 CLIP 嵌入的语义去重功能，精准移除冗余图像，显著提升模型在复杂路况下的识别准确率。\n- 采用模块化处理流程，轻松加载 WebDataset 格式的大规模图文对，稳定支撑亿级数据量的自动化策展。\n\nCurator 通过 GPU 
加速的全流程数据策展能力，将原本耗时费力的数据清洗工作转化为高效自动化的标准工序，让团队能专注于核心模型架构的创新。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNVIDIA-NeMo_Curator_64eefe5d.png","NVIDIA-NeMo","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FNVIDIA-NeMo_ef2128b9.png","",null,"https:\u002F\u002Fnvidia.com\u002F","https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo",[82,86,90,94,97,101,104],{"name":83,"color":84,"percentage":85},"Python","#3572A5",83,{"name":87,"color":88,"percentage":89},"MDX","#fcb32c",16.4,{"name":91,"color":92,"percentage":93},"CSS","#663399",0.2,{"name":95,"color":96,"percentage":93},"TypeScript","#3178c6",{"name":98,"color":99,"percentage":100},"Shell","#89e051",0.1,{"name":102,"color":103,"percentage":100},"Dockerfile","#384d54",{"name":105,"color":106,"percentage":107},"Makefile","#427819",0,1500,249,"2026-04-03T23:43:43","Apache-2.0","Linux","必需 NVIDIA GPU。支持多节点多卡扩展（示例提及 H100 80GB）。依赖 CUDA 环境（安装示例提及 'text_cuda12'，暗示支持 CUDA 12.x），需安装 NVIDIA RAPIDS 库 (cuDF, cuML, cuGraph)。","未说明（取决于数据集规模，支持从笔记本到多节点集群扩展）",{"notes":116,"python":117,"dependencies":118},"该工具专为大规模数据清洗设计，利用 GPU 加速可实现比 CPU 方案快 16 倍的性能。支持文本、图像、视频和音频四种模态。可通过 Docker 容器部署或从 PyPI 安装特定模态版本（如 nemo-curator[text_cuda12]）。支持从单台笔记本电脑扩展到多节点集群环境。","3.9+ (根据 PyPI badge 推断，具体版本需参考官方安装指南，文中未明确写出数字但展示了 pyversions 徽章)",[119,120,121,122,123,124,125,126],"nemo-curator","RAPIDS (cuDF, cuML, cuGraph)","Ray","fastText","TransNetV2","CLIP","Cosmos-Embed1","NeMo Framework",[51,13,26],[129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147],"data-curation","llm","data","data-prep","data-preparation","data-processing","data-processing-pipelines","data-quality","datacuration","datarecipes","deduplication","fast-data-processing","fine-tuning","large-language-models","large-scale-data-processing","llmapps","python","llm-data-quality","semantic-deduplication","2026-03-27T02:49:30.150509","2026-04-06T05:17:02.219907",[151,156,161,166,170,175],{"id":152,"question_zh":153,"answer_zh":154,"source_url":155},14323,"安装 
nemo-curator 时遇到 comment_parser 包构建失败或依赖错误怎么办？","这是一个已知的依赖问题。虽然维护者建议将其添加为依赖项，但目前最可靠的解决方法是手动确保环境满足以下要求：\n1. Python 3.10 或更高版本。\n2. 确保安装了 `packaging>=22.0`。\n3. 操作系统推荐 Ubuntu 22.04\u002F20.04。\n4. 如果使用 GPU，需要 Volta™ 或更高架构（计算能力 7.0+）以及 CUDA 12 或以上版本。\n\n如果仍然失败，建议检查是否缺少系统级构建工具，并参考官方文档中的 `docs\u002Fuser-guide\u002Fimage\u002Fgettingstarted.rst` 更新您的环境配置。","https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FCurator\u002Fissues\u002F504",{"id":157,"question_zh":158,"answer_zh":159,"source_url":160},14324,"运行 FuzzyDeduplicationWorkflow 时报错 'Failed to look up actor' 或 Ray Actor 丢失是什么原因？","这是由于 `create_id_generator_actor` 函数中的 `finally` 块错误地调用了 `ray.shutdown()`，导致在管道执行器访问之前就已经销毁了分离的 ID 生成器 Actor。\n\n解决方案是修改源码 `nemo_curator\u002Fstages\u002Fdeduplication\u002Fid_generator.py`，移除或调整 `finally` 块中的 `ray.shutdown()` 调用，确保 Actor 在工作流完成前保持存活。如果您使用的是容器版本，可能需要等待官方修复或自行构建修补后的镜像。","https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FCurator\u002Fissues\u002F1216",{"id":162,"question_zh":163,"answer_zh":164,"source_url":165},14325,"NeMo Curator 团队推荐使用什么开发工具和 workflow？","根据团队反馈，常用的开发工具包括：\n1. **IDE**: 大多数团队成员使用 Cursor，部分人最近切换到了 Claude Code。\n2. **代码审查**: 团队普遍对 Greptile 感到满意，它常用于 PR 审查以识别关键问题（尽管偶尔会有误报）。\n3. **开发环境**: 推荐使用原生 Linux 机器进行开发。团队成员通常通过 SSH 连接到拥有多张 A100 GPU 和高核数 CPU 的共享远程服务器进行大规模任务处理。","https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FCurator\u002Fissues\u002F1411",{"id":167,"question_zh":168,"answer_zh":169,"source_url":165},14326,"如何在 WSL2 (Windows Subsystem for Linux) 上设置 NeMo Curator？","目前官方主要支持原生 Linux 环境（如 Ubuntu 22.04\u002F20.04）。虽然有用户计划在 WSL2 上进行设置测试，但团队建议使用原生 Linux 机器或通过 SSH 连接远程 Linux 服务器以获得最佳兼容性和性能。如果在 WSL2 上遇到特定问题，建议查阅相关社区 Issue（如 #706）或直接提交新问题以获取针对该环境的特定指导。",{"id":171,"question_zh":172,"answer_zh":173,"source_url":174},14327,"SemDedup 在创建嵌入（embeddings）时抛出 TypeError 异常如何解决？","这通常与环境配置或库版本不匹配有关。如果遇到此错误且复现了官方文档的代码：\n1. 首先尝试升级相关的依赖库（特别是 dask, nemo-curator 及其底层依赖）。\n2. 检查 Python 版本是否符合要求（推荐 3.10+）。\n3. 
如果升级后问题依旧，请详细记录您的硬件设置、容器版本及完整的堆栈跟踪信息并向社区反馈，因为这可能涉及特定环境下的兼容性问题。","https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FCurator\u002Fissues\u002F722",{"id":176,"question_zh":177,"answer_zh":178,"source_url":165},14328,"有哪些适合新手的贡献方向或高优先级任务？","对于想要贡献代码的新手，以下方向被标记为高优先级或适合作为入门：\n1. **修复文件读取错误**：处理各类文件格式读取时的边界情况报错。\n2. **改进通用过滤器**：将教程（tutorials\u002F）文件夹中现有的通用过滤器改进并集成到主库中，而不是仅停留在示例阶段。\n3. **基准测试脚本**：例如为 FastText 过滤器添加基准测试脚本，因为其行为与启发式过滤器不同，需要专门的模型 setup。\n\n参与贡献时，请确保遵循贡献指南，并使用 `git commit -sS` 对提交进行签名和签署。",[180,185,190,195,200,205,210,215,220,225,230,235,240,245,250,254,259,264,269,274],{"id":181,"version":182,"summary_zh":183,"released_at":184},81094,"v1.1.0","### 新特性\n\n- **阶段与流水线基准测试**：支持所有模态（文本、图像、视频、音频）的基准测试\n- **YAML 配置**：声明式流水线配置，提供代码过滤、去重、启发式过滤及 FastText 的预建配置\n- **流水线性能与指标日志记录**：自动跟踪处理时间、吞吐量和资源使用情况；为失败阶段提供详细日志和错误报告\n\n### 改进\n\n- **视频**：移除 InternVideo2；vLLM 0.15.1、FFmpeg 8.0.1\n- **音频**：增强 ASR\u002FWER 文档，提升清单文件的健壮性处理\n- **图像**：优化批大小（batch_size=100，num_threads=16），增加内存指导说明\n- **文本**：改进大规模语义去重的内存管理\n- **去重**：支持 ParquetReader\u002FWriter 使用云存储（S3、GCS、Azure），实现非阻塞 ID 生成，并优化空批次处理逻辑\n\n### 依赖更新\n\n- Transformers 4.55.2、vLLM 0.15.1、FFmpeg 8.0.1\n- 安全补丁：aiohttp、urllib3、python-multipart、setuptools\n\n### 错误修复\n\n- FastText 与 numpy>2 的兼容性问题、NeMo 文档链接、ID 生成器阻塞问题、vLLM 视频 API、Gliner\u002FSDG 教程、语义去重测试可靠性问题\n\n### 基础设施\n\n- 秘密信息检测、Dependabot、增强安装测试、AWS runner 支持、Docker\u002Fuv 优化、Cursor 规则\n\n### 破坏性变更\n\n- **移除 InternVideo2**：请使用 Cosmos-Embed1 进行视频嵌入\n\n### 文档\n\n- 启发式过滤指南、分布式分类器内存指导、安装故障排除、内存管理、AWS 凭证相关说明","2026-02-23T22:04:05",{"id":186,"version":187,"summary_zh":188,"released_at":189},81095,"v1.0.0","本次重大发布标志着架构从 [Dask](https:\u002F\u002Fwww.dask.org\u002F) 向 [Ray](https:\u002F\u002Fwww.ray.io\u002F) 的根本性转变，同时扩展了 NeMo Curator 
with multimodal data curation support, adding new [video](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-video\u002Findex.html) and [audio](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-audio\u002Findex.html) processing capabilities. The refactor provides unified backend processing, better heterogeneous computing support, and enhanced autoscaling for dynamic workloads.\n\n### Installation Updates\n\n- **New Docker container**: Updated Docker infrastructure based on CUDA 12.8.1 and Ubuntu 24.04; available from the [NGC Catalog](https:\u002F\u002Fcatalog.ngc.nvidia.com\u002Forgs\u002Fnvidia\u002Fcontainers\u002Fnemo-curator) (`nvcr.io\u002Fnvidia\u002Fnemo-curator:25.09`)\n- **Dockerfile for building custom images**: Simplified [Dockerfile](https:\u002F\u002Fgithub.com\u002FNVIDIA-NeMo\u002FCurator\u002Fblob\u002Fmain\u002Fdocker\u002FDockerfile) structure with FFmpeg support, so users can build custom containers\n- **UV source installation**: Integrated the UV package manager (v0.8.22) for faster dependency management\n- **PyPI improvements**: Improved PyPI installation with modular extras for specific features:\n\n  | Extra | Install command | Description |\n  |----------|----------|------|\n  | **All modalities** | `nemo-curator[all]` | Full installation with all modalities and GPU support |\n  | **Text curation** | `nemo-curator[text_cuda12]` | GPU-accelerated text processing with RAPIDS |\n  | **Image curation** | `nemo-curator[image_cuda12]` | Image processing with NVIDIA DALI |\n  | **Audio curation** | `nemo-curator[audio_cuda12]` | Speech recognition with NeMo ASR models |\n  | **Video curation** | `nemo-curator[video_cuda12]` | GPU-accelerated video processing |\n  | **Basic GPU** | `nemo-curator[cuda12]` | CUDA utilities without modality-specific dependencies |\n\n  All GPU installations require the NVIDIA PyPI index:\n\n  ```bash\n  uv pip install --extra-index-url https:\u002F\u002Fpypi.nvidia.com nemo-curator[EXTRA]\n  ```\n\n### New Modalities\n\n#### Video\n\nNeMo Curator now supports comprehensive [video data curation](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-video\u002Findex.html) with distributed processing:\n\n- **Video splitting**: Supports [fixed-stride splitting](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-video\u002Fprocess-data\u002Fclipping.html) and [TransNetV2](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-video\u002Fprocess-data\u002Fclipping.html)-based scene-change detection for extracting clips\n- **Semantic deduplication**: Uses [K-means 
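clustering and pairwise similarity computation](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-video\u002Fprocess-data\u002Fdedup.html) to remove near-duplicate video clips\n- **Content filtering**: Provides [motion-based filtering](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-video\u002Fprocess-data\u002Ffiltering.html) and [aesthetic filtering](https:\u002F\u002Fdocs.nvidia.com\u002Fnemo\u002Fcurator\u002Flatest\u002Fcurate-video\u002Fprocess-data\u002Ffiltering.html) to improve video quality\n- **Embedding generation**: Supports the InternVideo2 and Cosmos-Embed1 models","2025-10-01T15:15:09",{"id":191,"version":192,"summary_zh":193,"released_at":194},81096,"v0.9.0","### Major Features and Improvements\n\n- New how-to data recipes (tutorials)\n  - Multimodal DAPT curation with PDF extraction\n  - Llama Nemotron data curation\n  - LLM NIM - PII redaction\n- Performance and code optimizations\n  - Simplified clustering logic: significantly faster semantic deduplication clustering\n  - Removed complex backend-switching logic that caused performance problems\n  - Eliminated expensive length assertions that could cause timeouts on large datasets\n  - Improved GPU utilization during KMeans clustering\n  - Tested on 37 million embeddings (80GB) across 7 GPUs, with substantial performance gains\n\n### Bug Fixes\n\n- FastText download URL fix\n  - Corrected the download URL for the `fasttext` model in the nemotron-cc tutorial\n  - Changed the address from `dl.fbaipublicfiles.com\u002FfastText\u002F` to `dl.fbaipublicfiles.com\u002Ffasttext\u002F`\n  - Ensures the language identification model downloads reliably\n- NeMo Retriever tutorial bug fix\n  - Fixed a lambda function bug in `RetrieverEvalSetGenerator`\n  - Changed the score assignment from `df[\"question\"].apply(lambda: 1)` to `df[\"score\"] = 1`\n- API usage updates\n  - Updated examples and tutorials to use the correct `DocumentDataset` API\n  - Replaced the deprecated `write_to_disk(result, output_dir, output_type=\"parquet\")` with `result.to_parquet(output_dir)`\n  - Updated the exact deduplication workflow: `deduplicator.remove()` now returns a `DocumentDataset` directly","2025-07-28T20:18:46",{"id":196,"version":197,"summary_zh":198,"released_at":199},81097,"v0.8.0","- Llama-based PII redaction\n- Trafilatura text extractor\n- Chinese and Japanese stopword lists for text extractors\n- Writing gzip-compressed JSONL datasets\n- Training dataset curation for retriever customization using hard-negative mining\n- Memory-efficient pairwise similarity computation in semantic deduplication","2025-05-09T01:11:57",{"id":201,"version":202,"summary_zh":203,"released_at":204},81098,"v0.8.0rc3.dev0","Prerelease: NVIDIA NeMo Curator 0.8.0rc3.dev0 (2025-04-15)","2025-04-15T19:44:52",{"id":206,"version":207,"summary_zh":208,"released_at":209},81099,"v0.8.0rc2.dev0","Prerelease: NVIDIA NeMo Curator 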
0.8.0rc2.dev0 (2025-04-07)","2025-04-07T20:15:47",{"id":211,"version":212,"summary_zh":213,"released_at":214},81100,"v0.7.1","- Fix Transformers + CUDA context error\n- Fix rate limiting in the SDG retriever evaluation tutorial","2025-03-31T22:52:55",{"id":216,"version":217,"summary_zh":218,"released_at":219},81101,"v0.7.0","- Python 3.12 support\n- Curator on Blackwell\n- Nemotron-CC dataset recipe\n- Performant S3 for fuzzy deduplication","2025-03-12T21:22:12",{"id":221,"version":222,"summary_zh":223,"released_at":224},81102,"v0.7.0rc2.dev0","Prerelease: NVIDIA NeMo Curator 0.7.0rc2.dev0 (2025-02-25)","2025-02-25T13:12:42",{"id":226,"version":227,"summary_zh":228,"released_at":229},81103,"v0.7.0rc1.dev1","Prerelease: NVIDIA NeMo Curator 0.7.0rc1.dev1 (2025-02-19)","2025-02-19T18:21:59",{"id":231,"version":232,"summary_zh":233,"released_at":234},81104,"v0.7.0rc0.dev1","Prerelease: NVIDIA NeMo Curator 0.7.0rc0.dev1 (2025-02-04)\r\n\r\n","2025-02-04T21:41:45",{"id":236,"version":237,"summary_zh":238,"released_at":239},81105,"v0.6.0","## What's changed\r\n\r\n- Synthetic Data Generation for Text Retrieval\r\n  - LLM-based Filters\r\n    - Easiness\r\n    - Answerability\r\n  - Q&A Retrieval Generation Pipeline\r\n- Parallel Dataset Curation for Machine Translation\r\n  - Load\u002FWrite Bitext Files\r\n  - Heuristic filtering (Histogram, Length Ratio)\r\n  - Classifier filtering (Comet, Cometoid)","2025-01-07T15:41:08",{"id":241,"version":242,"summary_zh":243,"released_at":244},81106,"v0.6.0rc2.dev1","Prerelease: NVIDIA NeMo Curator 0.6.0rc2.dev1 (2025-01-03)","2025-01-03T21:38:16",{"id":246,"version":247,"summary_zh":248,"released_at":249},81107,"0.6.0rc1.dev1","Prerelease: NVIDIA NeMo Curator 0.6.0rc1.dev1 (2024-12-20)","2024-12-20T22:24:50",{"id":251,"version":252,"summary_zh":252,"released_at":253},81108,"v0.6.0rc0","2024-12-13T06:09:14",{"id":255,"version":256,"summary_zh":257,"released_at":258},81109,"v0.5.1","## What's Changed\r\n* Add pin to NeMo Toolkit by @sarahyurick in 
https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F407\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fcompare\u002Fv0.5.0...v0.5.1","2024-12-03T22:27:21",{"id":260,"version":261,"summary_zh":262,"released_at":263},81110,"v0.5.0","## Highlights\r\n- Image Curation\r\n  - Image Embedding Creation\r\n  - Aesthetic Classifier\r\n  - NSFW Classifier\r\n  - Semantic Deduplication\r\n- Text Curation\r\n  - Quality Classifier\r\n  - Aegis Classifier\r\n  - FineWeb-Edu Classifier\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fcommits\u002Fv0.5.0","2024-10-30T17:30:17",{"id":265,"version":266,"summary_zh":267,"released_at":268},81111,"v0.4.1","## What's Changed\r\n* Add spacy\u003C3.8 pin to r0.4.1 by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F279\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fcompare\u002Fv0.4.0...v0.4.1","2024-10-03T19:35:20",{"id":270,"version":271,"summary_zh":272,"released_at":273},81112,"v0.4.0","## Highlights\r\n- Semantic Deduplication\r\n- Resiliparse for Text Extraction\r\n- Improve Distributed Data Classification - Domain classifier is 1.55x faster through intelligent batching\r\n- Synthetic data generation for fine-tuning\r\n\r\n## What's Changed\r\n* Update README by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F6\r\n* [Tutorials] Add a readme file for the TinyStories tutorial by @Maghoumi in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F5\r\n* Add workflow for running cpu pytests by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F13\r\n* Add pre-commit style checks by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F14\r\n* Add citation by @ryantwolf in 
https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F15\r\n* Fix Noisy CUDA Shutdown by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F20\r\n* Bump Python and RAPIDS versions by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F16\r\n* Add batched decorator by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F18\r\n* Add issue templates by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F22\r\n* Add dependency to fix justext by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F24\r\n* Fix metadata inference with pandas and dask by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F35\r\n* Disable PyTorch Compile Multiprocessing by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F34\r\n* Improve speed of AddId module by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F36\r\n* Make GPU dependencies optional by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F27\r\n* Fix failing GPU tests with latest pandas bump by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F41\r\n* Adds Nemo Curator K8s example by @terrykong in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F40\r\n* Move common dedup utils and remove unused code by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F42\r\n* Fix lang id example by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F37\r\n* Add dataset blending tool by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F32\r\n* High level fuzzy duplicates module by @ayushdg in 
https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F46\r\n* Fix indexing in PII Modifier by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F55\r\n* Disable string conversion globally by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F56\r\n* Fix issue #43 (empty files creation) and improve reading\u002Fwriting speed by @miguelusque in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F57\r\n* [Tutorials] Add a tutorial for PEFT data curation by @Maghoumi in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F45\r\n* Only import PII constants during Curator import by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F61\r\n* Align `extract_partitioning_index` logic with upstream shuffling by @rjzamora in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F60\r\n* [REVIEW] Switch Models to use Crossfit by @VibhuJawa in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F58\r\n* Remove argparse from get_client function signature by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F12\r\n* Fuzzy Dedup:  Use text_field instead of hardcoded text column by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F74\r\n* Add pull request template by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F78\r\n* Add jupyter notebook tutorial for single node mulilingual dataset by @nicoleeeluo in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F30\r\n* Update issue templates by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F81\r\n* Fix #91 - Incorrect reference to domain_classifier_example.py by @miguelusque in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F92\r\n* 
Fix #63. Add --input-meta parameter to explicitly specify the jsonl field dtypes by @miguelusque in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F75\r\n* Update readme by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F93\r\n* Update documentation for new version by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F83\r\n* Update requirements documentation. by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F98\r\n* Make sure query-planning is disabled for now by @rjzamora in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F97\r\n* Applying SEO Best Pratices by @aschilling-nv in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F104\r\n* Shuffle CC result on group before writing out by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F110\r\n* Added tutorials to index.rst by @jgerh in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F113\r\n* Pin to numpy\u003C2 to avoid spacy compat issues by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F119\r\n* Fix #116. Fix broken links by @miguelusque in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F117\r\n* Update index.rst by @aschilling-nv in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F129\r\n* Fix nemo_curator import in CPU only environment when GPU packages are installed. 
by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F123\r\n* Improve Co","2024-08-14T21:54:17",{"id":275,"version":276,"summary_zh":277,"released_at":278},81113,"v0.3.0","## What's Changed\r\n* Update README by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F6\r\n* [Tutorials] Add a readme file for the TinyStories tutorial by @Maghoumi in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F5\r\n* Add workflow for running cpu pytests by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F13\r\n* Add pre-commit style checks by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F14\r\n* Add citation by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F15\r\n* Fix Noisy CUDA Shutdown by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F20\r\n* Bump Python and RAPIDS versions by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F16\r\n* Add batched decorator by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F18\r\n* Add issue templates by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F22\r\n* Add dependency to fix justext by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F24\r\n* Fix metadata inference with pandas and dask by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F35\r\n* Disable PyTorch Compile Multiprocessing by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F34\r\n* Improve speed of AddId module by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F36\r\n* Make GPU dependencies optional by @ayushdg in 
https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F27\r\n* Fix failing GPU tests with latest pandas bump by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F41\r\n* Adds Nemo Curator K8s example by @terrykong in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F40\r\n* Move common dedup utils and remove unused code by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F42\r\n* Fix lang id example by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F37\r\n* Add dataset blending tool by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F32\r\n* High level fuzzy duplicates module by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F46\r\n* Fix indexing in PII Modifier by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F55\r\n* Disable string conversion globally by @ryantwolf in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F56\r\n* Fix issue #43 (empty files creation) and improve reading\u002Fwriting speed by @miguelusque in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F57\r\n* [Tutorials] Add a tutorial for PEFT data curation by @Maghoumi in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F45\r\n* Only import PII constants during Curator import by @ayushdg in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F61\r\n* Align `extract_partitioning_index` logic with upstream shuffling by @rjzamora in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F60\r\n\r\n## New Contributors\r\n* @Maghoumi made their first contribution in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F5\r\n* @terrykong made their first contribution in 
https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F40\r\n* @miguelusque made their first contribution in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F57\r\n* @rjzamora made their first contribution in https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fpull\u002F60\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FNVIDIA\u002FNeMo-Curator\u002Fcommits\u002Fv0.3.0\r\n\r\n## PyPi\r\nhttps:\u002F\u002Fpypi.org\u002Fproject\u002Fnemo-curator\u002F0.3.0\u002F","2024-06-10T21:59:44"]