[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-cleanlab--cleanlab":3,"tool-cleanlab--cleanlab":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",160411,2,"2026-04-18T23:33:24",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":64,"owner_name":72,"owner_avatar_url":73,"owner_bio":74,"owner_company":75,"owner_location":75,"owner_email":75,"owner_twitter":75,"owner_website":76,"owner_url":77,"languages":78,"stars":83,"forks":84,"last_commit_at":85,"license":86,"difficulty_score":87,"env_os":88,"env_gpu":89,"env_ram":89,"env_deps":90,"category_tags":98,"github_topics":100,"view_count":32,"oss_zip_url":75,"oss_zip_packed_at":75,"status":17,"created_at":121,"updated_at":122,"faqs":123,"releases":154},9463,"cleanlab\u002Fcleanlab","cleanlab","Cleanlab's open-source library is the standard data-centric AI package for data quality and machine learning with messy, real-world data and labels.","Cleanlab 是一款专注于“以数据为中心”的开源人工智能工具，旨在帮助开发者轻松应对现实世界中混乱、充满噪声的数据集。在机器学习实践中，模型效果不佳往往源于数据标注错误或样本质量问题，而非算法本身。Cleanlab 的核心价值在于它能利用你现有的机器学习模型，自动诊断并定位数据中的各类隐患，包括错误标签、异常值、重复样本等。\n\n无论是处理图像、文本、音频还是表格数据，Cleanlab 
都能通过简单的代码接口快速扫描数据集，生成详细的质量报告，甚至为多标注者场景提供一致性分析。其独特的技术亮点在于无需更换现有模型架构，仅通过分析模型的预测概率和特征嵌入，即可反向推导数据问题，指导用户优先修正哪些数据以获得最大的性能提升。\n\n这款工具非常适合机器学习工程师、数据科学家以及 AI 研究人员使用。它改变了传统只关注调参优化的工作流，倡导“先清洗数据，再优化模型”的高效范式。通过迭代式地发现问题、清洗数据并重新训练，用户往往能在不修改任何模型代码的情况下，显著提升最终系统的准确率与鲁棒性，是让真实世界数据发挥最大价值的得力助手。","\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fcleanlab_cleanlab_readme_50a28bfd6538.png\" width=60%>\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fpypi\u002Fcleanlab\u002F\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fcleanlab.svg\" alt=\"pypi_versions\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fpypi\u002Fcleanlab\u002F\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10%2B-blue\" alt=\"py_versions\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fapp.codecov.io\u002Fgh\u002Fcleanlab\u002Fcleanlab\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fcodecov.io\u002Fgh\u002Fcleanlab\u002Fcleanlab\u002Fbranch\u002Fmaster\u002Fgraph\u002Fbadge.svg\" alt=\"coverage\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fstargazers\u002F\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcleanlab\u002Fcleanlab?style=social&maxAge=2592000\" alt=\"Github Stars\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Ftwitter.com\u002FCleanlabAI\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Ffollow\u002FCleanlabAI?style=social\" alt=\"Twitter\">\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\u003Ch4 align=\"center\">\n    \u003Cp>\n        \u003Ca href=\"https:\u002F\u002Fdocs.cleanlab.ai\u002F\">Documentation\u003C\u002Fa> |\n        \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fexamples\">Examples\u003C\u002Fa> |\n        \u003Ca 
href=\"https:\u002F\u002Fcleanlab.ai\u002Fblog\u002Flearn\u002F\">Blog\u003C\u002Fa> |\n        \u003Ca href=\"#citation-and-related-publications\">Research\u003C\u002Fa>\n    \u003Cp>\n\u003C\u002Fh4>\n\nCleanlab’s open-source library helps you **clean** data and **lab**els by automatically detecting issues in a ML dataset. To facilitate **machine learning with messy, real-world data**, this data-centric AI package uses your *existing* models to estimate dataset problems that can be fixed to train even *better* models.\n \n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fcleanlab_cleanlab_readme_84f26135c316.png\" width=74%>\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n    Examples of various issues in Cat\u002FDog dataset \u003Cb>automatically detected\u003C\u002Fb> by cleanlab via this code:    \n\u003C\u002Fp>\n\n```python\n        lab = cleanlab.Datalab(data=dataset, label=\"column_name_for_labels\")\n        # Fit any ML model, get its feature_embeddings & pred_probs for your data\n        lab.find_issues(features=feature_embeddings, pred_probs=pred_probs)\n        lab.report()\n```\n\n- Use cleanlab to automatically check every: [text](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fdatalab\u002Ftext.html), [audio](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fdatalab\u002Faudio.html), [image](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fdatalab\u002Fimage.html), or [tabular](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fdatalab\u002Ftabular.html) dataset.\n- Use cleanlab to automatically: [detect data issues (outliers, duplicates, label errors, etc)](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fdatalab\u002Fdatalab_quickstart.html), [train robust models](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Findepth_overview.html), [infer consensus + annotator-quality for multi-annotator 
data](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fmultiannotator.html), [suggest data to (re)label next (active learning)](https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fexamples\u002Fblob\u002Fmaster\u002Factive_learning_multiannotator\u002Factive_learning.ipynb).\n\n\n---\n\n\n## Run cleanlab open-source\n\nThis cleanlab package runs on Python 3.10+ and supports Linux, macOS, as well as Windows.\n\n- Get started [here](https:\u002F\u002Fdocs.cleanlab.ai\u002F)! Install via `uv`, `pip`, or `conda`.\n- Developers who install the bleeding-edge from source should refer to [this master branch documentation](https:\u002F\u002Fdocs.cleanlab.ai\u002Fmaster\u002Findex.html).\n\n**Practicing data-centric AI can look like this:**\n1. Train initial ML model on original dataset.\n2. Utilize this model to diagnose data issues (via cleanlab methods) and improve the dataset.\n3. Train the same model on the improved dataset. \n4. Try various modeling techniques to further improve performance.\n\nMost folks jump from Step 1 → 4, but you may achieve big gains without *any* change to your modeling code by using cleanlab!\nContinuously boost performance by iterating Steps 2 → 4 (and try to evaluate with *cleaned* data).\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fcleanlab_cleanlab_readme_ba2bf5558e12.png)\n\n\n## Use cleanlab with any model and in most ML tasks\n\nAll features of cleanlab work with **any dataset** and **any model**. Yes, any model: PyTorch, Tensorflow, Keras, JAX, HuggingFace, OpenAI, XGBoost, scikit-learn, etc.\n\ncleanlab is useful across a wide variety of Machine Learning tasks. Specific tasks this data-centric AI package offers dedicated functionality for include:\n1. [Binary and multi-class classification](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Findepth_overview.html)\n2. 
[Multi-label classification](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fmultilabel_classification.html) (e.g. image\u002Fdocument tagging)\n3. [Token classification](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Ftoken_classification.html) (e.g. entity recognition in text)\n4. [Regression](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fregression.html) (predicting numerical column in a dataset)\n5. [Image segmentation](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fsegmentation.html) (images with per-pixel annotations)\n6. [Object detection](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fobject_detection.html) (images with bounding box annotations)\n7. [Classification with data labeled by multiple annotators](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fmultiannotator.html)\n8. [Active learning with multiple annotators](https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fexamples\u002Fblob\u002Fmaster\u002Factive_learning_multiannotator\u002Factive_learning.ipynb) (suggest which data to label or re-label to improve model most)\n9. [Outlier detection](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Foutliers.html) (identify atypical data that appears out of distribution)\n\nFor other ML tasks, cleanlab can still help you improve your dataset if appropriately applied.\nSee our [Example Notebooks](https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fexamples) and [Blog](https:\u002F\u002Fcleanlab.ai\u002Fblog\u002Flearn\u002F).\n\n\n## So fresh, so cleanlab\n\nBeyond automatically catching [all sorts of issues](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Fcleanlab\u002Fdatalab\u002Fguide\u002Fissue_type_description.html) lurking in your data, this data-centric AI package helps you deal with **noisy labels** and train more **robust ML models**.\nHere's an example:\n\n```python\n\n# cleanlab works with **any classifier**. 
Yup, you can use PyTorch\u002FTensorFlow\u002FOpenAI\u002FXGBoost\u002Fetc.\ncl = cleanlab.classification.CleanLearning(sklearn.YourFavoriteClassifier())\n\n# cleanlab finds data and label issues in **any dataset**... in ONE line of code!\nlabel_issues = cl.find_label_issues(data, labels)\n\n# cleanlab trains a robust version of your model that works more reliably with noisy data.\ncl.fit(data, labels)\n\n# cleanlab estimates the predictions you would have gotten if you had trained with *no* label issues.\ncl.predict(test_data)\n\n# A universal data-centric AI tool, cleanlab quantifies class-level issues and overall data quality, for any dataset.\ncleanlab.dataset.health_summary(labels, confident_joint=cl.confident_joint)\n```\n\ncleanlab **clean**s your data's **lab**els via state-of-the-art *confident learning* algorithms, published in this [paper](https:\u002F\u002Fjair.org\u002Findex.php\u002Fjair\u002Farticle\u002Fview\u002F12125) and [blog](https:\u002F\u002Fl7.curtisnorthcutt.com\u002Fconfident-learning). See some of the datasets cleaned with cleanlab at [labelerrors.com](https:\u002F\u002Flabelerrors.com).\n\ncleanlab is:\n\n1. **backed by theory** -- with [provable guarantees](https:\u002F\u002Farxiv.org\u002Fabs\u002F1911.00068) of exact label noise estimation, even with imperfect models.\n2. **fast** -- code is parallelized and scalable.\n3. **easy to use** -- one line of code to find mislabeled data, bad annotators, outliers, or train noise-robust models.\n4. **general** -- works with **[any dataset](https:\u002F\u002Flabelerrors.com\u002F)** (text, image, tabular, audio,...) + **any model** (PyTorch, OpenAI, XGBoost,...)\n\u003Cbr\u002F>\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fcleanlab_cleanlab_readme_29db5c4ce66b.png)\n\u003Cp align=\"center\">\nExamples of incorrect given labels in various image datasets \u003Ca href=\"https:\u002F\u002Fl7.curtisnorthcutt.com\u002Flabel-errors\">found and corrected\u003C\u002Fa> using cleanlab. 
\nWhile these examples are from image datasets, this also works for text, audio, tabular data.\n\u003C\u002Fp>\n\n\n## Citation and related publications\n\ncleanlab is based on peer-reviewed research. Here are relevant papers to cite if you use this package:\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F1911.00068\">Confident Learning (JAIR '21)\u003C\u002Fa> (\u003Cb>click to show bibtex\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @article{northcutt2021confidentlearning,\n        title={Confident Learning: Estimating Uncertainty in Dataset Labels},\n        author={Curtis G. Northcutt and Lu Jiang and Isaac L. Chuang},\n        journal={Journal of Artificial Intelligence Research (JAIR)},\n        volume={70},\n        pages={1373--1411},\n        year={2021}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.01936\">Rank Pruning (UAI '17)\u003C\u002Fa> (\u003Cb>click to show bibtex\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{northcutt2017rankpruning,\n        author={Northcutt, Curtis G. 
and Wu, Tailin and Chuang, Isaac L.},\n        title={Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels},\n        booktitle = {Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence},\n        series = {UAI'17},\n        year = {2017},\n        location = {Sydney, Australia},\n        numpages = {10},\n        url = {http:\u002F\u002Fauai.org\u002Fuai2017\u002Fproceedings\u002Fpapers\u002F35.pdf},\n        publisher = {AUAI Press},\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Fjonasmueller.org\u002Finfo\u002FLabelQuality_icml.pdf\"> Label Quality Scoring (ICML '22)\u003C\u002Fa> (\u003Cb>click to show bibtex\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{kuan2022labelquality,\n        title={Model-agnostic label quality scoring to detect real-world label errors},\n        author={Kuan, Johnson and Mueller, Jonas},\n        booktitle={ICML DataPerf Workshop},\n        year={2022}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.03920\"> Label Errors in Token Classification \u002F Entity Recognition (NeurIPS '22)\u003C\u002Fa> (\u003Cb>click to show bibtex\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{wang2022tokenerrors,\n        title={Detecting label errors in token classification data},\n        author={Wang, Wei-Chen and Mueller, Jonas},\n        booktitle={NeurIPS Workshop on Interactive Learning for Natural Language Processing (InterNLP)},\n        year={2022}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.13895\"> Label Errors in Multi-Label Classification (ICLR '23)\u003C\u002Fa> (\u003Cb>click to show bibtex\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{thyagarajan2023multilabel,\n        title={Identifying Incorrect Annotations in Multi-Label 
Classification Data},\n        author={Thyagarajan, Aditya and Snorrason, Elías and Northcutt, Curtis and Mueller, Jonas},\n        booktitle={ICLR Workshop on Trustworthy ML},\n        year={2023}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.00832\"> Label Errors in Object Detection (ICML '23)\u003C\u002Fa> (\u003Cb>click to show bibtex\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{tkachenko2023objectlab,\n        title={ObjectLab: Automated Diagnosis of Mislabeled Images in Object Detection Data},\n        author={Tkachenko, Ulyana and Thyagarajan, Aditya and Mueller, Jonas},\n        booktitle={ICML Workshop on Data-centric Machine Learning Research},\n        year={2023}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.05080\"> Label Errors in Image Segmentation (ICML '23)\u003C\u002Fa> (\u003Cb>click to show bibtex\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{lad2023segmentation,\n        title={Estimating label quality and errors in semantic segmentation data via any model},\n        author={Lad, Vedang and Mueller, Jonas},\n        booktitle={ICML Workshop on Data-centric Machine Learning Research},\n        year={2023}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.16583\"> Detecting Errors in Numerical Data (DMLR '24)\u003C\u002Fa> (\u003Cb>click to show bibtex\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{zhou2023errors,\n        title={Detecting Errors in a Numerical Response via any Regression Model},\n        author={Zhou, Hang and Mueller, Jonas and Kumar, Mayank and Wang, Jane-Ling and Lei, Jing},\n        booktitle={Journal of Data-centric Machine Learning Research},\n        year={2024}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca 
href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.03061\"> Out-of-Distribution Detection (ICML '22)\u003C\u002Fa> (\u003Cb>click to show bibtex\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{kuan2022ood,\n        title={Back to the Basics: Revisiting Out-of-Distribution Detection Baselines},\n        author={Kuan, Johnson and Mueller, Jonas},\n        booktitle={ICML Workshop on Principles of Distribution Shift},\n        year={2022}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.06812\"> CROWDLAB for Data with Multiple Annotators (NeurIPS '22)\u003C\u002Fa> (\u003Cb>click to show bibtex\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{goh2022crowdlab,\n        title={CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators},\n        author={Goh, Hui Wen and Tkachenko, Ulyana and Mueller, Jonas},\n        booktitle={NeurIPS Human in the Loop Learning Workshop},\n        year={2022}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.11856\"> ActiveLab: Active learning with data re-labeling (ICLR '23)\u003C\u002Fa> (\u003Cb>click to show bibtex\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{goh2023activelab,\n        title={ActiveLab: Active Learning with Re-Labeling by Multiple Annotators},\n        author={Goh, Hui Wen and Mueller, Jonas},\n        booktitle={ICLR Workshop on Trustworthy ML},\n        year={2023}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.15696\"> Detecting Dataset Drift and Non-IID Sampling (ICML '23)\u003C\u002Fa> (\u003Cb>click to show bibtex\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{cummings2023drift,\n        title={Detecting Dataset Drift and Non-IID Sampling via k-Nearest Neighbors},\n        author={Cummings, 
Jesse and Snorrason, Elías and Mueller, Jonas},\n        booktitle={ICML Workshop on Data-centric Machine Learning Research},\n        year={2023}\n    }\n\n\u003C\u002Fdetails>\n\nTo understand\u002Fcite other cleanlab functionality not described above, check out our [Blog](https:\u002F\u002Fcleanlab.ai\u002Fblog\u002Flearn).\n\n\n## Other resources\n\n- [Example Notebooks demonstrating practical applications of this package](https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fexamples)\n\n- [Cleanlab Blog](https:\u002F\u002Fcleanlab.ai\u002Fblog\u002Flearn)\n\n- [Blog post: Introduction to Confident Learning](https:\u002F\u002Fl7.curtisnorthcutt.com\u002Fconfident-learning)\n\n- [NeurIPS 2021 paper: Pervasive Label Errors in Test Sets Destabilize Machine Learning Benchmarks](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.14749)\n\n- [Introduction to Data-centric AI (MIT IAP Course)](https:\u002F\u002Fdcai.csail.mit.edu\u002F)\n\n- [Release notes for past versions](https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Freleases)\n\n**Interested in contributing?**  See the [contributing guide](CONTRIBUTING.md), [development guide](DEVELOPMENT.md), and [ideas on useful contributions](https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fwiki#ideas-for-contributing-to-cleanlab).\n\n**Have questions?**  Check out [our FAQ](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Ffaq.html) and [Github Issues](https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fissues?q=is%3Aissue).\n","\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fcleanlab_cleanlab_readme_50a28bfd6538.png\" width=60%>\n\u003C\u002Fdiv>\n\n\u003Cdiv align=\"center\">\n\u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fpypi\u002Fcleanlab\u002F\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fcleanlab.svg\" alt=\"pypi_versions\">\u003C\u002Fa>\n\u003Ca 
href=\"https:\u002F\u002Fpypi.org\u002Fpypi\u002Fcleanlab\u002F\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10%2B-blue\" alt=\"py_versions\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fapp.codecov.io\u002Fgh\u002Fcleanlab\u002Fcleanlab\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fcodecov.io\u002Fgh\u002Fcleanlab\u002Fcleanlab\u002Fbranch\u002Fmaster\u002Fgraph\u002Fbadge.svg\" alt=\"coverage\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fstargazers\u002F\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Fcleanlab\u002Fcleanlab?style=social&maxAge=2592000\" alt=\"Github Stars\">\u003C\u002Fa>\n\u003Ca href=\"https:\u002F\u002Ftwitter.com\u002FCleanlabAI\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Ffollow\u002FCleanlabAI?style=social\" alt=\"Twitter\">\u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\u003Ch4 align=\"center\">\n    \u003Cp>\n        \u003Ca href=\"https:\u002F\u002Fdocs.cleanlab.ai\u002F\">文档\u003C\u002Fa> |\n        \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fexamples\">示例\u003C\u002Fa> |\n        \u003Ca href=\"https:\u002F\u002Fcleanlab.ai\u002Fblog\u002Flearn\u002F\">博客\u003C\u002Fa> |\n        \u003Ca href=\"#citation-and-related-publications\">研究\u003C\u002Fa>\n    \u003Cp>\n\u003C\u002Fh4>\n\nCleanlab 的开源库能够帮助您通过自动检测机器学习数据集中的问题来**清理**数据和**标注**。为了促进使用**混乱的真实世界数据进行机器学习**，这个以数据为中心的人工智能工具包会利用您*现有的*模型来估计数据集中的问题，并提出可修复的建议，从而训练出更加*优秀*的模型。\n \n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fcleanlab_cleanlab_readme_84f26135c316.png\" width=74%>\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n    通过这段代码，Cleanlab 可以\u003Cb>自动检测\u003C\u002Fb>猫狗数据集中存在的各种问题：\n\u003C\u002Fp>\n\n```python\n        lab = cleanlab.Datalab(data=dataset, label=\"column_name_for_labels\")\n        # 拟合任意机器学习模型，获取其特征嵌入和预测概率\n       
 lab.find_issues(features=feature_embeddings, pred_probs=pred_probs)\n        lab.report()\n```\n\n- 使用 Cleanlab 自动检查任何：[文本](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fdatalab\u002Ftext.html)、[音频](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fdatalab\u002Faudio.html)、[图像](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fdatalab\u002Fimage.html)，或[表格数据](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fdatalab\u002Ftabular.html)集。\n- 使用 Cleanlab 自动：[检测数据问题（异常值、重复样本、标签错误等）](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fdatalab\u002Fdatalab_quickstart.html)、[训练鲁棒模型](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Findepth_overview.html)、[推断多标注者数据的一致性及标注者质量](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fmultiannotator.html)、[建议下一步需要重新标注的数据（主动学习）](https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fexamples\u002Fblob\u002Fmaster\u002Factive_learning_multiannotator\u002Factive_learning.ipynb)。\n\n\n---\n\n\n## 运行 Cleanlab 开源工具\n\n此 Cleanlab 软件包支持 Python 3.10 及以上版本，并兼容 Linux、macOS 和 Windows。\n\n- 从[这里](https:\u002F\u002Fdocs.cleanlab.ai\u002F)开始！可通过 `uv`、`pip` 或 `conda` 安装。\n- 从源码安装最新开发版的开发者，请参考[主分支文档](https:\u002F\u002Fdocs.cleanlab.ai\u002Fmaster\u002Findex.html)。\n\n**以数据为中心的人工智能实践可以这样进行：**\n1. 在原始数据集上训练初始机器学习模型。\n2. 利用该模型诊断数据问题（通过 Cleanlab 方法），并改进数据集。\n3. 在改进后的数据集上重新训练同一模型。\n4. 尝试不同的建模技术以进一步提升性能。\n\n大多数人通常直接从步骤 1 跳到步骤 4，但您只需使用 Cleanlab，无需对建模代码做任何修改，就能获得显著收益！通过不断迭代步骤 2 至 4 来持续提升性能（并尽量使用*清理后*的数据进行评估）。\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fcleanlab_cleanlab_readme_ba2bf5558e12.png)\n\n\n## Cleanlab 可与任何模型配合，适用于大多数机器学习任务\n\nCleanlab 的所有功能都适用于**任何数据集**和**任何模型**。是的，任何模型：PyTorch、TensorFlow、Keras、JAX、HuggingFace、OpenAI、XGBoost、scikit-learn 等。\n\nCleanlab 在多种机器学习任务中都非常有用。该以数据为中心的人工智能工具包为以下特定任务提供了专门的功能：\n1. [二分类和多分类](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Findepth_overview.html)\n2. 
[多标签分类](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fmultilabel_classification.html)（例如图像\u002F文档标注）\n3. [标记分类](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Ftoken_classification.html)（例如文本中的实体识别）\n4. [回归](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fregression.html)（预测数据集中数值型列）\n5. [图像分割](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fsegmentation.html)（带有像素级标注的图像）\n6. [目标检测](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fobject_detection.html)（带有边界框标注的图像）\n7. [多标注者标注的数据分类](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fmultiannotator.html)\n8. [多标注者的主动学习](https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fexamples\u002Fblob\u002Fmaster\u002Factive_learning_multiannotator\u002Factive_learning.ipynb)（建议优先标注或重新标注哪些数据以最大程度提升模型性能）\n9. [异常值检测](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Foutliers.html)（识别分布外的异常数据）\n\n对于其他机器学习任务，只要应用得当，Cleanlab 仍然可以帮助您改进数据集。\n请参阅我们的[示例笔记本](https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fexamples)和[博客](https:\u002F\u002Fcleanlab.ai\u002Fblog\u002Flearn\u002F)。\n\n\n## 如此清新，如此 Cleanlab\n\n除了能够自动捕捉您数据中潜藏的[各类问题](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Fcleanlab\u002Fdatalab\u002Fguide\u002Fissue_type_description.html)之外，这款以数据为中心的 AI 工具还能帮助您处理**噪声标签**，并训练出更加**鲁棒的机器学习模型**。以下是一个示例：\n\n```python\n\n# Cleanlab 适用于**任何分类器**。没错，您可以使用 PyTorch\u002FTensorFlow\u002FOpenAI\u002FXGBoost 等。\ncl = cleanlab.classification.CleanLearning(sklearn.YourFavoriteClassifier())\n\n# Cleanlab 可以在**任何数据集**中找到数据和标签问题……只需一行代码！\nlabel_issues = cl.find_label_issues(data, labels)\n\n# Cleanlab 会训练一个鲁棒版本的您的模型，使其在存在噪声标签的情况下也能更可靠地工作。\ncl.fit(data, labels)\n\n# Cleanlab 估算如果您在没有标签问题的情况下训练模型，将会得到怎样的预测结果。\ncl.predict(test_data)\n\n# 一款通用的数据中心型人工智能工具，cleanlab 可以量化任何数据集中的类别级问题及整体数据质量。\ncleanlab.dataset.health_summary(labels, confident_joint=cl.confident_joint)\n```\n\ncleanlab 通过最先进的 *置信学习* 算法来“清理”你的数据标签，这些算法发表在本 
[论文](https:\u002F\u002Fjair.org\u002Findex.php\u002Fjair\u002Farticle\u002Fview\u002F12125) 和 [博客](https:\u002F\u002Fl7.curtisnorthcutt.com\u002Fconfident-learning) 中。你可以在 [labelerrors.com](https:\u002F\u002Flabelerrors.com) 上查看一些使用 cleanlab 清理过的数据集。\n\ncleanlab 具有以下特点：\n\n1. **理论支持**——即使模型不完美，也能提供可证明的精确标签噪声估计保证。\n2. **速度快**——代码已并行化且具有可扩展性。\n3. **易于使用**——只需一行代码即可找出误标注数据、不良标注者、异常值，或训练对噪声鲁棒的模型。\n4. **通用性强**——适用于 **[任何数据集](https:\u002F\u002Flabelerrors.com)**（文本、图像、表格、音频等）以及 **任何模型**（PyTorch、OpenAI、XGBoost 等）。\n\u003Cbr\u002F>\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fcleanlab_cleanlab_readme_29db5c4ce66b.png)\n\u003Cp align=\"center\">\n使用 cleanlab 找到并修正的各类图像数据集中错误标注的示例。虽然这些例子来自图像数据集，但该方法同样适用于文本、音频和表格数据。\n\u003C\u002Fp>\n\n## 引用及相关出版物\n\ncleanlab 基于经过同行评审的研究。如果您使用此软件包，请引用以下相关论文：\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F1911.00068\">置信学习（JAIR '21）\u003C\u002Fa> (\u003Cb>点击以显示 BibTeX\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @article{northcutt2021confidentlearning,\n        title={Confident Learning: Estimating Uncertainty in Dataset Labels},\n        author={Curtis G. Northcutt and Lu Jiang and Isaac L. Chuang},\n        journal={Journal of Artificial Intelligence Research (JAIR)},\n        volume={70},\n        pages={1373--1411},\n        year={2021}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.01936\">排名剪枝（UAI '17）\u003C\u002Fa> (\u003Cb>点击以显示 BibTeX\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{northcutt2017rankpruning,\n        author={Northcutt, Curtis G. 
and Wu, Tailin and Chuang, Isaac L.},\n        title={Learning with Confident Examples: Rank Pruning for Robust Classification with Noisy Labels},\n        booktitle = {Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence},\n        series = {UAI'17},\n        year = {2017},\n        location = {Sydney, Australia},\n        numpages = {10},\n        url = {http:\u002F\u002Fauai.org\u002Fuai2017\u002Fproceedings\u002Fpapers\u002F35.pdf},\n        publisher = {AUAI Press},\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Fjonasmueller.org\u002Finfo\u002FLabelQuality_icml.pdf\">标签质量评分（ICML '22）\u003C\u002Fa> (\u003Cb>点击以显示 BibTeX\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{kuan2022labelquality,\n        title={Model-agnostic label quality scoring to detect real-world label errors},\n        author={Kuan, Johnson and Mueller, Jonas},\n        booktitle={ICML DataPerf Workshop},\n        year={2022}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.03920\">标记分类\u002F实体识别中的标签错误（NeurIPS '22）\u003C\u002Fa> (\u003Cb>点击以显示 BibTeX\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{wang2022tokenerrors,\n        title={Detecting label errors in token classification data},\n        author={Wang, Wei-Chen and Mueller, Jonas},\n        booktitle={NeurIPS Workshop on Interactive Learning for Natural Language Processing (InterNLP)},\n        year={2022}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.13895\">多标签分类中的标签错误（ICLR '23）\u003C\u002Fa> (\u003Cb>点击以显示 BibTeX\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{thyagarajan2023multilabel,\n        title={Identifying Incorrect Annotations in Multi-Label Classification Data},\n        author={Thyagarajan, Aditya and Snorrason, Elías and Northcutt, Curtis and Mueller, Jonas},\n      
  booktitle={ICLR Workshop on Trustworthy ML},\n        year={2023}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.00832\">目标检测中的标签错误（ICML '23）\u003C\u002Fa> (\u003Cb>点击以显示 BibTeX\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{tkachenko2023objectlab,\n        title={ObjectLab: Automated Diagnosis of Mislabeled Images in Object Detection Data},\n        author={Tkachenko, Ulyana and Thyagarajan, Aditya and Mueller, Jonas},\n        booktitle={ICML Workshop on Data-centric Machine Learning Research},\n        year={2023}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.05080\">图像分割中的标签错误（ICML '23）\u003C\u002Fa> (\u003Cb>点击以显示 BibTeX\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{lad2023segmentation,\n        title={Estimating label quality and errors in semantic segmentation data via any model},\n        author={Lad, Vedang and Mueller, Jonas},\n        booktitle={ICML Workshop on Data-centric Machine Learning Research},\n        year={2023}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.16583\">数值数据中的错误检测（DMLR '24）\u003C\u002Fa> (\u003Cb>点击以显示 BibTeX\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{zhou2023errors,\n        title={Detecting Errors in a Numerical Response via any Regression Model},\n        author={Zhou, Hang and Mueller, Jonas and Kumar, Mayank and Wang, Jane-Ling and Lei, Jing},\n        booktitle={Journal of Data-centric Machine Learning Research},\n        year={2024}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2207.03061\">分布外检测（ICML '22）\u003C\u002Fa> (\u003Cb>点击以显示 BibTeX\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{kuan2022ood,\n        title={Back to the Basics: Revisiting Out-of-Distribution 
Detection Baselines},\n        author={Kuan, Johnson and Mueller, Jonas},\n        booktitle={ICML Workshop on Principles of Distribution Shift},\n        year={2022}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2210.06812\">用于多标注者数据的 CROWDLAB（NeurIPS '22）\u003C\u002Fa> (\u003Cb>点击以显示 BibTeX\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{goh2022crowdlab,\n        title={CROWDLAB: Supervised learning to infer consensus labels and quality scores for data with multiple annotators},\n        author={Goh, Hui Wen and Tkachenko, Ulyana and Mueller, Jonas},\n        booktitle={NeurIPS Human in the Loop Learning Workshop},\n        year={2022}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2301.11856\">ActiveLab：带数据重新标注的主动学习（ICLR '23）\u003C\u002Fa> (\u003Cb>点击以显示 BibTeX\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{goh2023activelab,\n        title={ActiveLab: Active Learning with Re-Labeling by Multiple Annotators},\n        author={Goh, Hui Wen and Mueller, Jonas},\n        booktitle={ICLR Workshop on Trustworthy ML},\n        year={2023}\n    }\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\u003Csummary>\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.15696\">检测数据集漂移和非 IID 采样（ICML '23）\u003C\u002Fa> (\u003Cb>点击以显示 BibTeX\u003C\u002Fb>) \u003C\u002Fsummary>\n\n    @inproceedings{cummings2023drift,\n        title={Detecting Dataset Drift and Non-IID Sampling via k-Nearest Neighbors},\n        author={Cummings, Jesse and Snorrason, Elías and Mueller, Jonas},\n        booktitle={ICML Workshop on Data-centric Machine Learning Research},\n        year={2023}\n    }\n\n\u003C\u002Fdetails>\n\n如需了解或引用上述未提及的 cleanlab 其他功能，请访问我们的 [博客](https:\u002F\u002Fcleanlab.ai\u002Fblog\u002Flearn)。\n\n## 其他资源\n\n- [演示本包实际应用的示例笔记本](https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fexamples)\n\n- [Cleanlab 
博客](https:\u002F\u002Fcleanlab.ai\u002Fblog\u002Flearn)\n\n- [博客文章：置信学习简介](https:\u002F\u002Fl7.curtisnorthcutt.com\u002Fconfident-learning)\n\n- [NeurIPS 2021 论文：测试集中普遍存在的标签错误会破坏机器学习基准的稳定性](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.14749)\n\n- [以数据为中心的人工智能导论（MIT IAP 课程）](https:\u002F\u002Fdcai.csail.mit.edu\u002F)\n\n- [过往版本的发布说明](https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Freleases)\n\n**有兴趣贡献吗？** 请参阅[贡献指南](CONTRIBUTING.md)、[开发指南](DEVELOPMENT.md)，以及[关于有用贡献的想法](https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fwiki#ideas-for-contributing-to-cleanlab)。\n\n**有问题吗？** 请查看[我们的常见问题解答](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Ffaq.html)和[GitHub 问题](https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fissues?q=is%3Aissue)。","# Cleanlab 快速上手指南\n\nCleanlab 是一个开源的数据中心 AI 库，旨在通过自动检测机器学习数据集中的问题（如标签错误、异常值、重复数据等）来“清洗”数据和标签。它利用你现有的模型来诊断数据问题，从而训练出更鲁棒的模型。\n\n## 环境准备\n\n在开始之前，请确保你的开发环境满足以下要求：\n\n*   **操作系统**：支持 Linux, macOS 以及 Windows。\n*   **Python 版本**：需要 Python 3.10 或更高版本。\n*   **前置依赖**：你需要有一个已经训练好或可以训练的机器学习模型（支持 PyTorch, TensorFlow, Scikit-learn, XGBoost, HuggingFace 等任意框架），并能够获取模型的预测概率 (`pred_probs`) 或特征嵌入 (`features`)。\n\n## 安装步骤\n\n推荐使用 `pip` 进行安装。国内开发者建议使用清华或阿里镜像源以加速下载。\n\n**使用 pip 安装（推荐）：**\n```bash\npip install cleanlab -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n**使用 conda 安装：**\n```bash\nconda install -c conda-forge cleanlab\n```\n\n**使用 uv 安装（极速）：**\n```bash\nuv pip install cleanlab\n```\n\n## 基本使用\n\nCleanlab 的核心工作流非常简洁：初始化 `Datalab` -> 传入模型预测结果 -> 自动发现问题 -> 生成报告。\n\n以下是最简单的快速开始示例，适用于文本、图像、音频或表格数据：\n\n### 1. 自动检测数据问题 (Datalab)\n\n假设你已经有了数据集 `dataset`（包含特征和标签列），并且通过某个模型获得了特征嵌入 `feature_embeddings` 和预测概率 `pred_probs`。\n\n```python\nimport cleanlab\n\n# 1. 初始化 Datalab，指定数据源和标签列名\nlab = cleanlab.Datalab(data=dataset, label_name=\"column_name_for_labels\")\n\n# 2. 
传入模型输出的特征和预测概率，自动查找各类问题\n# 支持检测：标签错误、异常值、重复数据、类不平衡等\nlab.find_issues(features=feature_embeddings, pred_probs=pred_probs)\n\n# 3. 打印详细的问题诊断报告\nlab.report()\n```\n\n执行后，`lab.report()` 将输出数据集中存在的具体问题列表及严重程度，帮助你决定清洗哪些数据。\n\n### 2. 训练抗噪模型 (CleanLearning)\n\n如果你希望直接利用 Cleanlab 训练一个对噪声标签具有鲁棒性的分类器，可以使用 `CleanLearning` 包装器。它兼容任何 Scikit-learn 风格的分类器。\n\n```python\nfrom sklearn.linear_model import LogisticRegression\nimport cleanlab\n\n# 1. 包装你喜欢的任意分类器 (支持 PyTorch, XGBoost 等封装为 sklearn 接口的模型)\ncl = cleanlab.classification.CleanLearning(LogisticRegression())\n\n# 2. 一行代码找出标签错误\nlabel_issues = cl.find_label_issues(X, y)\n\n# 3. 训练一个鲁棒的模型（自动处理噪声数据）\ncl.fit(X, y)\n\n# 4. 进行预测（相当于在无标签错误的数据上训练后的效果）\npredictions = cl.predict(X_test)\n\n# 5. 查看数据集整体健康度总结\ncleanlab.dataset.health_summary(y, confident_joint=cl.confident_joint)\n```\n\n通过上述步骤，你可以无需修改核心建模代码，仅通过改善数据质量来显著提升模型性能。","某电商团队正在构建一个基于用户评论图片的自动分类系统，旨在识别“商品破损”、“发错货”等售后问题，但训练数据由众包团队标注，存在大量噪声。\n\n### 没有 cleanlab 时\n- 模型在验证集上准确率始终卡在 78%，团队误以为是模型架构不够先进，盲目尝试更复杂的神经网络却收效甚微。\n- 人工抽检发现数据集中混入了约 15% 的标错样本（如将“正常商品”误标为“破损”），但面对十万级数据，人工逐条清洗耗时需数周。\n- 数据中存在大量重复图片和异常离群点（如全黑或无关广告图），导致模型过拟合噪声，泛化能力严重不足。\n- 团队无法量化标注质量，难以判断是哪些特定类别的标签最不可靠，优化工作如同“盲人摸象”。\n\n### 使用 cleanlab 后\n- cleanlab 利用现有模型的预测概率，自动计算出每个样本的“标签可信度”，在几分钟内精准定位出数千个潜在的标错样本。\n- 团队优先复核 cleanlab 排序靠前的可疑数据，仅用两天就完成了核心噪声清洗，使数据集质量显著提升。\n- 通过 cleanlab 检测出的重复项和离群点被自动剔除，模型不再学习无效特征，训练过程更加稳定收敛。\n- 生成的详细报告直观展示了各类别的标注一致性，指导团队针对性地重新标注高风险类别，而非盲目全量返工。\n\n经过一轮“诊断 - 清洗 - 重训”循环，在未改变任何模型代码的情况下，系统准确率从 78% 跃升至 92%，证明了高质量数据比复杂模型更能决定 AI 上限。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fcleanlab_cleanlab_29db5c4c.png","Cleanlab","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fcleanlab_8a4bfd85.png","Tools to make your AI reliable and trustworthy.",null,"https:\u002F\u002Fcleanlab.ai\u002F","https:\u002F\u002Fgithub.com\u002Fcleanlab",[79],{"name":80,"color":81,"percentage":82},"Python","#3572A5",100,11435,886,"2026-04-18T13:19:06","Apache-2.0",1,"Linux, macOS, 
Windows","未说明",{"notes":91,"python":92,"dependencies":93},"该工具设计为与任何机器学习模型（如 PyTorch, TensorFlow, XGBoost, scikit-learn 等）配合使用，因此具体的 GPU、内存及额外依赖取决于用户所选用的底层模型框架。支持通过 pip, conda 或 uv 安装。代码经过并行化优化，具有可扩展性。","3.10+",[94,95,96,97],"numpy","pandas","scikit-learn","tqdm",[99,16,14],"其他",[101,102,103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120],"weak-supervision","data-cleaning","data-quality","noisy-labels","data-centric-ai","out-of-distribution-detection","outlier-detection","active-learning","data-labeling","data-profiling","data-validation","labeling","data-curation","annotation","datasets","exploratory-data-analysis","data-annotation","machine-learning","anomaly-detection","data-science","2026-03-27T02:49:30.150509","2026-04-19T15:38:52.409078",[124,129,134,139,144,149],{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},42444,"使用 get_noise_indices 时遇到 'IndexError: index out of bounds' 错误怎么办？","这通常是因为标签（labels）的数值不连续或未从 0 开始索引。例如，如果原始标签是 [1, 4]，cleanlab 可能会尝试访问不存在的索引。解决方法是将标签重新映射为从 0 开始的连续整数（如转换为 [0, 1]）。此外，最新版本的 cleanlab 在输入格式错误时会打印更详细的错误信息，建议升级版本并检查输入格式是否符合 FAQ 中的说明。","https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fissues\u002F40",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},42445,"处理大规模数据集（如 20 万样本、5000 类别）时出现内存溢出（OOM）如何解决？","内存问题可能与 numpy 的 .flatten() 操作或系统环境有关。建议首先尝试直接在输入矩阵 psx 上运行 psx.flatten() 而不调用 pruning.get_noise_indices()，以判断是否是 numpy 本身的问题。同时，请确认您的操作系统分布和版本（Windows, Ubuntu, macOS 等），因为不同平台下的内存管理行为可能不同。如果 psx 矩阵本身未占用超过 50% 内存却仍报错，可能是 numpy 配置问题。","https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fissues\u002F53",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},42446,"当数据集类别数量极多（如 3900 类）且 psx 矩阵稀疏时，cleanlab 效果不佳怎么办？","如果不提供完整的 psx 矩阵给 cleanlab，它将无法准确估计噪声阈值，从而遗漏大量标签错误。虽然在某些理想情况（如均匀随机噪声）下部分数据可能可行，但在现实场景中，必须提供完整的 psx 矩阵以便 cleanlab 
发现所有错误并提供干净数据用于训练。对于超大类别数，请关注后续版本对大规模类别的支持优化。","https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fissues\u002F34",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},42447,"cleanlab 是否支持命名实体识别（NER）等 Token 级分类任务？","是的，cleanlab 正在开发新算法以高效支持数万类别的场景，适用于词级和 token 级分析（如 NER）。该功能计划作为 v2.1 版本的一部分发布。用户可关注官方更新，届时将能直接应用于包含数十万 token 的数据集。","https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fissues\u002F78",{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},42448,"cleanlab 是否支持目标检测（Object Detection）任务的标签错误检测？","是的，已有 PR（#676）添加了对目标检测中标签错误检测的支持。该算法已最终确定，即将合并到主分支。用户可以提前试用该 PR 中的代码来检测目标检测数据集中的标签错误。","https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fissues\u002F21",{"id":150,"question_zh":151,"answer_zh":152,"source_url":153},42449,"在多标签分类（Multi-label Classification）中使用 cleanlab 时报错 'y should be a 1d array' 如何解决？","此问题在 cleanlab v2.2 中已通过重大重构解决。新版本大幅增强了多标签分类的标签错误检测能力，并简化了使用流程。请升级到 cleanlab v2.2 或更高版本，并参考官方教程：https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fmultilabel_classification.html。若仍遇到问题，请新开 Issue 并注明所使用的 cleanlab 版本。","https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fissues\u002F55",[155,160,165,170,175,180,185,190,195,200,205,210,215,220,225,230,235,240,245,250],{"id":156,"version":157,"summary_zh":158,"released_at":159},342179,"v2.9.0","# v2.9.0 - 依赖现代化与维护简化\n\n本次发布通过移除对 TensorFlow\u002FKeras 的支持，精简了 cleanlab 的依赖关系，从而减轻维护负担并提升与现代 Python 生态系统的兼容性。对于依赖 `cleanlab.models.keras` 的用户而言，这是一个破坏性更新，他们需要迁移到其他模型封装、自定义实现或直接使用 PyTorch。\n\n## 变更内容\n* 由 @ulya-tkch 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1289 中从文档和代码中移除 TensorFlow。\n* 由 @ulya-tkch 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1291 中修复过时的依赖，并添加 CI 新鲜度检查。\n* 由 @ulya-tkch 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1292 中使代码库兼容最新依赖。\n* 由 @ulya-tkch 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1293 
中更新文档。\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fcompare\u002Fv2.8.0...v2.9.0","2026-01-13T17:39:04",{"id":161,"version":162,"summary_zh":163,"released_at":164},342180,"v2.8.0","# v2.8.0 - 支持 Python 3.12-3.14 并改进兼容性\n\n此版本将 Python 版本支持更新至 3.10-3.14，解决了与 Datasets 4.0.0+ 的兼容性问题，放宽了 NumPy 版本约束以提高灵活性，并简化了文档。此次更新确保 cleanlab 能够在当前及未来的 Python 版本上正常运行，同时为受支持的版本保持向后兼容性。\n\n## 变更内容\n* @jwmueller 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1263 中更新了许可证\n* @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1268 中移除了已停止维护的 Python 3.8 和 3.9 版本\n* @ulya-tkch 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1276 中将包的支持范围扩展到 Python 3.10-3.14\n* @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1267 中处理了 Datasets 4.0.0+ 引入的新列类型\n* @maxbuchan 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1250 中移除了 agility chat\n* @maxbuchan 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1257 中更新了 README.md 中指向 Cleanlab 博客的链接\n* @jwmueller 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1265 和 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1262 中更新了文档，移除了“简易模式”相关章节，并简化了 README\n* @jwmueller 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1266 中缩短了机器学习优化教程\n* @ulya-tkch 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1278 中修复了 2.8.0 版本的文档\n* @jwmueller 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1280 中更新了文档首页，移除了我们不再希望展示的引导内容\n* @jwmueller 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1264 和 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1281 中缩短了常见问题解答，并更新了 FAQ 结尾处的支持行动号召\n\n**完整变更日志**: 
https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fcompare\u002Fv2.7.1...v2.8.0","2026-01-08T18:31:07",{"id":166,"version":167,"summary_zh":168,"released_at":169},342181,"v2.7.1","从 v2.7.0 升级时，此版本是**无破坏性**的，主要集中在文档和测试方面的改进。最值得关注的更新是：\n\n- **新的标识符列问题管理器**——可检测可能影响模型的顺序数值列。此功能目前处于预览阶段，需进行额外配置才能在 Datalab 中使用。\n\n\n**其他更新：**\n- 📖 **文档与 README**：清晰度得到提升。\n- 🛠 **测试套件**：稳定性与一致性进一步增强。\n\n## 变更内容\n* 添加了用于检测标识符列的问题管理器，由 @MaxJoas 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1120 中实现。\n* 修订了 Datalab 教程中的非 IID 部分，先展示整个数据集的得分，再呈现每个样本的详细信息，由 @gordon-lim 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1221 中完成。\n* 修复了与 numpy2 的兼容性问题，由 @GaetanLepage 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1224 中解决。\n* 将 CI 环境更新为 macOS 13，而非 12，由 @jwmueller 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1219 中完成。\n* 测试改进，分别由 @gordon-lim 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1218 中、@jwmueller 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1220 中完成。\n* 文档及文档构建系统的通用更新，由 @misteroh 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1212、https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1213、https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1216 中完成；@maxbuchan 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1226、https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1227 中完成；@elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1208、https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1231 中完成。\n\n## 新贡献者\n* @MaxJoas 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1120 中完成了首次贡献。\n* @misteroh 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1212 中完成了首次贡献。\n* @GaetanLepage 在 
https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1224 中完成了首次贡献。\n* @maxbuchan 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1226 中完成了首次贡献。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fcompare\u002Fv2.7.0...v2.7.1","2025-02-27T15:33:51",{"id":171,"version":172,"summary_zh":173,"released_at":174},342182,"v2.7.0","本次发布引入了多项新功能和改进，旨在帮助用户检测复杂数据集中的问题，并提升机器学习模型的鲁棒性。一如既往地，我们保持向后兼容性，因此从 v2.6.6 升级至此版本不会破坏现有代码。本版本继续支持 Python 3.8 至 3.11，但对 Python 3.8 的支持将在未来的次要版本中移除。\n\n\n## Datalab 中新增虚假相关性检测功能\n\n此次发布后，`Datalab` 默认会检测图像数据集中的**虚假相关性**，从而帮助用户识别可能导致过拟合或降低模型泛化能力的潜在误导性模式。\n\n所谓虚假相关性，是指模型在数据中捕捉到的并非真正有意义、而是偶然出现的模式。例如，模型可能会错误地将背景颜色与某一特定标签关联起来，从而导致在新数据上的泛化性能不佳。通过识别这些相关性，可以有效减少模型学习无关或误导性特征的风险，进而构建更加可靠的模型。\n\n在图像数据集中检测虚假相关性非常简单：\n\n```python\nfrom cleanlab import Datalab\n\nlab = Datalab(data=image_dataset, label_name=\"label_column\", image_key=\"image_column\")\n\nlab.find_issues()\n\nlab.report()\n```\n\n您可以在我们的文档中找到更详细的[查找虚假相关性的流程](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fdatalab\u002Fworkflows.html#Identify-Spurious-Correlations-in-Image-Datasets)。\n\n\n这一新的问题类型旨在为用户提供对数据的深入洞察，从而支持更稳健的模型开发。\n\n\n## 新教程：通过训练集和测试集的数据清洗提升机器学习性能\n\n我们推出了一篇新教程，演示如何利用 `Datalab` 对训练数据和测试数据进行精细化处理。这种方法有助于确保在噪声数据集上也能可靠地进行模型训练和评估。\n\n您可以在我们的文档中找到该教程：[通过训练集与测试集划分进行数据清洗以提升机器学习性能](https:\u002F\u002Fdocs.cleanlab.ai\u002Fstable\u002Ftutorials\u002Fimproving_ml_performance.html)。\n\n\n## 其他主要改进\n\n- **优化内部函数**：对多个内部函数进行了优化，包括 `clip_noise_rates`、`remove_noise_from_class` 和 `clip_values` 等，从而提升了 cleanlab 的整体效率。\n- **改进表现欠佳组检测**：针对所有表现欠佳的子集，优化了评分机制，能够更准确地识别有问题的数据子集。\n\n如果您有新功能的建议或发现了任何问题，欢迎在我们的 GitHub 仓库中提交 Issue 或 Pull Request！\n\n\n## 更改日志\n\n本版本的重要变更包括：\n\n* 增加虚假相关性功能，由 @allincowell 在 
https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1140、https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1171、https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1181、https:\u002F\u002Fgithub","2024-09-26T16:45:26",{"id":176,"version":177,"summary_zh":178,"released_at":179},342183,"v2.6.6","## 变更内容\n\n* 由 @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1100 中、@jwmueller 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1136 中对问题类型指南进行了改进；\n* 由 @gogetron 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1094 中改进了 token_classification\u002Fsummary.py 中的文档字符串；\n* 由 @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1135 中更新了用于决定是否省略 underperforming_group_check 的字典；\n* 由 @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1125、https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1137 和 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1138 中添加了包含各种 Datalab 工作流的笔记本；\n* 由 @jwmueller 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1134 中、@elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1154 中更新了 Datalab 报告文本；\n* 由 @jwmueller 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1139 中、@elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1152 中更新了常见问题解答部分；\n* 由 @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1144 中在 CI 中固定了 fasttext 版本；\n* 由 @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1146 中改进了测试环境的搭建；\n* 由 @jwmueller 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1148 中更新了已过时的快速入门链接；\n* 由 @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1142 中更新了 KNN 的 Shapley 分数计算；\n* 由 @elisno 在 
https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1155 和 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1163 中重构了问题管理器中 KNN 图的处理及异常值检测逻辑。\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fcompare\u002Fv2.6.5...v2.6.6","2024-06-25T23:10:38",{"id":181,"version":182,"summary_zh":183,"released_at":184},342184,"v2.6.5","## 变更内容\n* @allincowell 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1118 中，为 Datalab 快速入门教程末尾添加了端到端测试。\n* @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1117、https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1119 和 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1129 中，将构建和修正 k-近邻图的现有功能集中到一个单独的模块中。\n* @gogetron 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1077 中，优化了 multiannotator.py 的性能。\n* @gogetron 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1073 中，针对存在缺失类别的场景优化了 value_counts 函数的性能。\n* @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1123 中，提升了 `CleanLearning` 中设置置信联合分布时的测试覆盖率。\n* @gogetron 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1096 中，将空值检查从 np.isnan 改为 pd.isna。\n* @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1126 中，更新了目标检测教程中的 pip 安装说明。\n* @gogetron 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1099 中，细化了对 `underperforming_group` 问题类型的处理。\n* @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1124 中，通过移除 LogisticRegression 中已弃用的 `multi_class` 参数，提高了与 sklearn 1.5 的兼容性。\n* @nelsonauner 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1128 中，在表格教程中动态显示完全重复的数据集。\n\n## 新贡献者\n* @allincowell 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1118 中完成了首次贡献。\n* @nelsonauner 在 
https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1128 中完成了首次贡献。\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fcompare\u002Fv2.6.4...v2.6.5","2024-05-24T23:38:29",{"id":186,"version":187,"summary_zh":188,"released_at":189},342185,"v2.6.4","## 变更内容\n\n* @gogetron 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1064、https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1067、https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1079、https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1087、https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1095、https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1106、https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1107 中进行了多项性能优化和测试改进。\n* @mturk24 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1066 中将文本分类和表格分类教程重构为 CleanLear…\n* @coding-famer 提供了面向用户的 cleanlab.datavaluation 模块，相关 PR 为 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1050。\n* @coding-famer 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1085 中修复了 datalab 问题类型的拼写错误。\n* @mturk24 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1084 中，以及 @jwmueller 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1088 中，为调用 plt.show() 的函数添加了 kwargs 参数。\n* @jwmueller 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1089、https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1090 和 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1091 中更新了教程。\n* @desboisGIT 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1101 中，以及 @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1086 中，细化了类型提示。\n* @mturk24 在 
https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1102 中更新了非独立同分布问题的 datalab 问题类型描述。\n* @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1108 中移除了图像教程中的 unsqueeze 调用。\n* @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1110 中因与 Python 3.8 和 3.9 不兼容，暂时将 CI 环境回退到 macOS 12。\n* @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1113 中修复了欧几里得距离度量的数值不稳定性。\n* @jwmueller 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1114 中，以及 @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1116 中，避免了可能导致敏感问题的除法操作。\n* @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1115 中完成了所有相同数据集的测试。\n\n## 新贡献者\n* @gogetron 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1064 中做出了首次贡献。\n* @desboisGIT 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1101 中做出了首次贡献。\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fcompare\u002Fv2.6.3...v2.6.4","2024-05-07T18:23:22",{"id":191,"version":192,"summary_zh":193,"released_at":194},342186,"v2.6.3","从 v2.6.2 升级时，此版本不会破坏现有功能。\n\n## 变更内容\n\n* @sanjanag 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1048 中更新了 `image_key` 的文档\n* @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1056 中优化了打分机制，并提升了包含完全相同样本的数据集的稳定性\n* @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1057 中向文档添加了关于 TensorFlow 兼容性的警告信息\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fcompare\u002Fv2.6.2...v2.6.3","2024-03-19T22:08:25",{"id":196,"version":197,"summary_zh":198,"released_at":199},342187,"v2.6.2","从 v2.6.1 升级时，此版本不会造成破坏性变更。\n\n## 变更内容\n* 在空值检查中，将 DataFrame 的特征转换为 NumPy 数组，由 @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1045 中实现。\n\n\n**完整变更日志**: 
https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fcompare\u002Fv2.6.1...v2.6.2","2024-03-08T16:18:14",{"id":201,"version":202,"summary_zh":203,"released_at":204},342188,"v2.6.1","从 v2.6.0 升级时，本次发布不会破坏现有功能。一些值得关注的更新包括：\n\n1. `cleanlab.regression` 模块中的标签质量分数得到了改进，使其更易于人类理解。\n   - 这仅涉及对分数进行重新缩放，以显示更易解读的分数范围，而不会影响数据点在数据集中根据这些分数的排序方式。\n2. 更好地处理了 `Datalab.get_issues()` 中的一些边缘情况。\n\n## 变更内容\n* 由 @jwmueller 在 #1030、#1031、#1039 中更新的 README；@elisno 在 #1040 中更新的 README\n* @huiwengoh 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1032 中调整了回归任务中标签质量分数的范围\n* @elisno 在 #1025、#1026、#1028 中对 get_issues 方法进行了一些修复\n* @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1023 中实现了在 Datalab 中支持将特征作为输入用于数据估值检查\n* @mturk24 在 #1029 中以及 @elisno 在 #1024、#1037 中修复并澄清了文档\n* @elisno 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1036 中进行了 CI\u002FCD 相关的更改\n\n## 新贡献者\n* @mturk24 在 https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fpull\u002F1029 中做出了首次贡献\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fcleanlab\u002Fcleanlab\u002Fcompare\u002Fv2.6.0...v2.6.1","2024-03-07T14:02:20",{"id":206,"version":207,"summary_zh":208,"released_at":209},342189,"v2.6.0","This release is non-breaking when upgrading from v2.5.0, continuing our commitment to maintaining backward compatibility while introducing new features and improvements.\r\nHowever, this release drops support for Python 3.7 while adding support for Python 3.11.\r\n\r\n## Enhancements to Datalab\r\n\r\nIn this update, Datalab, our dataset analysis platform, enhances its ability to identify various types of issues within your datasets. With this release, Datalab now detects additional types of issues by default, offering users a more comprehensive analysis. 
Specifically, it can now:\r\n\r\n- Identify `null` values in your dataset.\r\n- Detect `class_imbalance`.\r\n- Highlight an `underperforming_group`, which refers to a subset of data points where your model exhibits poorer performance compared to others.\r\n  See our [FAQ](https:\u002F\u002Fdocs.cleanlab.ai\u002Fmaster\u002Ftutorials\u002Ffaq.html#How-do-I-specify-pre-computed-data-slices\u002Fclusters-when-detecting-the-Underperforming-Group-Issue?)\r\n  for more information on how to provide pre-defined groups for this issue type.\r\n\r\nAdditionally, Datalab can now optionally:\r\n\r\n- Assess the value of data points in your dataset using KNN-Shapley scores as a measure of `data_valuation`.\r\n\r\nIf you have ideas for new features or notice any bugs, we encourage you to open an Issue or Pull Request on our GitHub repository!\r\n\r\n## Expanded Datalab Support for New ML Tasks\r\n\r\nWith cleanlab v2.6.0, Datalab extends its support to new machine-learning tasks and introduces enhancements across the board.\r\nThis release introduces the `task` parameter in Datalab's API, enabling users to specify the type of machine learning task they are working on.\r\n\r\n```python\r\nfrom cleanlab import Datalab\r\n\r\nlab = Datalab(..., task=\"regression\")\r\n```\r\n\r\nThe `task`s currently supported are:\r\n\r\n- **classification** (*default*): Includes all previously supported issue-checking capabilities based on `pred_probs`, `features`, or a `knn_graph`, and the new features introduced earlier.\r\n- **regression** (*new*):\r\n    - Run specialized label error detection algorithms on regression datasets. 
You can see this in action in our updated [regression tutorial](https:\u002F\u002Fdocs.cleanlab.ai\u002Fmaster\u002Ftutorials\u002Fregression.html#5.-Other-ways-to-find-noisy-labels-in-regression-datasets).\r\n    - Find other issues utilizing `features` or a `knn_graph`.\r\n- **multilabel** (*new*):\r\n    - Detect label errors in multilabel classification datasets using `pred_probs` exclusively. Explore the updated capabilities in our [multilabel tutorial](https:\u002F\u002Fdocs.cleanlab.ai\u002Fmaster\u002Ftutorials\u002Fmultilabel_classification.html).\r\n    - Find various other types of issues based on `features` or a `knn_graph`.\r\n\r\n## Improved Object Detection Dataset Exploration\r\n\r\nNew functions have been introduced to enhance the exploration of object detection datasets, simplifying data comprehension and issue detection.\r\nLearn how to leverage some of these functions in our [object detection tutorial](https:\u002F\u002Fdocs.cleanlab.ai\u002Fmaster\u002Ftutorials\u002Fobject_detection.html#Exploratory-data-analysis).\r\n\r\n## Other Major Improvements\r\n\r\n- Rescaled Near Duplicate and Outlier Scores:\r\n  - Note that what matters for all cleanlab issue scores is not their absolute magnitudes but rather how these scores rank the data points from most to least severe instances of the issue. But based on user feedback, we have updated the near duplicate and outlier scores to display a more human-interpretable range of values. How these scores rank data points within a dataset remains unchanged.\r\n- Consistency in counting label issues:\r\n  - `cleanlab.dataset.health_summary()` now returns the same number of issues as `cleanlab.classification.find_label_issues()` and `cleanlab.count.num_label_issues()`.\r\n- Improved handling of non-iid issues:\r\n  - The non-iid issue check in Datalab now handles `pred_probs` as input.\r\n- Better reporting in Datalab:\r\n  - Simplified `Datalab.report()` now highlights only detected issue types. 
To view all checked issue types, use `Datalab.report(show_all_issues=True)`.
- Enhanced Handling of Binary Classification Tasks:
  - Examples with predicted probabilities close to 0.5 for both classes are no longer flagged as label errors, improving behavior on binary classification tasks.
- Experimental Functionality:
  - cleanlab now offers experimental functionality for detecting label issues in **span categorization** tasks with a single class, broadening its applicability in natural language processing projects.

## New Contributors

We're thrilled to welcome new contributors to the cleanlab community! Your contributions help us improve and grow cleanlab:

* @smttsp made their first contribution in https://github.com/cleanlab/cleanlab/pull/867
* @abhijitpal1247 made their first contribution in https://github.com/cleanlab/cleanlab/pull/856
* @01PrathamS made their first contribution in https://github.com/cleanlab/cleanlab/pull/893
* @mglowacki100 made their first contribution in https://github.com/cleanl

*Released 2024-02-16.*

# v2.5.0

This release is non-breaking when upgrading from v2.4.0 (except for certain methods in `cleanlab.experimental` that have been moved, especially utility methods related to Datalab).

## New ML tasks supported

Cleanlab now supports all of the most common ML tasks!
This newest release adds dedicated support for the following types of datasets:
- **regression** (finding errors in numeric data): see `cleanlab.regression` and the "noisy labels in regression" quickstart tutorial.
- **object detection**: see `cleanlab.object_detection` and the "Object Detection" quickstart tutorial.
- **image segmentation**: see `cleanlab.segmentation` and the "Semantic Segmentation" quickstart tutorial.

Cleanlab previously already supported: multi-class classification, multi-label classification (image/document tagging), and token classification (entity recognition, sequence prediction).

If there is another ML task you'd like to see this package support, please let us know (or, even better, open a Pull Request)!

Supporting these ML tasks properly required significant research and novel algorithms developed by our scientists. We have published papers on these for transparency and scientific rigor; check out the list in the README or learn more at:
https://cleanlab.ai/research/
https://cleanlab.ai/blog/

## Improvements to Datalab

[Datalab](https://cleanlab.ai/blog/datalab/) is a general platform for detecting all sorts of common issues in real-world data, and the best place to get started for running this library on your datasets.

This release introduces major improvements and new functionalities in Datalab that include the ability to:

- Detect low-quality images in computer vision data (blurry, over/under-exposed, low-information, ...) via the integration of [CleanVision](https://cleanlab.ai/blog/cleanvision/).
- Detect label issues even without `pred_probs` from an ML model (you can instead just provide `features`).
- Flag rare classes in imbalanced classification datasets.
- Audit unlabeled datasets.

## Other major improvements

- 50x speedup in the `cleanlab.multiannotator` code for analyzing data labeled by multiple annotators.
- Out-of-distribution detection based on `pred_probs` via the [GEN algorithm](https://openaccess.thecvf.com/content/CVPR2023/papers/Liu_GEN_Pushing_the_Limits_of_Softmax-Based_Out-of-Distribution_Detection_CVPR_2023_paper.pdf), which is particularly effective for datasets with tons of classes.
- Many of the methods across the package that find label issues now support a `low_memory` option. When specified, an approximate mini-batching algorithm is used that returns results much faster and requires much less RAM.

## New Contributors

Transforming cleanlab into the first universal data-centric AI platform is a major effort and we need your help! Many easy ways to contribute are listed [on our GitHub](https://github.com/cleanlab/cleanlab/wiki#ideas-for-contributing-to-cleanlab), or you can jump into the discussions on [Slack](https://cleanlab.ai/slack).
We immensely appreciate all of the contributors who've helped build this package into what it is today, especially:

* @gordon-lim made their first contribution in https://github.com/cleanlab/cleanlab/pull/746
* @tataganesh made their first contribution in https://github.com/cleanlab/cleanlab/pull/751
* @vdlad made their first contribution in https://github.com/cleanlab/cleanlab/pull/677
* @axl1313 made their first contribution in https://github.com/cleanlab/cleanlab/pull/798
* @coding-famer made their first contribution in https://github.com/cleanlab/cleanlab/pull/800

## Change Log

* New feature: Label error detection in regression datasets by @krmayankb in https://github.com/cleanlab/cleanlab/pull/572; by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/830

* New feature: ObjectLab for detecting mislabeled images in object detection datasets by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/676, https://github.com/cleanlab/cleanlab/pull/739, https://github.com/cleanlab/cleanlab/pull/745, https://github.com/cleanlab/cleanlab/pull/770, https://github.com/cleanlab/cleanlab/pull/779, https://github.com/cleanlab/cleanlab/pull/807, https://github.com/cleanlab/cleanlab/pull/833; by @aditya1503 in https://github.com/cleanlab/cleanlab/pull/750, https://github.com/cleanlab/cleanlab/pull/804

* New feature: Label error detection in segmentation datasets by @vdlad in
https://github.com/cleanlab/cleanlab/pull/677; by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/754, https://github.com/cleanlab/cleanlab/pull/756, https://github.com/cleanlab/cleanlab/pull/759, https://github.com/cleanlab/cleanlab/pull/772; by @elisno in https://github.com/cleanlab/cleanlab/pull/775

* New feature: CleanVision to detect low-quality images by @sanjanag in https://github.com/cleanlab/cleanlab/pull/679, https://github.com/cleanlab/cleanlab/pull/797

* New image quickstart tutorial that uses Datalab by @sanjanag i

*Released 2023-09-11.*

# v2.4.0

Cleanlab has grown into a popular package used by thousands of data scientists to diagnose issues in diverse datasets and improve the data itself in order to fit more robust models. Many new methods/algorithms were added in recent months to increase the capabilities of this data-centric AI library.

## Introducing Datalab

Now we've added a unified platform called `Datalab` for you to apply many of these capabilities in a single line of code!
To audit any classification dataset for issues, first use any trained ML model to produce `pred_probs` (predicted class probabilities) and/or `feature_embeddings` (numeric vector representations of each datapoint).
Then, these few lines of code can detect many types of real-world issues in your dataset, like label errors, outliers, near duplicates, etc.:

```python
from cleanlab import Datalab

lab = Datalab(data=dataset, label_name="column_name_for_labels")
lab.find_issues(features=feature_embeddings, pred_probs=pred_probs)
lab.report()  # summarize the issues found, how severe they are, and other useful info about the dataset
```

Follow our [blog](https://cleanlab.ai/blog/) to better understand how this works internally; many articles will be published there shortly!
A detailed description of each type of issue `Datalab` can detect is provided in [this guide](https://docs.cleanlab.ai/master/cleanlab/datalab/guide/issue_type_description.html), but we recommend starting with the tutorials, which show you how easy it is to run on your own dataset.

`Datalab` can be used to do things like find label issues with string class labels (whereas the prior `find_label_issues()` method required integer class indices). But you are still free to use all of the prior cleanlab methods you're used to! `Datalab` uses these internally to detect data issues.

Our goal is for `Datalab` to be an easy way to run a comprehensive suite of cleanlab capabilities on any dataset. This is an evolving paradigm, so be aware that some `Datalab` APIs may change in subsequent package versions, as noted in the documentation.
You can easily run the issue checks in `Datalab` together with a custom issue type you define outside of cleanlab. This customizability also makes it easy to contribute new data quality algorithms into `Datalab`. Help us build the best open-source platform for data-centric AI by adding your ideas or those from recent publications!
Feel free to reach out via [Slack](https://cleanlab.ai/slack).

## Revamped Tutorials

We've updated some of our existing tutorials with more interesting datasets and ML models. For the basic tutorials on identifying label issues in classification data from various modalities (image, text, audio, tables), we have also created analogous versions that detect issues in these same datasets with `Datalab` instead (see `Datalab Tutorials`). This should help existing users quickly ramp up on `Datalab` and see how much more powerful this comprehensive data audit can be.

## Improvements for Multi-label Classification

To provide a better experience for users with multi-label classification datasets, we have explicitly separated the functionality for working with these into the `cleanlab.multilabel_classification` module. Please start there rather than specifying the `multi_label=True` flag in certain methods outside of this module, as that option will be deprecated in the future.

Particularly noteworthy are the new dataset-level issue summaries for multi-label classification datasets, available in the `cleanlab.multilabel_classification.dataset` module.

While moving methods to the `cleanlab.multilabel_classification` module, we noticed some bugs in existing methods. We removed these methods entirely (replacing them with new ones in the `cleanlab.multilabel_classification` module), so some changes may appear to be backwards incompatible, even though the original code didn't function as intended in the first place.

#### Backwards incompatible changes

Your existing code will break if you do not upgrade to the new versions of these methods (the existing cleanlab v2.3.1 code was probably producing bad results anyway, due to some bugs that have since been fixed).
Here are the changes you must make in your code for it to work with newer cleanlab versions:

1) `cleanlab.dataset.rank_classes_by_label_quality(..., multi_label=True)`
→
`cleanlab.multilabel_classification.dataset.rank_classes_by_label_quality(...)`

The `multi_label=False/True` argument will be removed from the former method in the future.

2) `cleanlab.dataset.find_overlapping_classes(..., multi_label=True)`
→
`cleanlab.multilabel_classification.dataset.common_multilabel_issues(...)`

The `multi_label=False/True` argument will be removed from the former method in the future. The returned DataFrame is slightly different; please refer to the new method's documentation.

3) `cleanlab.dataset.overall_label_health_score(...multi_label=

*Released 2023-05-13.*

# v2.3.1

This minor release primarily improves the user experience when encountering various edge-cases in:
- the `find_label_issues` method
- the `find_overlapping_issues` method
- the `cleanlab.multiannotator` module

This release is non-breaking when upgrading from v2.3.0. Two noteworthy updates in the `cleanlab.multiannotator` module include:
1. a better tie-breaking algorithm inside of `get_majority_vote_label()` to avoid diminishing the frequency of rarer classes (this only plays a role when `pred_probs` are not provided).
2. a better user experience for `get_active_learning_scores()`, supporting scoring only unlabeled data or only labeled data.
More of the arguments can now be `None`.

## What's Changed
* Readme updates by @jwmueller in https://github.com/cleanlab/cleanlab/pull/645, https://github.com/cleanlab/cleanlab/pull/650, https://github.com/cleanlab/cleanlab/pull/656
* Describe ActiveLab in the documentation by @jwmueller in https://github.com/cleanlab/cleanlab/pull/648
* Added clipping to address issue #639 by @ulya-tkch in https://github.com/cleanlab/cleanlab/pull/647
* Fix for not specifying labels in find_overlapping_issues by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/652
* Bug fixes + improvements to multiannotator module by @huiwengoh in https://github.com/cleanlab/cleanlab/pull/654
* FAQ question/answer on handling label errors in train vs test data by @jwmueller in https://github.com/cleanlab/cleanlab/pull/655

**Full Changelog**: https://github.com/cleanlab/cleanlab/compare/v2.3.0...v2.3.1

*Released 2023-03-28.*

# v2.3.0

Cleanlab was originally open-sourced as code to accompany a [research paper](https://arxiv.org/abs/1911.00068) on label errors in classification tasks, to prove to skeptical researchers that it's possible to utilize ML models to discover mislabeled data and then train even better versions of these same models. We've been hard at work since then, turning this into an industry-grade library that helps you handle label errors in many ML tasks: entity recognition, image/document tagging, data labeled by multiple annotators, etc.
While label errors are critical to deal with in real-world ML applications, data-centric AI involves utilizing trained ML models to improve the data in other ways as well.

With the newest release, cleanlab v2.3 can now automatically:
- [find mislabeled data + train robust models](https://docs.cleanlab.ai/stable/tutorials/indepth_overview.html)
- [detect outliers and out-of-distribution data](https://docs.cleanlab.ai/stable/tutorials/outliers.html)
- [estimate consensus + annotator-quality for multi-annotator datasets](https://docs.cleanlab.ai/stable/tutorials/multiannotator.html)
- [suggest which data is best to (re)label next](https://github.com/cleanlab/examples/blob/master/active_learning_multiannotator/active_learning.ipynb)

As always, the cleanlab library works with almost any ML model (no matter how it was trained) and type of data (image, text, tabular, audio, etc.). We have user-friendly [5min tutorials](https://docs.cleanlab.ai/) to get started with any of the above objectives and easily improve your data!

We're aiming for this library to provide all the key functionalities needed to practice data-centric AI. Much of this involves inventing new algorithms for data quality, and we transparently publish all of these algorithms in [scientific papers](https://cleanlab.ai/research/). Read these to understand how particular cleanlab methods work under the hood and see extensive benchmarks of how effective they are on real data.

# Highlights of what's new in 2.3.0:

We have added new functionality for active learning and for easily making Keras models compatible with sklearn. Label issues can now be estimated 10x faster and with much less memory, using new methods added to help users with massive datasets.
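The core idea behind such low-memory estimation is mini-batching: a label-quality score only needs one chunk of `pred_probs` in memory at a time. The sketch below illustrates this with a simple self-confidence score in NumPy; it is an illustrative simplification, not cleanlab's actual implementation.

```python
import numpy as np

def batched_self_confidence(labels, pred_probs, batch_size=1024):
    """Illustrative mini-batched label-quality scoring (not cleanlab's implementation).

    The self-confidence score pred_probs[i, labels[i]] is computed one chunk at a
    time, so peak memory scales with batch_size rather than the dataset size
    (imagine pred_probs being read lazily, e.g. from a memory-mapped array).
    """
    labels = np.asarray(labels)
    scores = np.empty(len(labels))
    for start in range(0, len(labels), batch_size):
        stop = start + batch_size
        chunk = np.asarray(pred_probs[start:stop])  # only this chunk is in memory
        scores[start:stop] = chunk[np.arange(len(chunk)), labels[start:stop]]
    return scores

pred_probs = np.array([[0.9, 0.1], [0.4, 0.6], [0.2, 0.8]])
labels = [0, 0, 1]
scores = batched_self_confidence(labels, pred_probs, batch_size=2)
# Low scores (here the 0.4 for example 1) suggest possible label issues.
```

The real methods are more sophisticated than this single score, but the memory profile is the point: streaming over `pred_probs` keeps RAM usage bounded regardless of dataset size.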
This release is non-breaking when upgrading from v2.2.0 (except for certain methods in `cleanlab.experimental` that have been moved).

## Active Learning with ActiveLab

For settings where you want to label more data to get better ML, active learning helps you train the best ML model with the least data labeling. Unfortunately, data annotators often give imperfect labels, in which case we might sometimes prefer to have another annotator check an already-labeled example rather than labeling an entirely new example. [ActiveLab](https://cleanlab.ai/blog/active-learning/) is a new algorithm invented by our team that automatically answers the question: **which new data should I label, or which of my current labels should be checked again?** ActiveLab is highly practical — it runs quickly and works with: any type of ML model, batch settings where many examples are (re)labeled before model retraining, and settings where multiple annotators can label an example (or just one annotator).

Here's all the code needed to determine active learning scores for examples in your unlabeled pool (no annotations yet) and labeled pool (at least one annotation already collected):

```python
from cleanlab.multiannotator import get_active_learning_scores

scores_labeled_pool, scores_unlabeled_pool = get_active_learning_scores(
    multiannotator_labels, pred_probs, pred_probs_unlabeled
)
```

The batch of examples with the lowest scores are the most informative to collect an additional label for (scores between the labeled and unlabeled pools are directly comparable). You can either have a new annotator label the batch of examples with the lowest scores, or distribute them amongst your previous annotators as is most convenient.
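Picking that batch from the returned scores is then a one-liner; the score values below are hypothetical stand-ins for what `get_active_learning_scores` might return.

```python
import numpy as np

# Hypothetical score arrays, standing in for the output of
# get_active_learning_scores(); scores from both pools are directly comparable.
scores_labeled_pool = np.array([0.9, 0.2, 0.7])
scores_unlabeled_pool = np.array([0.5, 0.1])

all_scores = np.concatenate([scores_labeled_pool, scores_unlabeled_pool])

batch_size = 2
batch = np.argsort(all_scores)[:batch_size]  # lowest scores = most informative
# Indices >= len(scores_labeled_pool) fall in the unlabeled pool (label these);
# smaller indices are already-labeled examples (collect another annotation).
```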
ActiveLab is also effective for standard active learning, where you collect at most one label per example (no re-labeling), as well as *active label cleaning* (with no unlabeled pool), where you only want to re-label examples to ensure 100% correct consensus labels (with the least amount of re-labeling).

Get started running ActiveLab with our [tutorial notebook](https://github.com/cleanlab/examples/blob/master/active_learning_multiannotator/active_learning.ipynb) from our repo, which has many other [examples](https://github.com/cleanlab/examples/).

## KerasWrapper

We've introduced [one-line wrappers](https://docs.cleanlab.ai/master/cleanlab/models/keras.html) for TensorFlow/Keras models that enable you to use them within scikit-learn workflows with features like `Pipeline`, `GridSearch`, and more. Just change one line of code to make your existing TensorFlow/Keras model compatible with scikit-learn's rich ecosystem! All you

*Released 2023-03-01.*

# v2.2.0

You asked, we listened! cleanlab v2.2.0 addresses two of the biggest pain points we often heard about from our users:

1. Lack of clarity around how cleanlab works for multi-label datasets and how best to utilize it.
2. Not being usable for datasets with omitted classes (e.g. rare classes dropped in a data split).

This release is non-breaking when upgrading from v2.1.0, but you will now get more accurate results (in all the datasets we tested) when finding label issues in multi-label classification datasets.
This release also adds [new, satisfyingly accurate algorithms](https://cleanlab.ai/blog/multilabel/) for finding label errors in multi-label data, improving multi-label classification tasks like text/image tagging.

# Highlights of what's new in 2.2.0:

## Multi-label support for applications like image/document/text tagging

The newest version of cleanlab features a complete overhaul of cleanlab's multi-label classification functionality:

- We invented new algorithms for detecting label errors in multi-label datasets that are significantly more effective. These methods are formally described and extensively benchmarked in our [research paper](https://arxiv.org/abs/2211.13895).
- We added the `cleanlab.multilabel_classification` module for label quality scoring.
- We now offer an easy-to-follow [quickstart tutorial](https://docs.cleanlab.ai/stable/tutorials/) for learning how to apply cleanlab to multi-label datasets.
- We've created [example notebooks](https://github.com/cleanlab/examples/tree/master/multilabel_classification) on using cleanlab to clean up image tagging datasets, and on training a state-of-the-art PyTorch neural network for multi-label classification with any image dataset.
- All of this multi-label functionality is now robustly tested via a comprehensive suite of unit tests to ensure it remains performant.

## cleanlab now works when your labels have some classes missing relative to your predicted probabilities

The package now works for datasets in which some classes happen to not be present (but are present, say, in the `pred_probs` output by a model).
This is useful when you:

- Want to use a pretrained model that was fit with additional classes
- Have rare classes and happen to split the data in an unlucky way
- Are doing active learning or other dynamic modeling with data that are iteratively changing
- Are analyzing multi-annotator datasets with `cleanlab.multiannotator` where some annotators occasionally select a really rare class

## Other major improvements

(in addition to too many bugfixes to name):

- Accuracy improvements to the algorithm used to estimate the number of label errors in a dataset via `count.num_label_issues()`. — @ulya-tkch
- Introduction of the flake8 code linter to ensure the highest standards for our code. — @ilnarkz, @mohitsaxenaknoldus
- More comprehensive mypy type annotations for cleanlab functions to make our code safer and more understandable. — @elisno, @ChinoCodeDemon, @anishathalye, @jwmueller, @huiwengoh, @ulya-tkch

Special thanks to Po-He Tseng for helping with early tests of our improved multi-label algorithms and the research behind developing them.

## Workflows of interest in cleanlab v2.2:

Finding label issues in multi-label classification is done using the same code and inputs as before (and the same object is returned as before):

```python
from cleanlab.filter import find_label_issues

ranked_label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
    multi_label=True,
    return_indices_ranked_by="self_confidence",
)
```

For a 3-class multi-label dataset with 4 examples, we might have, say:

```python
labels = [[0], [0, 1], [0, 2], [1]]

pred_probs = np.array(
    [[0.9, 0.1, 0.1],
     [0.9, 0.1, 0.8],
     [0.9, 0.1, 0.6],
     [0.2, 0.8, 0.3]]
)
```

The following code (in which class 1 is missing from the dataset) did not previously work, but now runs without problem in cleanlab v2.2.0:

```python
from cleanlab.filter import find_label_issues
import numpy as np

labels = [0, 0, 2, 0, 2]
pred_probs = np.array(
    [[0.8, 0.1, 0.1],
     [0.7, 0.1, 0.2],
     [0.3, 0.1, 0.6],
     [0.5, 0.2, 0.3],
     [0.1, 0.1, 0.8]]
)

label_issues = find_label_issues(
    labels=labels,
    pred_probs=pred_probs,
)
```

## Looking forward

The next major release of this package will introduce a paradigm shift in the way people check their datasets. Today this involves significant manual labor, but software should be able to help! Our research has developed algorithms that can automatically detect many types of common issues that plague real-world ML datasets. The next version of cleanlab will offer an easy-to-use line of code that runs all of our appropriate algorithms to help ensure a given dataset is issue-free and well-suited for supervised learning.

Transforming cleanlab into the first universal data-centric AI platform is a major effort and we need your help!

*Released 2022-11-28.*

# v2.1.0

v2.1.0 begins extending this library beyond **standard classification** tasks, taking initial steps toward the first tool that can detect label errors in data from *any* **Supervised Learning** task (leveraging *any* model trained for that task).
This release is **non-breaking** when upgrading from v2.0.0.

## Highlights of what's new in 2.1.0:

Major new functionalities:

- **CROWDLAB algorithms** for analysis of data labeled by multiple annotators — @huiwengoh, @ulya-tkch, @jwmueller
    - Accurately infer the best consensus label for each example
    - Estimate the quality of each consensus label (how likely it is to be correct)
    - Estimate the overall quality of each annotator (how trustworthy their suggested labels are)
- **Out-of-Distribution Detection** based on either:
    - feature values/embeddings — @ulya-tkch, @jwmueller, @JohnsonKuan
    - predicted class probabilities — @ulya-tkch
- **Label error detection for Token Classification** tasks (NLP / text data) — @ericwang1997, @elisno
- **CleanLearning can now:**
    - Run on non-array data types including pandas DataFrames, pytorch/tensorflow Dataset objects, and many other data formats. — @jwmueller
    - Allow the base model's `fit()` to utilize validation data in each fold during cross-validation (e.g. for early-stopping or hyperparameter-optimization purposes). — @huiwengoh
    - Train with custom sample weights for datapoints. — @rushic24, @jwmueller
    - Utilize any Keras model (supporting both the sequential and functional APIs) via cleanlab's `KerasWrapperModel`, which makes these models compatible with sklearn and tensorflow Datasets.
— @huiwengoh, @jwmueller

Major improvements (in addition to too many bugfixes to name):

- Reduced dependencies: `scipy` is no longer needed — @anishathalye
- Clearer error/warning messages throughout the package when data/inputs are strangely formatted — @cgnorthcutt, @jwmueller, @huiwengoh
- FAQ section in the tutorials with advice for commonly encountered issues — @huiwengoh, @ulya-tkch, @jwmueller, @cgnorthcutt
- Many additional tutorial and example notebooks at [docs.cleanlab.ai](http://docs.cleanlab.ai/) and [https://github.com/cleanlab/examples](https://github.com/cleanlab/examples) — @ulya-tkch, @huiwengoh, @jwmueller, @ericwang1997
- Static type annotations to ensure robust code — @anishathalye, @elisno

# Examples of new workflows available in 2.1:

## Out of Distribution and Outlier Detection

1. Detect **out of distribution** examples in a dataset based on its numeric **feature embeddings**
```python
from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()

# To get outlier scores for train_data using feature matrix train_feature_embeddings
ood_train_feature_scores = ood.fit_score(features=train_feature_embeddings)

# To get outlier scores for additional test_data using feature matrix test_feature_embeddings
ood_test_feature_scores = ood.score(features=test_feature_embeddings)
```

2.
Detect **out of distribution** examples in a dataset based on **predicted class probabilities** from a trained classifier
```python
from cleanlab.outlier import OutOfDistribution

ood = OutOfDistribution()

# To get outlier scores for train_data using predicted class probabilities (from a trained classifier) and given class labels
ood_train_predictions_scores = ood.fit_score(pred_probs=train_pred_probs, labels=labels)

# To get outlier scores for additional test_data using predicted class probabilities
ood_test_predictions_scores = ood.score(pred_probs=test_pred_probs)
```

## Multi-annotator -- support for data labeled by multiple annotators

3. For data **labeled by multiple annotators** (stored as a matrix `multiannotator_labels` whose rows correspond to examples and columns to each annotator's chosen labels), cleanlab v2.1 can: find improved consensus labels, score their quality, and assess annotators, all by leveraging predicted class probabilities `pred_probs` from *any* trained classifier
```python
from cleanlab.multiannotator import get_label_quality_multiannotator

get_label_quality_multiannotator(multiannotator_labels, pred_probs)
```

## Support for Token Classification tasks

4. cleanlab v2.1 can now find label issues in **token classification** (text) data, where each word in a sentence is labeled with one of K classes (e.g. entity recognition). This relies on three inputs:

- `tokens`: List of tokenized sentences whose `i`th element is a list of strings corresponding to the tokens of the `i`th sentence in the dataset.
Example: `[..., ["I", "love", "cleanlab"], ...]`
- `labels`: List whose `i`th element is a list of integers corresponding to the class labels of each token in the `i`th sentence.
Example: `[..., [0, 0, 1], ...]`
- `pred_probs`: List whose `i`th element is a np.ndarray of shape `(N_i, K)` corresponding to the predicted class probabilities for each token in the `i`th sentence (assuming this sentence contains `N_i` tokens and the dataset has `K` classes).

*Released 2022-09-17.*

# If you liked cleanlab v1.0.1, v2.0.0 will blow your mind! 💥🧠

cleanlab 2.0 adds powerful new workflows and algorithms for data-centric AI, dataset curation, auto-fixing label issues in data, learning with noisy labels, and more. Nearly every module, method, parameter, and docstring has been touched by this release.

If you're coming from 1.0, here's a [migration guide](https://docs.cleanlab.ai/master/migrating/migrate_v2.html).

### A few highlights of new functionalities in cleanlab 2.0:

1. Rank every data point by label quality
2. Find label issues in any dataset
3. Train any classifier on any dataset with label issues
4. Find overlapping classes to merge and/or delete at the dataset level
5. Yield an overall dataset health score

For an in-depth overview of what cleanlab 2.0 can do, check out [this tutorial](https://docs.cleanlab.ai/master/tutorials/indepth_overview.html).

### To help you get started with 2.0, we've added:

* [extensive documentation](https://docs.cleanlab.ai/)
* [tutorials](https://docs.cleanlab.ai/master/tutorials/image.html)
* [updated examples](https://github.com/cleanlab/examples)
* [powerful new workflows](https://docs.cleanlab.ai/master/migrating/migrate_v2.html)
* [blogs](https://cleanlab.ai/blog/)

# Change Log

This list is non-exhaustive!
Assume every aspect of the API has changed.

## Module name changes or moves:
- `classification.LearningWithNoisyLabels` class --> `classification.CleanLearning` class
- `pruning.py` --> `filter.py`
- `latent_estimation.py` --> `count.py`
- `cifar_cnn.py` --> `experimental/cifar_cnn.py`
- `coteaching.py` --> `experimental/coteaching.py`
- `fasttext.py` --> `experimental/fasttext.py`
- `mnist_pytorch.py` --> `experimental/mnist_pytorch.py`
- `noise_generation.py` --> `benchmarking/noise_generation.py`
- `util.py` --> `internal/util.py`
- `latent_algebra.py` --> `internal/latent_algebra.py`

## Module deletions:
- removed `polyplex.py`
- removed `models/` (moved its content to `experimental/`)

## New modules created:
- `rank.py`
  - moved all ranking and ordering functions from `pruning.py`/`filter.py` here
- `dataset.py`
  - brand new module supporting methods for dealing with dataset-level issues
- `benchmarking.py`
  - Future benchmarking modules go here.
Moved `noise_generation.py` here.

## Method name changes:
- `pruning.get_noise_indices()` --> `filter.find_label_issues()`
- `count.num_label_errors()` --> `count.num_label_issues()`

## Methods added:
- `rank.py` adds
  - two ranking functions to rank data by label quality for the entire dataset (not just examples with label issues):
  - `get_self_confidence_for_each_label()`
  - `get_normalized_margin_for_each_label()`
- `filter.py` adds
  - two more methods to `filter.find_label_issues()` (select a method via the `filter_by` parameter):
    - `confident_learning`, which has been shown to work very well and may become the default in the future, and
    - `predicted_neq_given`, which is useful for benchmarking a simple baseline approach, but underperforms relative to the other `filter_by` methods
- `classification.py` adds
  - `CleanLearning.get_label_issues()`
    - for canonical one-line usage: `CleanLearning().fit(X, y).get_label_issues()`
    - no need to compute predicted probabilities in advance
  - `CleanLearning.find_label_issues()`
    - returns a DataFrame with label issues (instead of just a mask)

## Naming conventions changed in method names, comments, parameters, etc.
- `s` --> `labels`
- `psx` --> `pred_probs`
- `label_errors` --> `label_issues`
- `noise_mask` --> `label_issues_mask`
- `label_errors_bool` --> `label_issues_mask`
- `prune_method` --> `filter_by`
- `prob_given_label` --> `self_confidence`
- `pruning` --> `filtering`

## Parameter re-ordering:
- re-ordered (`labels`, `pred_probs`) parameters to be consistent (in that order) in all methods.
- re-ordered parameters (e.g.
`frac_noise`) in `filter.find_label_issues()`

## Parameter changes:
- in `order_label_issues()`
  - param: `sorted_index_method` --> `rank_by`
- in `find_label_issues()`
  - param: `sorted_index_method` --> `return_indices_ranked_by`
  - param: `prune_method` --> `filter_by`

## Global variables changed:
- `filter.py`
  - only require 1 example to be left in each class
  - `MIN_NUM_PER_CLASS = 5` --> `MIN_NUM_PER_CLASS = 1`
  - enables cleanlab to work on toy-sized datasets

## Dependencies added:
- `pandas=1.0.0`

## Way-too-detailed Change Log
* Convert README to markdown for PyPI release by @cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/126
* Add EditorConfig by @anishathalye in https://github.com/cleanlab/cleanlab/pull/129
* Major API change: introducing Cleanlab 2.0 by @cgnorthcutt in https://github.com/cleanlab/cleanlab/pull/128
* Standardize code style to Black by @anishathalye in https://github.com/cleanlab/cleanlab/pull/107
* Redirect RTD site to docs.cleanlab.ai

*Released 2022-04-20*

---

# cleanlab v1.0.1

* **The primary purpose of this release** is to preserve the functionality of cleanlab (all versions up to 1.0.1) in the new docs prior to the launch of cleanlab 2.0, which significantly changes the API.
* Launched in preparation for Cleanlab 2.0.
* Mostly superficial.

## For users (+ sometimes developers):
- Releases the new Sphinx docs for the cleanlab 1.0 documentation (in preparation for CL 2.0)
- Several superficial bug fixes (reduced error printing, fixed broken URLs, clarified links)
- Extensive docs/README updates
- Added support for Conda installation
- Moved to the AGPL-3 license
- Added tutorials and a learning section for Cleanlab

## For developers:
- Moved to GitHub Actions CI
- Significantly shrunk the clone size from 100MB+ to a few MB

*Released 2022-03-03*

---

# cleanlab v1.0

The cleanlab community has grown over the years. Today, we are excited to release cleanlab 1.0 as the standard package for machine learning with noisy labels and finding errors in datasets.

> If you're coming from the research side (e.g. the confident learning or label errors papers), use this version of cleanlab.

## cleanlab 1.0

cleanlab 1.0 supports the most common versions of Python (2, 2.7, 3.4, 3.5, 3.6, 3.7, 3.8) and operating systems (Linux, macOS, Windows). It works with any deep learning or machine learning library by operating on model outputs, regardless of where they come from. cleanlab also now has built-in support for new research from scientists outside of our group at MIT (e.g. Co-Teaching).

## More details about the new features of cleanlab 1.0:

- Added Amazon Reviews NLP to cleanlab/examples
- cleanlab now supports Python 2, 2.7, 3.4, 3.5, 3.6, 3.7, and 3.8
- Users have used cleanlab with Python 3.9 (use at your own risk!)
- Added more testing.
All tests pass on Windows/Linux/macOS.
- Updated to the GNU GPL-3+ license.
- Added documentation: https://cleanlab.readthedocs.io/
- The cleanlab "confident learning" paper is published in the Journal of AI Research: https://jair.org/index.php/jair/article/view/12125
- Added funding, community, and contributing guidelines
- Fixed several errors in cleanlab/examples
- cleanlab now supports Windows, macOS, Linux, and unix systems
- Many examples added to the README and docs
- cleanlab now natively supports Co-Teaching for learning with noisy labels (requires Python 3 and PyTorch 1.4)
- cleanlab built-in support for handwritten datasets (besides MNIST)
- cleanlab built-in support for the CIFAR dataset
- Multiprocessing fixed for Windows systems
- Adhered all core modules to PEP-8 styling
- cleanlab is now installable via conda (besides pip)
- Extensive benchmarking of cleanlab methods published
- cleanlab now lists planned future features in cleanlab/version.py
- Added confidentlearning-reproduce as a separate repo to reproduce state-of-the-art results.

*Released 2021-04-18*
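To make the two ranking functions listed under `rank.py` in the v2.0 change log above concrete, here is a minimal pure-Python sketch of what those scores compute: self-confidence is the model's predicted probability of the given label, and normalized margin subtracts the largest probability among the other classes. This is an illustration of the scoring definitions, not cleanlab's actual implementation, and the toy `labels`/`pred_probs` values are made up for the example.

```python
def self_confidence(labels, pred_probs):
    """Model's predicted probability of the given label, per example.

    labels: list of int class labels, one per example.
    pred_probs: list of per-example probability vectors (one entry per class).
    """
    return [probs[label] for label, probs in zip(labels, pred_probs)]


def normalized_margin(labels, pred_probs):
    """Self-confidence minus the largest probability of any *other* class."""
    scores = []
    for label, probs in zip(labels, pred_probs):
        other = max(p for k, p in enumerate(probs) if k != label)
        scores.append(probs[label] - other)
    return scores


# Toy dataset: 3 examples, 3 classes; example 1 looks mislabeled,
# since the model is confident in class 0 but the given label is 2.
labels = [0, 2, 1]
pred_probs = [
    [0.9, 0.05, 0.05],  # confidently class 0, labeled 0 -> high quality
    [0.8, 0.1, 0.1],    # confidently class 0, labeled 2 -> likely issue
    [0.2, 0.6, 0.2],    # class 1, labeled 1 -> fine
]

print(self_confidence(labels, pred_probs))
print(normalized_margin(labels, pred_probs))
```

Either score ranks the whole dataset, not just flagged examples: sorting ascending puts the most suspicious labels first (here, example 1 gets the lowest score under both definitions).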