[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-huggingface--datasets":3,"tool-huggingface--datasets":62},[4,18,26,36,46,54],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",159636,2,"2026-04-17T23:33:34",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":42,"last_commit_at":43,"category_tags":44,"status":17},8272,"opencode","anomalyco\u002Fopencode","OpenCode 是一款开源的 AI 编程助手（Coding Agent），旨在像一位智能搭档一样融入您的开发流程。它不仅仅是一个代码补全插件，而是一个能够理解项目上下文、自主规划任务并执行复杂编码操作的智能体。无论是生成全新功能、重构现有代码，还是排查难以定位的 Bug，OpenCode 都能通过自然语言交互高效完成，显著减少开发者在重复性劳动和上下文切换上的时间消耗。\n\n这款工具专为软件开发者、工程师及技术研究人员设计，特别适合希望利用大模型能力来提升编码效率、加速原型开发或处理遗留代码维护的专业人群。其核心亮点在于完全开源的架构，这意味着用户可以审查代码逻辑、自定义行为策略，甚至私有化部署以保障数据安全，彻底打破了传统闭源 AI 助手的“黑盒”限制。\n\n在技术体验上，OpenCode 提供了灵活的终端界面（Terminal UI）和正在测试中的桌面应用程序，支持 macOS、Windows 及 Linux 全平台。它兼容多种包管理工具，安装便捷，并能无缝集成到现有的开发环境中。无论您是追求极致控制权的资深极客，还是渴望提升产出的独立开发者，OpenCode 都提供了一个透明、可信",144296,1,"2026-04-16T14:50:03",[13,45],"插件",{"id":47,"name":48,"github_repo":49,"description_zh":50,"stars":51,"difficulty_score":32,"last_commit_at":52,"category_tags":53,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 
Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及 Apple Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":55,"name":56,"github_repo":57,"description_zh":58,"stars":59,"difficulty_score":32,"last_commit_at":60,"category_tags":61,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[45,13,15,14],{"id":63,"github_repo":64,"name":65,"description_en":66,"description_zh":67,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":77,"owner_email":77,"owner_twitter":73,"owner_website":78,"owner_url":79,"languages":80,"stars":89,"forks":90,"last_commit_at":91,"license":92,"difficulty_score":42,"env_os":93,"env_gpu":94,"env_ram":95,"env_deps":96,"category_tags":105,"github_topics":107,"view_count":121,"oss_zip_url":77,"oss_zip_packed_at":77,"status":17,"created_at":122,"updated_at":123,"faqs":124,"releases":153},8805,"huggingface\u002Fdatasets","datasets","🤗 The largest hub of ready-to-use datasets for AI models with fast, easy-to-use and efficient data manipulation tools","datasets 是 Hugging Face 推出的开源数据集库，旨在为 AI 模型训练提供一站式的数据解决方案。它汇集了全球规模最大的公开数据集资源，涵盖文本、图像、音频等多种模态，支持 400 多种语言。\n\n在 AI 开发中，数据的下载、清洗和格式化往往耗时且容易出错。datasets 通过极简的代码接口解决了这一痛点：用户只需一行命令即可加载并预处理海量数据，无需手动编写复杂的读取脚本。它不仅支持本地常见的 CSV、JSON 等格式，还能直接对接 Hugging Face 社区托管的数千个数据集，让数据准备过程变得快速且可复现。\n\n这款工具特别适合机器学习工程师、数据科学家及研究人员使用。其核心技术亮点在于基于 Apache Arrow 的内存映射机制，能够突破物理内存限制，轻松处理超大规模数据集而不会导致内存溢出。此外，它还具备智能缓存功能，避免重复处理数据，并原生支持与 PyTorch、TensorFlow、JAX 等主流深度学习框架无缝协作。无论是快速验证想法还是构建大型生产级模型，datasets 都能帮助用户将精力从繁琐的数据工程中解放出来，专注于算法创新。","\u003Cp align=\"center\">\n  \u003Cpicture>\n    \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhuggingface\u002Fdocumentation-images\u002Fraw\u002Fmain\u002Fdatasets-logo-dark.svg\">\n    \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhuggingface\u002Fdocumentation-images\u002Fraw\u002Fmain\u002Fdatasets-logo-light.svg\">\n    \u003Cimg alt=\"Hugging Face Datasets Library\" src=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhuggingface\u002Fdocumentation-images\u002Fraw\u002Fmain\u002Fdatasets-logo-light.svg\" width=\"352\" height=\"59\" style=\"max-width: 100%;\">\n  \u003C\u002Fpicture>\n  \u003Cbr\u002F>\n  \u003Cbr\u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Factions\u002Fworkflows\u002Fci.yml?query=branch%3Amain\">\u003Cimg alt=\"Build\" src=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Factions\u002Fworkflows\u002Fci.yml\u002Fbadge.svg?branch=main\">\u003C\u002Fa>\n    \u003Ca 
href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fblob\u002Fmain\u002FLICENSE\">\u003Cimg alt=\"GitHub\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fhuggingface\u002Fdatasets.svg?color=blue\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Findex.html\">\u003Cimg alt=\"Documentation\" src=\"https:\u002F\u002Fimg.shields.io\u002Fwebsite\u002Fhttp\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Findex.html.svg?down_color=red&down_message=offline&up_message=online\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Freleases\">\u003Cimg alt=\"GitHub release\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002Fhuggingface\u002Fdatasets.svg\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002F\">\u003Cimg alt=\"Number of datasets\" src=\"https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https:\u002F\u002Fhuggingface.co\u002Fapi\u002Fshields\u002Fdatasets&color=brightgreen\">\u003C\u002Fa>\n    \u003Ca href=\"CODE_OF_CONDUCT.md\">\u003Cimg alt=\"Contributor Covenant\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FContributor%20Covenant-2.0-4baaaa.svg\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fzenodo.org\u002Fbadge\u002Flatestdoi\u002F250213286\">\u003Cimg src=\"https:\u002F\u002Fzenodo.org\u002Fbadge\u002F250213286.svg\" alt=\"DOI\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n🤗 Datasets is a lightweight library providing **two** main features:\n\n- **one-line dataloaders for many public datasets**: one-liners to download and pre-process any of the ![number of datasets](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https:\u002F\u002Fhuggingface.co\u002Fapi\u002Fshields\u002Fdatasets&color=brightgreen) major public datasets (image datasets, audio datasets, text datasets in 467 languages and dialects, etc.) provided on the [HuggingFace Datasets Hub](https:\u002F\u002Fhuggingface.co\u002Fdatasets). With a simple command like `squad_dataset = load_dataset(\"rajpurkar\u002Fsquad\")`, get any of these datasets ready to use in a dataloader for training\u002Fevaluating a ML model (Numpy\u002FPandas\u002FPyTorch\u002FTensorFlow\u002FJAX),\n- **efficient data pre-processing**: simple, fast and reproducible data pre-processing for the public datasets as well as your own local datasets in CSV, JSON, text, PNG, JPEG, WAV, MP3, Parquet, HDF5, etc. 
With simple commands like `processed_dataset = dataset.map(process_example)`, efficiently prepare the dataset for inspection and ML model evaluation and training.\n\n[🎓 **Documentation**](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002F) [🔎 **Find a dataset in the Hub**](https:\u002F\u002Fhuggingface.co\u002Fdatasets) [🌟 **Share a dataset on the Hub**](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fshare)\n\n\u003Ch3 align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fhf.co\u002Fcourse\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_datasets_readme_d40d18c65161.png\">\u003C\u002Fa>\n\u003C\u002Fh3>\n\n🤗 Datasets is designed to let the community easily add and share new datasets.\n\n🤗 Datasets has many additional interesting features:\n\n- Thrive on large datasets: 🤗 Datasets naturally frees the user from RAM limitations; all datasets are memory-mapped using an efficient zero-serialization cost backend (Apache Arrow).\n- Smart caching: never wait for your data to be processed several times.\n- Lightweight and fast with a transparent and pythonic API (multi-processing\u002Fcaching\u002Fmemory-mapping).\n- Built-in interoperability with NumPy, PyTorch, TensorFlow 2, JAX, Pandas, Polars and more.\n- Native support for audio, image and video data.\n- Enable streaming mode to save disk space and start iterating over the dataset immediately.\n\n🤗 Datasets originated from a fork of the awesome [TensorFlow Datasets](https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Fdatasets) and the HuggingFace team wants to deeply thank the TensorFlow Datasets team for building this amazing library.\n\n# Installation\n\n## With pip\n\n🤗 Datasets can be installed from PyPI and should be installed in a virtual environment (venv or conda, for instance):\n\n```bash\npip install datasets\n```\n\n## With conda\n\n🤗 Datasets can be installed using conda as follows:\n\n```bash\nconda install -c huggingface -c conda-forge datasets\n```\n\nFollow the installation pages of TensorFlow and PyTorch to see how to install them with conda.\n\nFor more details on installation, check the installation page in the documentation: https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Finstallation\n\n## Installation to use with Machine Learning & Data frameworks\n\nIf you plan to use 🤗 Datasets with PyTorch (2.0+), TensorFlow (2.6+) or JAX (0.4+), you should also install PyTorch, TensorFlow or JAX.\n🤗 Datasets is also well integrated with data frameworks like PyArrow, Pandas, Polars and Spark, which should be installed separately.\n\nFor more details on using the library with these frameworks, check the quick start page in the documentation: https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fquickstart\n\n# Usage\n\n🤗 Datasets is made to be very simple to use: the API is centered around a single function, `datasets.load_dataset(dataset_name, **kwargs)`, that instantiates a dataset.\n\nThis library can be used for text\u002Fimage\u002Faudio\u002Fetc. datasets. Here is a quick example:\n\n```python\nfrom datasets import load_dataset\n\n# Print the first 20 datasets available on the Hub\nfrom huggingface_hub import list_datasets\nprint([dataset.id for dataset in list_datasets(limit=20)])\n\n# Load a dataset and print the first example in the training set\nsquad_dataset = load_dataset('rajpurkar\u002Fsquad')\nprint(squad_dataset['train'][0])\n\n# Process the dataset - add a column with the length of the context texts\ndataset_with_length = squad_dataset.map(lambda x: {\"length\": len(x[\"context\"])})\n\n# Process the dataset - tokenize the context texts (using a tokenizer from the 🤗 Transformers library)\nfrom transformers import AutoTokenizer\ntokenizer = AutoTokenizer.from_pretrained('bert-base-cased')\n\ntokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)\n```\n\nIf your dataset is bigger than your disk or if you don't want to wait to download the data, you can use streaming:\n\n```python\n# If you want to use the dataset immediately and efficiently stream the data as you iterate over the dataset\nimage_dataset = load_dataset('timm\u002Fimagenet-1k-wds', streaming=True)\nfor example in image_dataset[\"train\"]:\n    break\n```\n\nFor more details on using the library, check the quick start page in the documentation: https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fquickstart and the specific pages on:\n\n- Loading a dataset: https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Floading\n- What's in a Dataset: https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Faccess\n- Processing data with 🤗 Datasets: https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fprocess\n    - Processing audio data: https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Faudio_process\n    - Processing image data: https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fimage_process\n    - Processing text data: https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fnlp_process\n- Streaming a dataset: https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fstream\n- etc.\n\n# Add a new dataset to the Hub\n\nWe have a very detailed step-by-step guide to add a new dataset to the ![number of datasets](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https:\u002F\u002Fhuggingface.co\u002Fapi\u002Fshields\u002Fdatasets&color=brightgreen) datasets already provided on the [HuggingFace Datasets Hub](https:\u002F\u002Fhuggingface.co\u002Fdatasets).\n\nYou can find:\n- [how to upload a dataset to the Hub using your web browser or Python](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fupload_dataset) and also\n- [how to upload it using Git](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fshare).\n\n# Disclaimers\n\nYou can use 🤗 Datasets to load datasets based on versioned git repositories maintained by the dataset authors. For reproducibility reasons, we ask users to pin the `revision` of the repositories they use.\n\n
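For example, a minimal sketch of pinning (the `revision` argument of `load_dataset` accepts a branch, tag or commit SHA; \"main\" is only a placeholder here, an exact tag or commit pins more strictly):\n\n```python\nfrom datasets import load_dataset\n\n# Pin the dataset repository to a fixed git revision so that later runs\n# load exactly the same data; prefer a tag or commit SHA over a branch name.\nsquad_dataset = load_dataset(\"rajpurkar\u002Fsquad\", revision=\"main\")\n```\n\n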
If you're a dataset owner and wish to update any part of it (description, citation, license, etc.), or do not want your dataset to be included in the Hugging Face Hub, please get in touch by opening a discussion or a pull request in the Community tab of the dataset page. 
Thanks for your contribution to the ML community!\n\n## BibTeX\n\nIf you want to cite our 🤗 Datasets library, you can use our [paper](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2109.02846):\n\n```bibtex\n@inproceedings{lhoest-etal-2021-datasets,\n    title = \"Datasets: A Community Library for Natural Language Processing\",\n    author = \"Lhoest, Quentin  and\n      Villanova del Moral, Albert  and\n      Jernite, Yacine  and\n      Thakur, Abhishek  and\n      von Platen, Patrick  and\n      Patil, Suraj  and\n      Chaumond, Julien  and\n      Drame, Mariama  and\n      Plu, Julien  and\n      Tunstall, Lewis  and\n      Davison, Joe  and\n      {\\v{S}}a{\\v{s}}ko, Mario  and\n      Chhablani, Gunjan  and\n      Malik, Bhavitvya  and\n      Brandeis, Simon  and\n      Le Scao, Teven  and\n      Sanh, Victor  and\n      Xu, Canwen  and\n      Patry, Nicolas  and\n      McMillan-Major, Angelina  and\n      Schmid, Philipp  and\n      Gugger, Sylvain  and\n      Delangue, Cl{\\'e}ment  and\n      Matussi{\\`e}re, Th{\\'e}o  and\n      Debut, Lysandre  and\n      Bekman, Stas  and\n      Cistac, Pierric  and\n      Goehringer, Thibault  and\n      Mustar, Victor  and\n      Lagunas, Fran{\\c{c}}ois  and\n      Rush, Alexander  and\n      Wolf, Thomas\",\n    booktitle = \"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations\",\n    month = nov,\n    year = \"2021\",\n    address = \"Online and Punta Cana, Dominican Republic\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\u002F\u002Faclanthology.org\u002F2021.emnlp-demo.21\",\n    pages = \"175--184\",\n    abstract = \"The scale, variety, and quantity of publicly-available NLP datasets has grown rapidly as researchers propose new tasks, larger models, and novel benchmarks. Datasets is a community library for contemporary NLP designed to support this ecosystem. Datasets aims to standardize end-user interfaces, versioning, and documentation, while providing a lightweight front-end that behaves similarly for small datasets as for internet-scale corpora. The design of the library incorporates a distributed, community-driven approach to adding datasets and documenting usage. After a year of development, the library now includes more than 650 unique datasets, has more than 250 contributors, and has helped support a variety of novel cross-dataset research projects and shared tasks. 
The library is available at https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets.\",\n    eprint={2109.02846},\n    archivePrefix={arXiv},\n    primaryClass={cs.CL},\n}\n```\n\nIf you need to cite a specific version of our 🤗 Datasets library for reproducibility, you can use the corresponding version Zenodo DOI from this [list](https:\u002F\u002Fzenodo.org\u002Fsearch?q=conceptrecid:%224817768%22&sort=-version&all_versions=True).\n","\u003Cp align=\"center\">\n  \u003Cpicture>\n    \u003Csource media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhuggingface\u002Fdocumentation-images\u002Fraw\u002Fmain\u002Fdatasets-logo-dark.svg\">\n    \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhuggingface\u002Fdocumentation-images\u002Fraw\u002Fmain\u002Fdatasets-logo-light.svg\">\n    \u003Cimg alt=\"Hugging Face 数据集库\" src=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhuggingface\u002Fdocumentation-images\u002Fraw\u002Fmain\u002Fdatasets-logo-light.svg\" width=\"352\" height=\"59\" style=\"max-width: 100%;\">\n  \u003C\u002Fpicture>\n  \u003Cbr\u002F>\n  \u003Cbr\u002F>\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Factions\u002Fworkflows\u002Fci.yml?query=branch%3Amain\">\u003Cimg alt=\"构建\" src=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Factions\u002Fworkflows\u002Fci.yml\u002Fbadge.svg?branch=main\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fblob\u002Fmain\u002FLICENSE\">\u003Cimg alt=\"GitHub\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flicense\u002Fhuggingface\u002Fdatasets.svg?color=blue\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Findex.html\">\u003Cimg alt=\"文档\" src=\"https:\u002F\u002Fimg.shields.io\u002Fwebsite\u002Fhttp\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Findex.html.svg?down_color=red&down_message=离线&up_message=在线\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Freleases\">\u003Cimg alt=\"GitHub 发布\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002Fhuggingface\u002Fdatasets.svg\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002F\">\u003Cimg alt=\"数据集数量\" src=\"https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https:\u002F\u002Fhuggingface.co\u002Fapi\u002Fshields\u002Fdatasets&color=brightgreen\">\u003C\u002Fa>\n    \u003Ca href=\"CODE_OF_CONDUCT.md\">\u003Cimg alt=\"贡献者公约\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FContributor%20Covenant-2.0-4baaaa.svg\">\u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fzenodo.org\u002Fbadge\u002Flatestdoi\u002F250213286\">\u003Cimg src=\"https:\u002F\u002Fzenodo.org\u002Fbadge\u002F250213286.svg\" alt=\"DOI\">\u003C\u002Fa>\n\u003C\u002Fp>\n\n🤗 Datasets 是一个轻量级的库，提供 **两项** 主要功能：\n\n- **针对众多公开数据集的一行式数据加载器**：只需一行代码即可下载并预处理 [HuggingFace 数据集中心](https:\u002F\u002Fhuggingface.co\u002Fdatasets) 上提供的 ![数据集数量](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https:\u002F\u002Fhuggingface.co\u002Fapi\u002Fshields\u002Fdatasets&color=brightgreen) 种主要公开数据集（图像数据集、音频数据集、涵盖 467 种语言和方言的文本数据集等）。通过简单的命令如 `squad_dataset = load_dataset(\"rajpurkar\u002Fsquad\")`，即可将这些数据集准备好，直接用于机器学习模型的训练或评估（NumPy\u002FPandas\u002FPyTorch\u002FTensorFlow\u002FJAX）。\n- **高效的数据预处理**：无论是公开数据集还是您本地的 
CSV、JSON、文本、PNG、JPEG、WAV、MP3、Parquet、HDF5 等格式的数据集，都能进行简单、快速且可重复的数据预处理。只需使用类似 `processed_dataset = dataset.map(process_example)` 的命令，即可高效地准备数据集，以便检查以及用于机器学习模型的评估和训练。\n\n[🎓 **文档**](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002F) [🔎 **在中心查找数据集**](https:\u002F\u002Fhuggingface.co\u002Fdatasets) [🌟 **在中心分享数据集**](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fshare)\n\n\u003Ch3 align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fhf.co\u002Fcourse\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_datasets_readme_d40d18c65161.png\">\u003C\u002Fa>\n\u003C\u002Fh3>\n\n🤗 Datasets 的设计宗旨是让社区能够轻松添加和分享新的数据集。\n\n🤗 Datasets 还具备许多其他有趣的功能：\n\n- 适合处理大规模数据集：🤗 Datasets 让用户不再受 RAM 容量限制，所有数据集都采用高效的零序列化开销后端（Apache Arrow）进行内存映射。\n- 智能缓存：同一份数据无需多次等待处理。\n- 轻量且快速，API 设计透明且符合 Python 风格（多进程\u002F缓存\u002F内存映射）。\n- 内置与 NumPy、PyTorch、TensorFlow 2、JAX、Pandas、Polars 等框架的互操作性。\n- 原生支持音频、图像和视频数据。\n- 支持流式模式，节省磁盘空间，并可立即开始遍历数据集。\n\n🤗 Datasets 最初源自优秀的 [TensorFlow Datasets](https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Fdatasets) 的分支，HuggingFace 团队在此向 TensorFlow Datasets 团队致以深深的感谢，感谢他们打造了这一出色的库。\n\n# 安装\n\n## 使用 pip\n\n🤗 Datasets 可从 PyPI 安装，建议在虚拟环境（如 venv 或 conda）中进行安装。\n\n```bash\npip install datasets\n```\n\n## 使用 conda\n\n也可以通过 conda 安装 🤗 Datasets，具体命令如下：\n\n```bash\nconda install -c huggingface -c conda-forge datasets\n```\n\n请参考 TensorFlow 和 PyTorch 的安装页面，了解如何使用 conda 安装它们。\n\n更多安装细节，请参阅文档中的安装页面：https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Finstallation\n\n## 与机器学习及数据框架一起使用时的安装\n\n如果您计划将 🤗 Datasets 与 PyTorch（2.0+）、TensorFlow（2.6+）或 JAX（0.4+）结合使用，还需分别安装 PyTorch、TensorFlow 或 JAX。此外，🤗 Datasets 也与 PyArrow、Pandas、Polars 和 Spark 等数据框架良好集成，这些框架需单独安装。\n\n有关如何将该库与这些框架配合使用的详细信息，请参阅文档中的快速入门页面：https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fquickstart\n\n# 使用方法\n\n🤗 Datasets 的设计非常简单易用——其 API 以单个函数为核心，即 `datasets.load_dataset(dataset_name, **kwargs)`，用于实例化数据集。\n\n该库可用于文本、图像、音频等各类数据集。以下是一个快速示例：\n\n```python\nfrom datasets import load_dataset\n\n# 打印 Hub 上前 20 个可用的数据集\nfrom huggingface_hub import list_datasets\nprint([dataset.id for dataset in list_datasets(limit=20)])\n\n# 加载一个数据集并打印训练集中的第一个样本\nsquad_dataset = load_dataset('rajpurkar\u002Fsquad')\nprint(squad_dataset['train'][0])\n\n# 处理数据集——添加一列显示上下文文本的长度\ndataset_with_length = squad_dataset.map(lambda x: {\"length\": len(x[\"context\"])})\n\n# 处理数据集——对上下文文本进行分词（使用 🤗 Transformers 库中的分词器）\nfrom transformers import AutoTokenizer\ntokenizer = AutoTokenizer.from_pretrained('bert-base-cased')\n\ntokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)\n```\n\n如果您的数据集超出磁盘容量，或者您不想等待数据下载完成，可以使用流式处理：\n\n```python\n# 如果您希望立即使用数据集，并在遍历数据集时高效地进行流式传输\nimage_dataset = load_dataset('timm\u002Fimagenet-1k-wds', streaming=True)\nfor example in image_dataset[\"train\"]:\n    break\n```\n\n有关该库的更多使用详情，请参阅文档中的快速入门页面：https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fquickstart，以及以下特定页面：\n\n- 加载数据集：https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Floading\n- 数据集的内容：https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Faccess\n- 使用 🤗 Datasets 处理数据：https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fprocess\n  - 处理音频数据：https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Faudio_process\n  - 处理图像数据：https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fimage_process\n  - 处理文本数据：https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fnlp_process\n- 流式传输数据集：https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fstream\n- 等等。\n\n# 向 Hub 添加新数据集\n\n我们提供了一份非常详细的分步指南，指导您将新数据集添加到 [HuggingFace 数据集 Hub](https:\u002F\u002Fhuggingface.co\u002Fdatasets) 上已有的 ![数据集数量](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https:\u002F\u002Fhuggingface.co\u002Fapi\u002Fshields\u002Fdatasets&color=brightgreen) 个数据集中。\n\n您可以找到：\n- [如何使用网页浏览器或 Python 将数据集上传到 Hub](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fupload_dataset)，以及\n- [如何使用 Git 上传数据集](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fshare)。\n\n# 免责声明\n\n您可以使用 🤗 Datasets 加载基于数据集作者维护的版本化 Git 仓库的数据集。出于可重复性的考虑，我们要求用户固定所使用的仓库的 `revision`。\n\n
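例如，一个固定 revision 的最小示意（`load_dataset` 的 `revision` 参数接受分支、标签或提交 SHA；此处的 \"main\" 仅为占位，固定到具体标签或提交可获得更严格的可复现性）：\n\n```python\nfrom datasets import load_dataset\n\n# 将数据集仓库固定到特定的 git 版本，确保后续运行加载完全相同的数据；\n# 相比分支名，优先使用标签或提交 SHA\nsquad_dataset = load_dataset(\"rajpurkar\u002Fsquad\", revision=\"main\")\n```\n\n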
如果您是数据集的所有者，并希望更新其任何部分（描述、引用、许可证等），或者不希望您的数据集被纳入 Hugging Face Hub，请通过在数据集页面的“社区”选项卡中发起讨论或拉取请求与我们联系。感谢您对机器学习社区的贡献！\n\n## BibTeX\n\n如果您想引用我们的 🤗 Datasets 库，可以使用我们的 [论文](https:\u002F\u002Fhuggingface.co\u002Fpapers\u002F2109.02846)：\n\n```bibtex\n@inproceedings{lhoest-etal-2021-datasets,\n    title = \"Datasets: A Community Library for Natural Language Processing\",\n    author = \"Lhoest, Quentin  and\n      Villanova del Moral, Albert  and\n      Jernite, Yacine  and\n      Thakur, Abhishek  and\n      von Platen, Patrick  and\n      Patil, Suraj  and\n      Chaumond, Julien  and\n      Drame, Mariama  and\n      Plu, Julien  and\n      Tunstall, Lewis  and\n      Davison, Joe  and\n      {\\v{S}}a{\\v{s}}ko, Mario  and\n      Chhablani, Gunjan  and\n      Malik, Bhavitvya  and\n      Brandeis, Simon  and\n      Le Scao, Teven  and\n      Sanh, Victor  and\n      Xu, Canwen  and\n      Patry, Nicolas  and\n      McMillan-Major, Angelina  and\n      Schmid, Philipp  and\n      Gugger, Sylvain  and\n      Delangue, Cl{\\'e}ment  and\n      Matussi{\\`e}re, Th{\\'e}o  and\n      Debut, Lysandre  and\n      Bekman, Stas  and\n      Cistac, Pierric  and\n      Goehringer, Thibault  and\n      Mustar, Victor  and\n      Lagunas, Fran{\\c{c}}ois  and\n      Rush, Alexander  and\n      Wolf, Thomas\",\n    booktitle = \"Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations\",\n    month = nov,\n    year = \"2021\",\n    address = \"Online and Punta Cana, Dominican Republic\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\u002F\u002Faclanthology.org\u002F2021.emnlp-demo.21\",\n    pages = \"175--184\",\n    abstract = \"随着研究人员提出新的任务、更大的模型和新颖的基准，公开可用的 NLP 数据集的数量、种类和规模迅速增长。Datasets 是一个面向当代 NLP 的社区库，旨在支持这一生态系统。Datasets 致力于标准化最终用户的接口、版本管理和文档，同时提供一个轻量级的前端，无论处理小型数据集还是互联网规模的语料库，其行为都保持一致。该库的设计采用了分布式、社区驱动的方式来添加数据集并记录使用方法。经过一年的开发，该库现已包含 650 多个独特的数据集，拥有 250 多位贡献者，并已帮助支持了多种跨数据集的创新研究项目和共享任务。该库可在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets 上获取。\",\n    eprint={2109.02846},\n    archivePrefix={arXiv},\n    primaryClass={cs.CL},\n}\n```\n\n如果出于可复现性的需要，您想引用 🤗 Datasets 库的某个特定版本，可以使用此 [列表](https:\u002F\u002Fzenodo.org\u002Fsearch?q=conceptrecid:%224817768%22&sort=-version&all_versions=True) 中对应的 Zenodo DOI。","# 🤗 Datasets 快速上手指南\n\n`datasets` 是 Hugging Face 提供的一个轻量级库，旨在让开发者能够轻松加载、预处理和共享各种公共数据集（文本、图像、音频等）。它支持内存映射技术，可高效处理超出 RAM 容量的大型数据集，并原生兼容 PyTorch、TensorFlow、JAX、Pandas 等主流框架。\n\n## 环境准备\n\n- **操作系统**：Linux, macOS, Windows\n- **Python 版本**：建议 Python 3.8+\n- **前置依赖**：\n  - 推荐使用虚拟环境（如 `venv` 或 `conda`）\n  - 若需与深度学习框架配合使用，请预先安装 PyTorch、TensorFlow 或 JAX\n  - 可选依赖：`pyarrow`, `pandas`, `polars` 等用于数据操作\n\n> 💡 提示：国内用户可使用清华或阿里镜像源加速安装过程。\n\n## 安装步骤\n\n### 方式一：使用 pip（推荐）\n\n```bash\npip install datasets -i 
https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 方式二：使用 conda\n\n```bash\nconda install -c huggingface -c conda-forge datasets\n```\n\n如需配合特定机器学习框架，请另行安装对应包，例如：\n\n```bash\npip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n```\n\n## 基本使用\n\n### 1. 加载公开数据集\n\n```python\nfrom datasets import load_dataset\n\n# 加载 SQuAD 数据集\nsquad_dataset = load_dataset('rajpurkar\u002Fsquad')\n\n# 查看训练集第一条数据\nprint(squad_dataset['train'][0])\n```\n\n### 2. 数据预处理\n\n```python\n# 添加新字段：上下文长度\ndataset_with_length = squad_dataset.map(lambda x: {\"length\": len(x[\"context\"])})\n\n# 使用 Transformers 进行分词\nfrom transformers import AutoTokenizer\ntokenizer = AutoTokenizer.from_pretrained('bert-base-cased')\n\ntokenized_dataset = squad_dataset.map(lambda x: tokenizer(x['context']), batched=True)\n```\n\n### 3. 流式加载（适合超大数据集）\n\n```python\n# 启用流模式，无需等待下载完成即可开始迭代\nimage_dataset = load_dataset('timm\u002Fimagenet-1k-wds', streaming=True)\nfor example in image_dataset[\"train\"]:\n    break\n```\n\n### 4. 浏览可用数据集\n\n```python\nfrom huggingface_hub import list_datasets\n\n# 打印前 20 个可用数据集名称\nprint([dataset.id for dataset in list_datasets(limit=20)])\n```\n\n---\n\n更多详细用法请参考官方文档：\n- [加载数据集](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Floading)\n- [数据处理](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fprocess)\n- [流式读取](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fstream)","某初创公司的算法工程师正在构建一个支持 467 种语言的多语言情感分析模型，急需整合分散在全球各地的公开文本数据。\n\n### 没有 datasets 时\n- **数据获取繁琐**：需要手动编写复杂的爬虫或下载脚本，逐个访问不同网站下载 CSV、JSON 等格式各异的数据集，耗时且容易出错。\n- **内存经常爆炸**：尝试加载 GB 级的大规模语料时，因直接读入内存导致服务器 RAM 耗尽，程序频繁崩溃，不得不花费大量时间优化分块读取逻辑。\n- **预处理重复造轮子**：针对每种数据格式都要单独编写清洗和分词代码，缺乏统一的映射（map）接口，导致代码冗余且难以复现。\n- **框架适配困难**：为了将数据喂给 PyTorch 或 TensorFlow 模型，需要额外编写大量的胶水代码进行格式转换和张量对齐。\n\n### 使用 datasets 后\n- **一行代码加载**：通过 `load_dataset(\"multi_lang_corpus\")` 即可一键下载并准备好数千个公共数据集，自动处理多种文件格式和语言方言。\n- **零序列化内存映射**：借助 Apache Arrow 后端实现内存映射，无需担心内存限制，可流畅地在普通机器上操作超大规模数据集。\n- **高效统一预处理**：利用简洁的 `dataset.map()` 接口快速执行分词和清洗，内置智能缓存机制，避免重复计算，显著提升迭代效率。\n- **原生框架互通**：处理后的数据可直接作为迭代器输入 NumPy、PyTorch 或 TensorFlow，无缝衔接训练流程，彻底告别格式转换烦恼。\n\ndatasets 将原本需要数天搭建的数据流水线压缩为几分钟的标准化操作，让开发者能专注于模型创新而非数据搬运。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhuggingface_datasets_235288ef.png","huggingface","Hugging Face","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fhuggingface_90da21a4.png","The AI community building the future.",null,"https:\u002F\u002Fhuggingface.co\u002F","https:\u002F\u002Fgithub.com\u002Fhuggingface",[81,85],{"name":82,"color":83,"percentage":84},"Python","#3572A5",100,{"name":86,"color":87,"percentage":88},"Makefile","#427819",0,21411,3176,"2026-04-17T18:58:44","Apache-2.0","Linux, macOS, Windows","未说明","未说明（支持内存映射，可处理超出 RAM 大小的数据集）",{"notes":97,"python":94,"dependencies":98},"建议安装在虚拟环境（venv 或 conda）中。该库通过 Apache Arrow 后端实现内存映射，能有效突破物理内存限制处理大型数据集。若需与 PyTorch、TensorFlow 或 JAX 配合使用，需单独安装相应框架。支持音频、图像和视频数据的原生处理，并具备流式加载模式以节省磁盘空间。",[99,100,101,102,103,104],"pyarrow","pandas","polars","torch>=2.0 (可选)","tensorflow>=2.6 (可选)","jax>=0.4 
(可选)",[106,35,15,13,16,14],"音频",[108,65,109,110,100,111,112,113,114,115,116,117,118,119,120,73],"nlp","pytorch","tensorflow","numpy","natural-language-processing","computer-vision","machine-learning","deep-learning","speech","ai","artificial-intelligence","llm","dataset-hub",6,"2026-03-27T02:49:30.150509","2026-04-18T09:19:28.697948",[125,130,135,140,145,149],{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},39486,"处理大型数据集时数据加载速度缓慢怎么办？","可以通过并行内存映射（mmap）Arrow 分片来加速加载。一种有效的方法是使用脚本预先对 Arrow 文件进行 mmap 缓存，利用操作系统缓存机制。例如，使用多进程（16 个进程）预加载可将时间从约 1865 秒减少到约 43 秒。此外，社区正在讨论在 `arrow_reader.py` 和 `arrow_dataset.py` 中集成多线程支持以进一步优化 `load_dataset` 和 `load_from_disk` 的性能。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fissues\u002F2252",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},39487,"使用 text 格式加载数据集进行 RoBERTa 预训练时遇到内存溢出（OOM）或 DataLoader 报错如何解决？","如果遇到 OOM 问题，建议检查 PyTorch 的 RandomSampler 实现，社区已有 PR 提出替换为不使用大量内存的算法。对于 DataLoader 报错，确保正确设置了数据集格式（如 `dataset.set_format(type='torch', columns=[\"text\"])`）。如果问题依然存在，可以参考社区提供的修改版脚本或 gist 来调整数据加载逻辑。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fissues\u002F610",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},39488,"使用 load_dataset 加载本地文本文件时报错或卡住怎么办？","该问题在早期版本中存在于 Linux 和 Windows 上（如报错 CSV 解析错误或进程挂起），但在后续版本中已修复。如果遇到类似问题，请确保升级到最新版本的 datasets 库。如果是 CSV 文件加载报错（如列数不匹配），由于 Text 和 CSV 加载逻辑在 1.1.2 版本后已分离，请针对 CSV 单独提交问题并提供复现用的 Colab 链接以便快速定位。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fissues\u002F622",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},39489,"为什么数据集的缓存文件存在，但预处理缓存没有被复用？","对于某些分词器（如 RoBERTa），即使 `.arrow` 缓存文件存在于缓存目录中，前几次运行时可能仍未完全复用预处理缓存。这通常与缓存指纹（fingerprint）的计算方式有关，确保代码环境、分词器版本及预处理函数完全一致有助于触发缓存复用。如果问题持续，请检查是否更新了库版本导致缓存机制变化。","https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fissues\u002F3847",{"id":146,"question_zh":147,"answer_zh":148,"source_url":139},39490,"如何在加载文本数据集时指定自定义配置或解决参数类型错误？","在使用 `load_dataset(\"text\", data_files=...)` 时，`data_files` 参数虽然文档示例中有时显示为字符串，但其正式签名是 `Union[Dict, List]`。建议始终将文件路径包裹在列表中（如 `data_files=[\"data.txt\"]`）以避免潜在的类型解析问题。如果在不同操作系统上行为不一致，请尝试显式指定文件列表而非单个字符串。",{"id":150,"question_zh":151,"answer_zh":152,"source_url":139},39491,"遇到 CSV 文件加载错误（如列数不匹配）该如何处理？","从 1.1.2 版本开始，Text 和 CSV 数据集的加载和解析逻辑已不再共享。如果遇到 `ArrowInvalid: CSV parse error: Expected 1 columns, got 2` 等错误，请检查 `delimiter`（分隔符）和 `column_names` 参数是否正确匹配文件格式。建议提供包含复现步骤的 Google Colab 链接给维护者，以便更快获得针对 CSV 加载器的修复方案。",[154,159,164,169,174,179,184,189,194,199,204,209,214,219,224,229,234,239,244,249],{"id":155,"version":156,"summary_zh":157,"released_at":158},315424,"4.8.4","## 变更内容\n* @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F8087 中添加了对最新版 torchvision 的支持\n* @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F8086 中修复了加载 JSON 时，单个文件仅包含一个对象而导致的回归问题\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fcompare\u002F4.8.3...4.8.4","2026-03-23T14:21:52",{"id":160,"version":161,"summary_zh":162,"released_at":163},315425,"4.8.3","## 变更内容\n* 由 @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F8081 中修复了 split_dataset_by_node 步骤。\n* 由 @albertvillanova 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F8080 中修复了 Json.cast_storage 的文档字符串。\n\n\n**完整变更日志**: 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fcompare\u002F4.8.2...4.8.3","2026-03-19T17:44:39",{"id":165,"version":166,"summary_zh":167,"released_at":168},315426,"4.8.2","## 变更内容\n* 由 @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F8074 中为空结构体添加了 Json 类型\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fcompare\u002F4.8.1...4.8.2","2026-03-17T01:10:32",{"id":170,"version":171,"summary_zh":172,"released_at":173},315427,"4.8.1","## 变更内容\n* 由 @HaukurPall 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F8063 中修复了格式化 Arrow 迭代器中的双重 `yield` 问题。\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fcompare\u002F4.8.0...4.8.1","2026-03-17T00:13:34",{"id":175,"version":176,"summary_zh":177,"released_at":178},315428,"4.8.0","## 数据集特性\n* 从 [HF 存储桶](https:\u002F\u002Fhuggingface.co\u002Fstorage) 读取（和写入）：加载原始数据，进行处理并保存到数据集仓库中，由 @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F8064 中实现。\n\n  ```python\n  from datasets import load_dataset\n  # 从 HF 上的存储桶加载原始数据\n  ds = load_dataset(\"buckets\u002Fusername\u002Fdata-bucket\", data_files=[\"*.jsonl\"])\n  # 或者手动使用 hf:\u002F\u002F 路径\n  ds = load_dataset(\"json\", data_files=[\"hf:\u002F\u002Fbuckets\u002Fusername\u002Fdata-bucket\u002F*.jsonl\"])\n  # 处理、过滤\n  ds = ds.map(...).filter(...)\n  # 发布适合 AI 训练的数据集\n  ds.push_to_hub(\"username\u002Fmy-dataset-ready-for-training\")\n  ```\n\n  这一改进还修复了 macOS 上多进程 `push_to_hub` 导致的段错误问题（现在改用 `spawn` 而不是 `fork`）。同时，升级了 `dill` 和 `multiprocess` 的版本以支持 Python 3.14。\n  \n* @Michael-RDev 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F8068 中对流式可迭代数据集（IterableDataset）进行了改进和修复：\n  * 为 `IterableDataset.push_to_hub` 添加了 `max_shard_size` 参数（不过需要遍历两次才能确定整个数据集的大小——欢迎提出改进建议）；\n  * 为 `IterableDataset` 提供更多原生 Arrow 的可迭代操作；\n  * 更好地支持归档文件中的 glob 模式，例如 `zip:\u002F\u002F*.jsonl::hf:\u002F\u002Fdatasets\u002Fusername\u002Fdataset-name\u002Fdata.zip`；\n  * 修复了 `to_pandas`、`videofolder` 以及 `load_dataset_builder` 的关键字参数相关问题。\n\n## 变更内容\n* @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F8061 中修复了 `reshard_data_sources` 问题；\n* @kushalkkb 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F8060 中改进了无效 `data_files` 模式格式的错误信息；\n* @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F8069 中修复了 JSONL 文件中缺失列的空值填充问题。\n\n## 新贡献者\n* @kushalkkb 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F8060 中完成了首次贡献；\n* @Michael-RDev 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F8068 中完成了首次贡献。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fcompare\u002F4.7.0...4.8.0","2026-03-16T23:52:47",{"id":180,"version":181,"summary_zh":182,"released_at":183},315429,"4.7.0","## 数据集特性\n* 由 @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F8027 中添加了 `Json()` 类型\n  * 现在支持包含任意 JSON 对象的 JSON Lines 文件，例如工具调用数据集。当某个字段或子字段包含混合类型（例如字符串、整数、浮点数、字典、列表的混合，或者键为任意值的字典）时，可以使用 `Json()` 类型来存储这些通常不被 Arrow\u002FParquet 支持的数据。\n  * 可以在任何数据集的 `Features()` 中使用 `Json()` 类型；它也适用于所有接受 `features=` 参数的函数，如 `load_dataset()`、`.map()`、`.cast()`、`.from_dict()` 和 `.from_list()`。\n  * 使用 `on_mixed_types=\"use_json\"` 参数，可以在 `.from_dict()`、`.from_list()` 和 `.map()` 中自动将混合类型的字段设置为 `Json()` 类型。\n\n示例：\n\n你可以使用 `on_mixed_types=\"use_json\"`，或者直接在 `features=` 中指定 [`Json`] 类型：\n\n```python\n>>> ds = 
Dataset.from_dict({\"a\": [0, \"foo\", {\"subfield\": \"bar\"}]})\nTraceback (most recent call last):\n  ...\n  File \"pyarrow\u002Ferror.pxi\", line 92, in pyarrow.lib.check_status\npyarrow.lib.ArrowInvalid: 无法将类型为 str 的 'foo' 转换为 int64\n\n>>> features = Features({\"a\": Json()})\n>>> ds = Dataset.from_dict({\"a\": [0, \"foo\", {\"subfield\": \"bar\"}]}, features=features)\n>>> ds.features\n{'a': Json()}\n>>> list(ds[\"a\"])\n[0, \"foo\", {\"subfield\": \"bar\"}]\n```\n\n这对于键和值都为任意类型的字典列表也非常有用，可以避免用 `None` 填充缺失的字段：\n\n```python\n>>> ds = Dataset.from_dict({\"a\": [[{\"b\": 0}, {\"c\": 0}]]})\n>>> ds.features\n{'a': List({'b': Value('int64'), 'c': Value('int64')})}\n>>> list(ds[\"a\"])\n[[{'b': 0, 'c': None}, {'b': None, 'c': 0}]]  # 缺失的字段被填充为 None\n\n>>> features = Features({\"a\": List(Json())})\n>>> ds = Dataset.from_dict({\"a\": [[{\"b\": 0}, {\"c\": 0}]]}, features=features)\n>>> ds.features\n{'a': List(Json())}\n>>> list(ds[\"a\"])\n[[{'b': 0}, {'c': 0}]]  # 没问题\n```\n\n另一个使用工具调用数据和 `on_mixed_types=\"use_json\"` 参数的例子（这样就不需要手动指定 `features=`）：\n\n```python\n>>> messages = [\n...     {\"role\": \"user\", \"content\": \"打开客厅的灯，并播放我的电子音乐播放列表。\"},\n...     {\"role\": \"assistant\", \"tool_calls\": [\n...         {\"type\": \"function\", \"function\": {\n...             \"name\": \"control_light\",\n...             \"arguments\": {\"room\": \"living room\", \"state\": \"on\"}\n...         }},\n...         {\"type\": \"function\", \"function\": {\n...             \"name\": \"play_music\",\n...             \"arguments\": {\"playlist\": \"electronic\"}  # 这里是混合类型，因为键 [\"playlist\"] 和 [\"room\", \"state\"] 不同\n...         }}\n...     }],\n...     {\"role\": \"tool\", \"name\": \"control_light\", \"content\": \"客厅的灯现在已经打开了。\"},\n...     {\"role\": \"tool\", \"name\": \"play_music\", \"content\": \"音乐现在正在播放。\"},\n...     {\"role\": \"assistant\", \"content\": \"完成了！\"}\n... 
]\n>>> ds = Dataset.from_","2026-03-09T19:09:25",{"id":185,"version":186,"summary_zh":187,"released_at":188},315430,"4.6.1","## 错误修复\n* 由 @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F8030 中移除推送至 Hub 时生成的临时文件\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fcompare\u002F4.6.0...4.6.1","2026-02-27T23:27:26",{"id":190,"version":191,"summary_zh":192,"released_at":193},315431,"4.6.0","## 数据集特性\n* 支持 Lance 数据集中图像、视频和音频类型\n  * 由 @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7966 中实现，可根据 lance blob 自动推断类型\n  \n  ```python\n  >>> from datasets import load_dataset\n  >>> ds = load_dataset(\"lance-format\u002FOpenvid-1M\", streaming=True, split=\"train\")\n  >>> ds.features\n  {'video_blob': Video(),\n   'video_path': Value('string'),\n   'caption': Value('string'),\n   'aesthetic_score': Value('float64'),\n   'motion_score': Value('float64'),\n   'temporal_consistency_score': Value('float64'),\n   'camera_motion': Value('string'),\n   'frame': Value('int64'),\n   'fps': Value('float64'),\n   'seconds': Value('float64'),\n   'embedding': List(Value('float32'), length=1024)}\n  ```\n* 推送至 Hub 现已支持视频类型\n  * 由 @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7971 中实现的 push_to_hub() 方法，可用于视频\n  \n  ```python\n   >>> from datasets import Dataset, Video\n  >>> ds = Dataset.from_dict({\"video\": [\"path\u002Fto\u002Fvideo.mp4\"]})\n  >>> ds = ds.cast_column(\"video\", Video())\n  >>> ds.push_to_hub(\"username\u002Fmy-video-dataset\")\n  ```\n* 在 `push_to_hub()` 中以原样（PLAIN）将图像\u002F音频\u002F视频 blob 写入 Parquet 文件，由 @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7976 中实现\n  * 这使得跨格式的 Xet 去重成为可能，例如在 Lance、WebDataset、Parquet 文件以及普通视频文件之间进行视频去重，从而加快上传和下载速度到 Hugging Face\n  * 例如，如果你将一个 Lance 视频数据集转换为 Hugging Face 上的 Parquet 视频数据集，由于视频无需重新上传，上传过程会快得多。在底层，Xet 存储会复用 Lance 格式视频中的二进制块来存储 Parquet 格式的视频\n  * 更多信息请参阅：https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fhub\u002Fen\u002Fxet\u002Fdeduplication\n  \n\u003Cp align=\"center\">\n\u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fhub\u002Fen\u002Fxet\u002Fdeduplication\">\n\u003Cimg height=\"200\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fdd0de6a2-24a1-4945-8d25-44b763c1151e\" \u002F>\n\u003C\u002Fa>\n\u003C\u002Fp>\n\n* 添加 `IterableDataset.reshard()`，由 @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7992 中实现\n\n  如果可能的话，对数据集进行重新分片，即把当前的分片进一步拆分成更多的分片。\n  这会增加分片的数量，最终的数据集分片数将大于或等于之前的分片数。\n  如果没有分片可以进一步拆分，则分片数保持不变。\n\n  重新分片的机制取决于数据集的文件格式：\n\n    * Parquet：按行组而非按文件进行分片\n    * 其他格式：尚未实现（欢迎贡献！）\n\n  ```python\n  >>> from datasets import load_dataset\n  >>> ds = load_dataset(\"fancyzhx\u002Famazon_polarity\", split=\"train\", streaming=True)\n  >>> ds\n  IterableDataset({\n      features: ['label', 'title', 'content'],\n      num_shards: 4\n  })\n  >>> ds.reshard()\n  IterableDataset({\n      features: ['label', 'title', 'content'],\n      num_shards: 3600\n  })\n  ```\n\n## 变更内容\n* 修复 l","2026-02-25T12:15:45",{"id":195,"version":196,"summary_zh":197,"released_at":198},315432,"4.5.0","## 数据集特性\n* 由 @eddyxu 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7913 中添加了 Lance 格式支持\n  * 同时支持 Lance 数据集（包括元数据和清单文件）以及独立的 .lance 文件\n  * 例如，使用 [lance-format\u002Ffineweb-edu](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Flance-format\u002Ffineweb-edu)\n\n  ```python\n  
from datasets import load_dataset\n\n  ds = load_dataset(\"lance-format\u002Ffineweb-edu\", streaming=True)\n  for example in ds[\"train\"]:\n      ...\n  ```\n\n## 变更内容\n\n* 由 @Scott-Simmons 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7929 中实现，在 `load_dataset` 中对无效的 `revision` 提前报错\n* 由 @CloseChoice 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7912 中修复了一个低概率但影响较大的示例索引错误\n* 由 @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7938 中修复了从文件对象中获取属性的方法\n* 由 @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7942 中添加了 `_OverridableIOWrapper`\n* 由 @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7943 中添加了 `_generate_shards`\n\n## 新贡献者\n* @eddyxu 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7913 中完成了首次贡献\n* @Scott-Simmons 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7929 中完成了首次贡献\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fcompare\u002F4.4.2...4.5.0","2026-01-14T18:33:15",{"id":200,"version":201,"summary_zh":202,"released_at":203},315433,"4.4.2","## Bug 修复\n* 由 @CloseChoice 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7853 中修复了嵌入存储中的 Nifti 格式问题\n* 由 @qgallouedec 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7855 中将 ArXiv 数据集迁移至 HF Papers\n* 由 @julien-c 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7859 中修复了一些损坏的链接\n* 由 @CloseChoice 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7874 中添加了 Nifti 可视化支持\n* 由 @CloseChoice 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7878 中将 papaya 替换为 niivue\n* 由 @sajmaru 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7884 中修复了 #7846：`add_column` 和 `add_item` 方法错误地（？）要求提供 `new_fingerprint` 参数\n* 由 @ada-ggf25 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7891 中修复了指纹相关问题：将 TMPDIR 视为严格 API 并报错（Issue #7877）\n* 由 @CloseChoice 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7892 中实现了在延迟上传时正确编码 Nifti 文件\n* 由 @The-Obstacle-Is-The-Way 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7887 中修复了 Nifti 相关问题：为 Nifti1ImageWrapper 启用延迟加载\n\n## 小幅新增功能\n* 由 @Aditya2755 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7888 中为 `load_dataset` 添加类型重载，以提升静态类型推断能力\n* 由 @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7899 中添加了对 inspect_ai 评估日志的支持\n* 由 @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7897 中保存了输入分片的长度\n* 由 @lhoestq 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7906 中为了向后兼容，默认不保存 `original_shard_lengths`\n\n## 新贡献者\n* @sajmaru 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7884 中完成了首次贡献\n* @Aditya2755 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7888 中完成了首次贡献\n* @ada-ggf25 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7891 中完成了首次贡献\n* @The-Obstacle-Is-The-Way 在 https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7887 
中完成了首次贡献\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fcompare\u002F4.4.1...4.4.2","2025-12-19T15:05:34",{"id":205,"version":206,"summary_zh":207,"released_at":208},315434,"4.4.1","## Bug fixes and improvements\r\n* Better streaming retries (504 and 429) by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7847\r\n* DOC: remove mode parameter in docstring of pdf and video feature by @CloseChoice in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7848\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fcompare\u002F4.4.0...4.4.1","2025-11-05T16:01:38",{"id":210,"version":211,"summary_zh":212,"released_at":213},315435,"4.4.0","## Dataset Features\r\n* Add nifti support by @CloseChoice in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7815\r\n  * Load medical imaging datasets from Hugging Face:\r\n\r\n  ```python\r\n  ds = load_dataset(\"username\u002Fmy_nifti_dataset\")\r\n  ds[\"train\"][0]  # {\"nifti\": \u003Cnibabel.nifti1.Nifti1Image>}\r\n  ```\r\n  * Load medical imaging datasets from your disk:\r\n\r\n  ```python\r\n  files = [\"\u002Fpath\u002Fto\u002Fscan_001.nii.gz\", \"\u002Fpath\u002Fto\u002Fscan_002.nii.gz\"]\r\n  ds = Dataset.from_dict({\"nifti\": files}).cast_column(\"nifti\", Nifti())\r\n  ds[\"train\"][0]  # {\"nifti\": \u003Cnibabel.nifti1.Nifti1Image>}\r\n  ```\r\n\r\n  * Documentation: https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Fnifti_dataset\r\n* Add num channels to audio by @CloseChoice in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7840\r\n\r\n```python\r\n# samples have shape (num_channels, num_samples)\r\nds = ds.cast_column(\"audio\", Audio())  # default, use all channels\r\nds = ds.cast_column(\"audio\", Audio(num_channels=2))  # use stereo\r\nds = ds.cast_column(\"audio\", Audio(num_channels=1))  # use mono\r\n```\r\n\r\n* Python 3.14 support by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7836\r\n\r\n## What's Changed\r\n* Fix random seed on shuffle and interleave_datasets by @CloseChoice in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7823\r\n* fix ci compressionfs by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7830\r\n* fix: better args passthrough for `_batch_setitems()` by @sghng in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7817\r\n* Fix: Properly render [!TIP] block in stream.shuffle documentation by @art-test-stack in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7833\r\n* resolves the ValueError: Unable to avoid copy while creating an array by @ArjunJagdale in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7831\r\n* fix column with transform by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7843\r\n* support fsspec 2025.10.0 by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7844\r\n\r\n## New Contributors\r\n* @sghng made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7817\r\n* @art-test-stack made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7833\r\n\r\n**Full Changelog**: 
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fcompare\u002F4.3.0...4.4.0","2025-11-04T10:42:47",{"id":215,"version":216,"summary_zh":217,"released_at":218},315436,"4.3.0","## Dataset Features\r\n\r\nEnable large scale distributed dataset streaming:\r\n\r\n* Keep hffs cache in workers when streaming by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7820\r\n* Retry open hf file by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7822\r\n\r\nThese improvements require `huggingface_hub>=1.1.0` to take full effect\r\n\r\n## What's Changed\r\n* fix conda deps by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7810\r\n* Add pyarrow's binary view to features by @delta003 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7795\r\n* Fix polars cast column image by @CloseChoice in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7800\r\n* Allow streaming hdf5 files by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7814\r\n* Fix batch_size default description in to_polars docstrings by @albertvillanova in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7824\r\n* docs: document_dataset PDFs & OCR by @ethanknights in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7812\r\n* Add custom fingerprint support to `from_generator` by @simonreise in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7533\r\n* picklable batch_fn by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7826\r\n\r\n## New Contributors\r\n* @delta003 made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7795\r\n* @CloseChoice made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7800\r\n* @ethanknights made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7812\r\n* @simonreise made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7533\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fcompare\u002F4.2.0...4.3.0","2025-10-23T16:33:59",{"id":220,"version":221,"summary_zh":222,"released_at":223},315437,"4.2.0","## Dataset Features\r\n* Sample without replacement option when interleaving datasets by @radulescupetru in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7786\r\n\r\n  ```python\r\n  ds = interleave_datasets(datasets, stopping_strategy=\"all_exhausted_without_replacement\")\r\n  ```\r\n\r\n* Parquet: add `on_bad_files` argument to error\u002Fwarn\u002Fskip bad files by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7806\r\n\r\n  ```python\r\n  ds = load_dataset(parquet_dataset_id, on_bad_files=\"warn\")\r\n  ```\r\n\r\n* Add parquet scan options and docs by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7801\r\n\r\n  * docs to select columns and filter data efficiently\r\n\r\n  ```python\r\n  ds = load_dataset(parquet_dataset_id, columns=[\"col_0\", \"col_1\"])\r\n  ds = load_dataset(parquet_dataset_id, filters=[(\"col_0\", \"==\", 0)])\r\n  ```\r\n  * new argument to control buffering and caching when streaming\r\n\r\n  ```python\r\n  
fragment_scan_options = pyarrow.dataset.ParquetFragmentScanOptions(cache_options=pyarrow.CacheOptions(prefetch_limit=1, range_size_limit=128 \u003C\u003C 20))\r\n  ds = load_dataset(parquet_dataset_id, streaming=True, fragment_scan_options=fragment_scan_options)\r\n  ```\r\n\r\n## What's Changed\r\n* Document HDF5 support by @klamike in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7740\r\n* update tips in docs by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7790\r\n* feat: avoid some copies in torch formatter by @drbh in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7787\r\n* Support huggingface_hub v0.x and v1.x by @Wauplin in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7783\r\n* Define CI future by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7799\r\n* More Parquet streaming docs by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7803\r\n* Less api calls when resolving data_files by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7805\r\n* typo by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7807\r\n\r\n## New Contributors\r\n* @drbh made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7787\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fcompare\u002F4.1.1...4.2.0","2025-10-09T16:18:22",{"id":225,"version":226,"summary_zh":227,"released_at":228},315438,"4.1.1","## What's Changed\r\n* fix iterate nested field by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7775\r\n* Add support for arrow iterable when concatenating or interleaving by @radulescupetru in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7771\r\n* fix empty dataset to_parquet by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7779\r\n\r\n## New Contributors\r\n* @radulescupetru made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7771\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fcompare\u002F4.1.0...4.1.1","2025-09-18T13:15:08",{"id":230,"version":231,"summary_zh":232,"released_at":233},315439,"4.1.0","## Dataset Features\r\n* feat: use content defined chunking  by @kszucs in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7589\r\n  * Parquet datasets are now [Optimized Parquet](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fhub\u002Fdatasets-libraries#optimized-parquet-files)!\r\n   \u003Cimg width=\"462\" height=\"103\" alt=\"image\" src=\"https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F43703a47-0964-421b-8f01-1a790305de79\" \u002F>\r\n\r\n  * internally uses `use_content_defined_chunking=True` when writing Parquet files\r\n  * this enables fast deduped uploads to Hugging Face!\r\n  \r\n  ```python\r\n  # Now faster thanks to content defined chunking\r\n  ds.push_to_hub(\"username\u002Fdataset_name\")\r\n  ```\r\n  * this optimizes Parquet for Xet, the dedupe-based storage backend of Hugging Face. It makes it possible to skip uploading data that already exists somewhere on HF (in another file or version, for example). 
Parquet content defined chunking sets Parquet page boundaries based on the content of the data, so that duplicate data is easy to detect.\r\n  * with this change, the new default row group size for Parquet is set to 100 MB\r\n  * `write_page_index=True` is also used to enable fast random access for the Dataset Viewer and tools that need it\r\n* Concurrent push_to_hub by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7708\r\n* Concurrent IterableDataset push_to_hub by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7710\r\n* HDF5 support by @klamike in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7690\r\n  * load HDF5 datasets in one line of code\r\n  ```python\r\n  ds = load_dataset(\"username\u002Fdataset-with-hdf5-files\")\r\n  ```\r\n  * each (possibly nested) field in the HDF5 file is parsed as a column, with the first dimension used for rows\r\n\r\n## Other improvements and bug fixes\r\n* Convert to string when needed + faster .zstd by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7683\r\n* fix audio cast storage from array + sampling_rate by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7684\r\n* Fix misleading add_column() usage example in docstring by @ArjunJagdale in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7648\r\n* Allow dataset row indexing with np.int types (#7423) by @DavidRConnell in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7438\r\n* Update fsspec max version to current release 2025.7.0 by @rootAvish in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7701\r\n* Update dataset_dict push_to_hub by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7711\r\n* Retry intermediate commits too by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7712\r\n* num_proc=0 behaves like None, num_proc=1 uses one worker (not the main process); clarify num_proc documentation by @tanuj-rai in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7702\r\n* Update cli.mdx to refer to the new \"hf\" CLI by @evalstate in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7713\r\n* fix num_proc=1 ci test by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7714\r\n* Docs: Use Image(mode=\"F\") for PNG\u002FJPEG depth maps by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7715\r\n* typo by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7716\r\n* fix largelist repr by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7735\r\n* Grammar fix: correct \"showed\" to \"shown\" in fingerprint.py by @brchristian in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7730\r\n* Fix type hint `train_test_split` by @qgallouedec in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7736\r\n* fix(webdataset): don't .lower() field_name by @YassineYousfi in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7726\r\n* Refactor HDF5 and preserve tree structure by @klamike in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7743\r\n* docs: Add column overwrite example to batch 
mapping guide by @Sanjaykumar030 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7737\r\n* Audio: use TorchCodec instead of Soundfile for encoding by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7761\r\n* Support pathlib.Path for feature input by @Joshua-Chin in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7755\r\n* add support for pyarrow string view in features by @onursatici in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7718\r\n* Fix typo in error message for cache directory deletion by @brchristian in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7749\r\n* update torchcodec in ci by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7764\r\n* Bump dill to 0.4.0 by @Bomme in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7763\r\n\r\n## New Contributors\r\n* @DavidRConnell made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7438\r\n* @rootAvish made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7701\r\n* @tanuj-rai made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7702\r\n* @evalstate made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7713\r\n* @brchristian made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7730\r\n* @klamike made their first contribution","2025-09-15T16:41:46",{"id":235,"version":236,"summary_zh":237,"released_at":238},315440,"4.0.0","## New Features\r\n* Add `IterableDataset.push_to_hub()` by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7595\r\n\r\n  ```python\r\n  # Build streaming data pipelines in a few lines of code!\r\n  from datasets import load_dataset\r\n\r\n  ds = load_dataset(..., streaming=True)\r\n  ds = ds.map(...).filter(...)\r\n  ds.push_to_hub(...)\r\n  ```\r\n\r\n* Add `num_proc=` to `.push_to_hub()` (Dataset and IterableDataset) by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7606\r\n\r\n  ```python\r\n  # Faster push to Hub! Available for both Dataset and IterableDataset\r\n  ds.push_to_hub(..., num_proc=8)\r\n  ```\r\n\r\n* New `Column` object\r\n  - Implementation of iteration over values of a column in an IterableDataset object by @TopCoder2K in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7564\r\n  - Lazy column by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7614\r\n\r\n  ```python\r\n  # Syntax:\r\n  ds[\"column_name\"]  # datasets.Column([...]) or datasets.IterableColumn(...)\r\n\r\n  # Iterate on a column:\r\n  for text in ds[\"text\"]:\r\n      ...\r\n\r\n  # Load one cell without bringing the full column in memory\r\n  first_text = ds[\"text\"][0]  # equivalent to ds[0][\"text\"]\r\n  ```\r\n* Torchcodec decoding by @TyTodd in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7616\r\n  - Enables streaming only the ranges you need!
\r\n\r\n  ```python\r\n  # Don't download full audios\u002Fvideos when it's not necessary\r\n  # Now with torchcodec it only streams the required ranges\u002Fframes:\r\n  from datasets import load_dataset\r\n\r\n  ds = load_dataset(..., streaming=True)\r\n  for example in ds:\r\n      video = example[\"video\"]\r\n      frames = video.get_frames_in_range(start=0, stop=6, step=1)  # only stream certain frames\r\n  ```\r\n\r\n  - Requires `torch>=2.7.0` and FFmpeg >= 4\r\n  - Not available for Windows yet, but it is [coming soon](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ftorchcodec\u002Fissues\u002F640) - in the meantime please use `datasets\u003C4.0`\r\n  - Load audio data with `AudioDecoder`:\r\n\r\n  ```python\r\n  audio = dataset[0][\"audio\"]  # \u003Cdatasets.features._torchcodec.AudioDecoder object at 0x11642b6a0>\r\n  samples = audio.get_all_samples()  # or use get_samples_played_in_range(...)\r\n  samples.data  # tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00,  ...,  2.3447e-06, -1.9127e-04, -5.3330e-05]])\r\n  samples.sample_rate  # 16000\r\n\r\n  # old syntax is still supported\r\n  array, sr = audio[\"array\"], audio[\"sampling_rate\"]\r\n  ```\r\n\r\n  - Load video data with `VideoDecoder`:\r\n\r\n  ```python\r\n  video = dataset[0][\"video\"]  # \u003Ctorchcodec.decoders._video_decoder.VideoDecoder object at 0x14a61d5a0>\r\n  first_frame = video.get_frame_at(0)\r\n  first_frame.data.shape  # (3, 240, 320)\r\n  first_frame.pts_seconds  # 0.0\r\n  frames = video.get_frames_in_range(0, 6, 1)\r\n  frames.data.shape  # torch.Size([5, 3, 240, 320])\r\n  ```\r\n\r\n## Breaking changes\r\n* Remove scripts altogether by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7592\r\n  - `trust_remote_code` is no longer supported\r\n* Torchcodec decoding by @TyTodd in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7616\r\n  - torchcodec replaces soundfile for audio decoding\r\n  - torchcodec replaces decord for video decoding\r\n* Replace Sequence by List by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7634\r\n  - Introduction of the `List` type\r\n\r\n  ```python\r\n  from datasets import Features, List, Value\r\n\r\n  features = Features({\r\n      \"texts\": List(Value(\"string\")),\r\n      \"four_paragraphs\": List(Value(\"string\"), length=4)\r\n  })\r\n  ```\r\n\r\n  - `Sequence` was a legacy type from TensorFlow Datasets which converted lists of dicts to dicts of lists. 
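For instance (a hypothetical toy illustration of the legacy behavior, not a datasets API call):\r\n\r\n  ```python\r\n  # under the legacy Sequence type, a field declared as a sequence of dicts\r\n  # was stored as one dict of lists per example:\r\n  rows = [{\"text\": \"a\"}, {\"text\": \"b\"}]          # what you write\r\n  stored = {\"text\": [r[\"text\"] for r in rows]}   # how Sequence stored it\r\n  assert stored == {\"text\": [\"a\", \"b\"]}\r\n  ```\r\n\r\n  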
It is no longer a type; it is now a utility that returns a `List` or a `dict`, depending on the subfeature:\r\n\r\n  ```python\r\n  from datasets import Sequence, Value\r\n\r\n  Sequence(Value(\"string\"))  # List(Value(\"string\"))\r\n  Sequence({\"texts\": Value(\"string\")})  # {\"texts\": List(Value(\"string\"))}\r\n  ```\r\n\r\n## Other improvements and bug fixes\r\n* Refactor `Dataset.map` to reuse cache files mapped with different `num_proc` by @ringohoffman in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7434\r\n* fix string_to_dict test by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7571\r\n* Preserve formatting in concatenated IterableDataset by @francescorubbo in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7522\r\n* Fix typos in PDF and Video documentation by @AndreaFrancis in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7579\r\n* fix: Add embed_storage in Pdf feature by @AndreaFrancis in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7582\r\n* load_dataset splits typing by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7587\r\n* Fixed typos by @TopCoder2K in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7572\r\n* Fix regex library warnings by @emmanuel-ferdman in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7576\r\n* [MINOR:TYPO] Update save_to_disk docstring by @cakiki in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7575\r\n* Add missing property on `RepeatExamplesIterable` by @SilvanCodes in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F758","2025-07-09T14:54:50",{"id":240,"version":241,"summary_zh":242,"released_at":243},315441,"3.6.0","## Dataset Features\r\n* Enable xet in push to hub by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7552\r\n  * Faster downloads\u002Fuploads with Xet storage\r\n  * more info: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fissues\u002F7526\r\n\r\n## Other improvements and bug fixes\r\n* Add try_original_type to DatasetDict.map by @yoshitomo-matsubara in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7544\r\n* Avoid global umask for setting file mode. 
by @ryan-clancy in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7547\r\n* Rebatch arrow iterables before formatted iterable by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7553\r\n* Document the HF_DATASETS_CACHE environment variable in the datasets cache documentation by @Harry-Yang0518 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7532\r\n* fix regression by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7558\r\n* fix: Image Feature in Datasets Library Fails to Handle bytearray Objects from Spark DataFrames (#7517) by @giraffacarp in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7521\r\n* Remove `aiohttp` from direct dependencies by @akx in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7294\r\n\r\n## New Contributors\r\n* @ryan-clancy made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7547\r\n* @Harry-Yang0518 made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7532\r\n* @giraffacarp made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7521\r\n* @akx made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7294\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fcompare\u002F3.5.1...3.6.0","2025-05-07T15:17:49",{"id":245,"version":246,"summary_zh":247,"released_at":248},315442,"3.5.1","## Bug fixes\r\n* support pyarrow 20 by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7540\r\n  * Fix pyarrow error `TypeError: ArrayExtensionArray.to_pylist() got an unexpected keyword argument 'maps_as_pydicts'`\r\n* Write pdf in map by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7487\r\n\r\n## Other improvements\r\n* update fsspec 2025.3.0 by @peteski22 in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7478\r\n* Support underscore int read instruction by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7488\r\n* Support skip_trying_type  by @yoshitomo-matsubara in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7483\r\n* pdf docs fixes by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7519\r\n* Remove conditions for Python \u003C 3.9 by @cyyever in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7474\r\n* mention av in video docs by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7523\r\n* correct use with polars example by @SiQube in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7524\r\n* chore: fix typos by @afuetterer in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7436\r\n\r\n## New Contributors\r\n* @peteski22 made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7478\r\n* @yoshitomo-matsubara made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7483\r\n* @SiQube made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7524\r\n* @afuetterer made their 
first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7436\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fcompare\u002F3.5.0...3.5.1","2025-04-28T14:02:58",{"id":250,"version":251,"summary_zh":252,"released_at":253},315443,"3.5.0","## Dataset Features\r\n* Introduce PDF support (#7318) by @yabramuvdi in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7325\r\n\r\n```python\r\n>>> from datasets import load_dataset, Pdf\r\n>>> repo = \"path\u002Fto\u002Fpdf\u002Ffolder\"  # or username\u002Fdataset_name on Hugging Face\r\n>>> dataset = load_dataset(repo, split=\"train\")\r\n>>> dataset[0][\"pdf\"]\r\n\u003Cpdfplumber.pdf.PDF at 0x1075bc320>\r\n>>> dataset[0][\"pdf\"].pages[0].extract_text()\r\n...\r\n```\r\n\r\n## What's Changed\r\n* Fix local pdf loading by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7466\r\n* Minor fix for metadata files in extension counter by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7464\r\n* Prioritize json by @lhoestq in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7476\r\n\r\n## New Contributors\r\n* @yabramuvdi made their first contribution in https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fpull\u002F7325\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets\u002Fcompare\u002F3.4.1...3.5.0","2025-03-27T16:38:30"]