[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-meta-pytorch--data":3,"tool-meta-pytorch--data":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":80,"owner_twitter":80,"owner_website":81,"owner_url":82,"languages":83,"stars":96,"forks":97,"last_commit_at":98,"license":99,"difficulty_score":100,"env_os":101,"env_gpu":101,"env_ram":101,"env_deps":102,"category_tags":107,"github_topics":80,"view_count":10,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":108,"updated_at":109,"faqs":110,"releases":140},454,"meta-pytorch\u002Fdata","data","A PyTorch repo for data loading and utilities to be shared by the PyTorch domain libraries.","TorchData 是一个由 PyTorch 官方维护的数据加载工具库，旨在对 PyTorch 原有的 DataLoader 和 Dataset 功能进行迭代增强，提供更具扩展性和高性能的数据加载解决方案。\n\nTorchData 核心解决了深度学习训练中“状态保存”与“断点续训”的痛点。原生的 PyTorch DataLoader 在训练中断后往往难以精确恢复现场，而 TorchData 引入了 Stateful DataLoader，作为一个可直接替换原生 DataLoader 的组件，它支持 `state_dict` 和 `load_state_dict` 接口。这意味着开发者可以在训练周期的中途（mid-epoch）保存进度，精确记录迭代位置及随机数状态，从而在故障恢复后实现无缝衔接，极大提升了长时间训练任务的稳定性。此外，TorchData 还包含 torchdata.nodes 模块，提供了一系列可组合的迭代器，允许开发者像搭积木一样构建流式数据预处理管道，进一步优化了数据流转效率。\n\nTorchData 适合所有使用 PyTorch 框架的开发者和研究人员，特别是那些需要处理大规模数据集、进行长时间模型训练，或者对数据加载流程有高度定制化需求的用户。","TorchData 是一个由 PyTorch 官方维护的数据加载工具库，旨在对 PyTorch 原有的 DataLoader 和 Dataset 功能进行迭代增强，提供更具扩展性和高性能的数据加载解决方案。\n\nTorchData 核心解决了深度学习训练中“状态保存”与“断点续训”的痛点。原生的 PyTorch DataLoader 
在训练中断后往往难以精确恢复现场，而 TorchData 引入了 Stateful DataLoader，作为一个可直接替换原生 DataLoader 的组件，它支持 `state_dict` 和 `load_state_dict` 接口。这意味着开发者可以在训练周期的中途（mid-epoch）保存进度，精确记录迭代位置及随机数状态，从而在故障恢复后实现无缝衔接，极大提升了长时间训练任务的稳定性。此外，TorchData 还包含 torchdata.nodes 模块，提供了一系列可组合的迭代器，允许开发者像搭积木一样构建流式数据预处理管道，进一步优化了数据流转效率。\n\nTorchData 适合所有使用 PyTorch 框架的开发者和研究人员，特别是那些需要处理大规模数据集、进行长时间模型训练，或者对数据加载流程有高度定制化需求的用户。通过简单的 pip 或 conda 安装即可集成到现有项目中，帮助用户更高效地管理数据流。","# TorchData\n\n[**What is TorchData?**](#what-is-torchdata) | [**Stateful DataLoader**](#stateful-dataloader) |\n[**Install guide**](#installation) | [**Contributing**](#contributing) | [**License**](#license)\n\n##\n\n## What is TorchData?\n\nThe TorchData project is an iterative enhancement to the PyTorch torch.utils.data.DataLoader and\ntorch.utils.data.Dataset\u002FIterableDataset to make them scalable, performant dataloading solutions. We will be iterating\non the enhancements under [the torchdata repo](torchdata).\n\nOur first change begins with adding checkpointing to torch.utils.data.DataLoader, which can be found in\n[stateful_dataloader, a drop-in replacement for torch.utils.data.DataLoader](torchdata\u002Fstateful_dataloader), by defining\n`load_state_dict` and `state_dict` methods that enable mid-epoch checkpointing, and an API for users to track custom\niteration progress, and other custom states from the dataloader workers such as token buffers and\u002For RNG states.\n\n## Stateful DataLoader\n\n`torchdata.stateful_dataloader.StatefulDataLoader` is a drop-in replacement for torch.utils.data.DataLoader which\nprovides state_dict and load_state_dict functionality. See\n[the Stateful DataLoader main page](torchdata\u002Fstateful_dataloader) for more information and examples. Also check out the\nexamples\n[in this Colab notebook](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1tonoovEd7Tsi8EW8ZHXf0v3yHJGwZP8M?usp=sharing).\n\n## torchdata.nodes\n\ntorchdata.nodes is a library of composable iterators (not iterables!) 
that let you chain together common dataloading and\npre-proc operations. It follows a streaming programming model, although \"sampler + Map-style\" can still be configured if\nyou desire. See [torchdata.nodes main page](torchdata\u002Fnodes) for more details. Stay tuned for a tutorial on\ntorchdata.nodes coming soon!\n\n## Installation\n\n### Version Compatibility\n\nThe following are the corresponding `torchdata` versions and supported Python versions.\n\n| `torch`              | `torchdata`        | `python`          |\n| -------------------- | ------------------ | ----------------- |\n| `master` \u002F `nightly` | `main` \u002F `nightly` | `>=3.9`, `\u003C=3.13` |\n| `2.6.0`              | `0.11.0`           | `>=3.9`, `\u003C=3.13` |\n| `2.5.0`              | `0.10.0`           | `>=3.9`, `\u003C=3.12` |\n| `2.5.0`              | `0.9.0`            | `>=3.9`, `\u003C=3.12` |\n| `2.4.0`              | `0.8.0`            | `>=3.8`, `\u003C=3.12` |\n| `2.0.0`              | `0.6.0`            | `>=3.8`, `\u003C=3.11` |\n| `1.13.1`             | `0.5.1`            | `>=3.7`, `\u003C=3.10` |\n| `1.12.1`             | `0.4.1`            | `>=3.7`, `\u003C=3.10` |\n| `1.12.0`             | `0.4.0`            | `>=3.7`, `\u003C=3.10` |\n| `1.11.0`             | `0.3.0`            | `>=3.7`, `\u003C=3.10` |\n\n### Local pip or conda\n\nFirst, set up an environment. We will be installing a PyTorch binary as well as torchdata. 
If you're using conda, create\na conda environment:\n\n```bash\nconda create --name torchdata\nconda activate torchdata\n```\n\nIf you wish to use `venv` instead:\n\n```bash\npython -m venv torchdata-env\nsource torchdata-env\u002Fbin\u002Factivate\n```\n\nInstall torchdata:\n\nUsing pip:\n\n```bash\npip install torchdata\n```\n\nUsing conda:\n\n```bash\nconda install -c pytorch torchdata\n```\n\n### From source\n\n```bash\npip install .\n```\n\nIn case building TorchData from source fails, install the nightly version of PyTorch following the linked guide on the\n[contributing page](CONTRIBUTING.md#install-pytorch-nightly).\n\n### From nightly\n\nThe nightly version of TorchData is also provided and updated daily from main branch.\n\nUsing pip:\n\n```bash\npip install --pre torchdata --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcpu\n```\n\nUsing conda:\n\n```bash\nconda install torchdata -c pytorch-nightly\n```\n\n## Contributing\n\nWe welcome PRs! See the [CONTRIBUTING](CONTRIBUTING.md) file.\n\n## Beta Usage and Feedback\n\nWe'd love to hear from and work with early adopters to shape our designs. 
Please reach out by raising an issue if you're\ninterested in using this tooling for your project.\n\n## License\n\nTorchData is BSD licensed, as found in the [LICENSE](LICENSE) file.\n","# TorchData\n\n[**TorchData 是什么？**](#what-is-torchdata) | [**有状态的 DataLoader**](#stateful-dataloader) |\n[**安装指南**](#installation) | [**贡献**](#contributing) | [**许可证**](#license)\n\n##\n\n## TorchData 是什么？\n\nTorchData 项目是对 PyTorch `torch.utils.data.DataLoader` 和 `torch.utils.data.Dataset`\u002F`IterableDataset` 的迭代增强，旨在使其成为可扩展、高性能的数据加载解决方案。我们将要在 [torchdata 仓库](torchdata)中对这些增强功能进行迭代。\n\n我们的第一个改动始于为 `torch.utils.data.DataLoader` 添加检查点功能，该功能位于 [stateful_dataloader](torchdata\u002Fstateful_dataloader) 中，它是 `torch.utils.data.DataLoader` 的直接替代品。通过定义 `load_state_dict` 和 `state_dict` 方法，它实现了 mid-epoch checkpointing（Epoch 中途检查点），并提供 API 供用户跟踪自定义迭代进度，以及来自 dataloader workers（数据加载器工作进程）的其他自定义状态，例如 token buffers（令牌缓冲区）和\u002F或 RNG states（随机数生成器状态）。\n\n## Stateful DataLoader\n\n`torchdata.stateful_dataloader.StatefulDataLoader` 是 `torch.utils.data.DataLoader` 的直接替代品，它提供了 `state_dict` 和 `load_state_dict` 功能。请参阅 [Stateful DataLoader 主页](torchdata\u002Fstateful_dataloader) 获取更多信息和示例。也可以查看 [此 Colab notebook](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1tonoovEd7Tsi8EW8ZHXf0v3yHJGwZP8M?usp=sharing) 中的示例。\n\n## torchdata.nodes\n\ntorchdata.nodes 是一个由可组合 iterators（迭代器，注意不是 iterables\u002F可迭代对象！）组成的库，允许您将常见的数据加载和预处理操作链接在一起。它遵循流式编程模型，不过如果您需要，仍然可以配置 \"sampler + Map-style\"（采样器 + Map 风格）模式。有关更多详细信息，请参阅 [torchdata.nodes 主页](torchdata\u002Fnodes)。敬请期待即将推出的 torchdata.nodes 教程！\n\n## 安装\n\n### 版本兼容性\n\n以下是相应的 `torchdata` 版本和支持的 Python 版本。\n\n| `torch`              | `torchdata`        | `python`          |\n| -------------------- | ------------------ | ----------------- |\n| `master` \u002F `nightly` | `main` \u002F `nightly` | `>=3.9`, `\u003C=3.13` |\n| `2.6.0`              | `0.11.0`           | `>=3.9`, `\u003C=3.13` |\n| `2.5.0`              | `0.10.0`           | `>=3.9`, `\u003C=3.12` 
|\n| `2.5.0`              | `0.9.0`            | `>=3.9`, `\u003C=3.12` |\n| `2.4.0`              | `0.8.0`            | `>=3.8`, `\u003C=3.12` |\n| `2.0.0`              | `0.6.0`            | `>=3.8`, `\u003C=3.11` |\n| `1.13.1`             | `0.5.1`            | `>=3.7`, `\u003C=3.10` |\n| `1.12.1`             | `0.4.1`            | `>=3.7`, `\u003C=3.10` |\n| `1.12.0`             | `0.4.0`            | `>=3.7`, `\u003C=3.10` |\n| `1.11.0`             | `0.3.0`            | `>=3.7`, `\u003C=3.10` |\n\n### 本地 pip 或 conda\n\n首先，设置一个环境。我们将安装 PyTorch 二进制文件以及 torchdata。如果您使用 conda，请创建一个 conda 环境：\n\n```bash\nconda create --name torchdata\nconda activate torchdata\n```\n\n如果您希望改用 `venv`：\n\n```bash\npython -m venv torchdata-env\nsource torchdata-env\u002Fbin\u002Factivate\n```\n\n安装 torchdata：\n\n使用 pip：\n\n```bash\npip install torchdata\n```\n\n使用 conda：\n\n```bash\nconda install -c pytorch torchdata\n```\n\n### 从源码安装\n\n```bash\npip install .\n```\n\n如果从源码构建 TorchData 失败，请按照 [贡献页面](CONTRIBUTING.md#install-pytorch-nightly) 上的链接指南安装 PyTorch 的 nightly（每夜构建）版本。\n\n### 从 nightly 版本安装\n\nTorchData 的 nightly 版本也已提供，并每天从主分支更新。\n\n使用 pip：\n\n```bash\npip install --pre torchdata --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fnightly\u002Fcpu\n```\n\n使用 conda：\n\n```bash\nconda install torchdata -c pytorch-nightly\n```\n\n## 贡献\n\n我们欢迎 PR（Pull Request）！请参阅 [CONTRIBUTING](CONTRIBUTING.md) 文件。\n\n## Beta 版本使用与反馈\n\n我们非常乐意听取早期采用者的意见并与之合作，以完善我们的设计。如果您有兴趣在项目中使用此工具，请通过提交 issue 与我们联系。\n\n## 许可证\n\nTorchData 采用 BSD 许可证，详见 [LICENSE](LICENSE) 文件。","# TorchData 快速上手指南\n\nTorchData 是 PyTorch `DataLoader` 和 `Dataset` 的迭代增强库，旨在提供可扩展、高性能的数据加载解决方案，支持状态保存与恢复。\n\n## 环境准备\n\n在安装 TorchData 前，请确保您的环境满足以下依赖要求：\n\n*   **Python**：建议 3.9 及以上版本（不同 TorchData 版本对 Python 支持不同，最新版支持至 3.13）。\n*   **PyTorch**：必须预先安装，且版本需与 TorchData 严格匹配。\n\n**版本兼容参考（部分）：**\n\n| `torch` 版本 | `torchdata` 版本 |\n| :--- | :--- |\n| `2.6.0` | `0.11.0` |\n| `2.5.0` | `0.10.0` \u002F `0.9.0` |\n| `2.4.0` | `0.8.0` 
|\n\n## 安装步骤\n\n推荐在虚拟环境中进行安装。\n\n**1. 创建并激活虚拟环境**\n\n使用 Conda（推荐）：\n```bash\nconda create --name torchdata\nconda activate torchdata\n```\n\n或使用 venv：\n```bash\npython -m venv torchdata-env\nsource torchdata-env\u002Fbin\u002Factivate\n```\n\n**2. 安装 TorchData**\n\n使用 pip 安装：\n```bash\npip install torchdata\n```\n\n或使用 Conda 安装：\n```bash\nconda install -c pytorch torchdata\n```\n\n## 基本使用\n\nTorchData 的核心组件是 `StatefulDataLoader`，它是 PyTorch 原生 `DataLoader` 的直接替代品，增加了 `state_dict` 和 `load_state_dict` 接口，支持断点续训。\n\n**示例：使用 StatefulDataLoader**\n\n```python\nimport torch\nfrom torchdata.stateful_dataloader import StatefulDataLoader\n\n# 1. 准备数据集\ndataset = torch.arange(100)\n\n# 2. 初始化 StatefulDataLoader (用法同 torch.utils.data.DataLoader)\ndataloader = StatefulDataLoader(dataset, batch_size=10)\n\n# 3. 模拟训练过程\nfor batch in dataloader:\n    print(batch)\n    # 4. 在训练过程中保存状态 (例如每个 epoch 结束或中断时)\n    state = dataloader.state_dict()\n    \n    # 5. 恢复状态 (例如从检查点恢复训练)\n    # dataloader.load_state_dict(state)\n```","某互联网大厂算法工程师正在训练一个基于海量用户行为数据的推荐模型，单个 Epoch 耗时长达 48 小时，且训练环境为竞价实例，随时可能被中断。\n\n### 没有 TorchData 时\n- 训练任务在 Epoch 进行到 90% 时因实例被抢占而中断，不得不从 Epoch 开头重新加载数据，直接浪费了近两天的昂贵算力成本。\n- 为了实现断点续训，需要手写复杂的逻辑来保存和恢复数据读取位置、随机数生成器（RNG）状态及 Worker 缓冲区，开发维护成本极高。\n- 重启后数据顺序难以完美复现，可能导致部分样本被重复训练或遗漏，破坏了训练过程的可复现性。\n\n### 使用 TorchData 后\n- 使用 `StatefulDataLoader` 替代原生 DataLoader，中断后可直接通过 `load_state_dict` 恢复到中断前的精确位置，算力浪费降为零。\n- 无需手写繁琐的状态管理代码，TorchData 自动处理迭代进度、多进程 Worker 状态和随机状态的保存与恢复，代码简洁健壮。\n- 完美复现中断前的数据流状态，保证训练数据顺序的一致性，确保模型训练过程的严谨性。\n\nTorchData 通过原生支持“Epoch 中途检查点”功能，彻底解决了长周期训练任务因意外中断导致的数据重载难题，显著降低了算力成本与开发复杂度。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmeta-pytorch_data_05534980.png","meta-pytorch","Meta 
PyTorch","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fmeta-pytorch_1dfd3f76.jpg","",null,"https:\u002F\u002Fpytorch.org","https:\u002F\u002Fgithub.com\u002Fmeta-pytorch",[84,88,92],{"name":85,"color":86,"percentage":87},"Python","#3572A5",98.3,{"name":89,"color":90,"percentage":91},"Shell","#89e051",1.7,{"name":93,"color":94,"percentage":95},"Batchfile","#C1F12E",0.1,1253,174,"2026-04-03T14:22:42","BSD-3-Clause",1,"未说明",{"notes":103,"python":104,"dependencies":105},"TorchData 是 PyTorch DataLoader 的增强库，支持状态保存和检查点功能。安装时需注意 torch 与 torchdata 的版本强对应关系（如 torch 2.6.0 对应 torchdata 0.11.0）。支持 pip 和 conda 安装，若从源码编译失败需安装 PyTorch nightly 版本。",">=3.7, \u003C=3.13 (不同版本对应的 Python 范围不同，最新版需 >=3.9)",[106],"torch (版本需与 torchdata 严格对应，详见 README 版本兼容表)",[51,13],"2026-03-27T02:49:30.150509","2026-04-06T05:17:11.954466",[111,116,120,125,130,135],{"id":112,"question_zh":113,"answer_zh":114,"source_url":115},1757,"为什么在同一个 IterDataPipe 实例上执行多个操作会报错？","为了保证确定性和快照功能，IterDataPipe 的迭代器被设计为单例模式。如果需要在图中多次引用同一个 DataPipe（例如分别进行不同的 map 操作后再 zip），必须使用 `.fork()` 方法创建多个独立的实例，否则会抛出 RuntimeError。\n\n错误示例：\n```python\nmain_dp = IterableWrapper(range(10))\ndp1 = main_dp.map(lambda x: x + 1)\ndp2 = main_dp.map(lambda x: x * 2)\nfinal_dp = dp1.zip(dp2)  # 这会报错\n```\n\n正确做法：\n```python\nmain_dp = IterableWrapper(range(10))\nmain_dp1, main_dp2 = main_dp.fork(num_instances=2)\ndp1 = main_dp1.map(lambda x: x + 1)\ndp2 = main_dp2.map(lambda x: x * 2)\nfinal_dp = dp1.zip(dp2)\n```","https:\u002F\u002Fgithub.com\u002Fmeta-pytorch\u002Fdata\u002Fissues\u002F45",{"id":117,"question_zh":118,"answer_zh":119,"source_url":115},1758,"如何设计自定义 IterDataPipe 以支持快照和多进程？","建议在自定义 DataPipe 中实现 `__next__` 方法，并在 `__iter__` 中返回 `self`。相比于使用生成器函数（yield），这种方式将迭代状态附加到实例本身，使得实例变得可 pickle，从而更容易支持多进程和快照功能，也便于追踪 RNG（随机数生成器）和迭代次数等内部状态。",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},1759,"使用 DataLoader2 或 FullSyncDataPipe 时程序挂起如何解决？","这通常是由于进程组初始化超时或配置问题。建议使用 `DataLoader2` 并配置 
`PrototypeMultiprocessingReadingService`。如果问题依旧，可以尝试将 DataLoader 包装在 IterableWrapper 中作为临时解决方案。\n\n推荐代码：\n```python\nrs = PrototypeMultiprocessingReadingService(num_workers=NUM_WORKERS)\ndl = DataLoader2(datapipe, reading_service=rs)\n```\n\n临时解决方案：\n```python\ndef loader():\n    pipe = pipes.IterableWrapper(list(range(2000)))\n    pipe = pipe.batch(5)\n    dl = DataLoader(pipe, batch_size=None, num_workers=5)\n    pipe = IterableWrapper(dl).fullsync()\n    return pipe\n```","https:\u002F\u002Fgithub.com\u002Fmeta-pytorch\u002Fdata\u002Fissues\u002F868",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},1760,"在 Windows 上导入 torchdata 时出现 'ImportError: DLL load failed' 怎么办？","该错误通常与 `portalocker` 依赖及其对 `win32api` 的调用有关。`portalocker` 是 Windows 上实现文件锁功能的硬依赖，而 `win32api` 模块缺失可能导致 DLL 加载失败。请检查是否安装了必要的系统依赖或 Visual C++ 运行库。如果在 CI 环境中遇到此问题，可能是特定运行环境的配置缺失。","https:\u002F\u002Fgithub.com\u002Fmeta-pytorch\u002Fdata\u002Fissues\u002F441",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},1761,"下载数据集时出现 SSL 错误或连接异常如何解决？","这类问题通常与网络代理配置或 SSL 协议版本有关（例如代理环境下的 SSL 握手失败）。建议尝试更新 `torchdata` 到最新版本（或 nightly 版本），因为新版本改进了 S3 加载器的错误处理和网络请求的稳定性，能够更好地处理代理和 SSL 相关异常。","https:\u002F\u002Fgithub.com\u002Fmeta-pytorch\u002Fdata\u002Fissues\u002F355",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},1762,"自定义 IterDataPipe 在图遍历时出现 RecursionError 怎么办？","如果自定义 DataPipe 内部组合了其他 DataPipe（例如 `self._dp = IterableWrapper...`），在调用 `traverse` 时可能会因为循环引用导致 RecursionError。该问题已在 PyTorch 的相关 PR（如 #74984）中修复。请确保升级 PyTorch 和 torchdata 到最新版本以解决此问题。","https:\u002F\u002Fgithub.com\u002Fmeta-pytorch\u002Fdata\u002Fissues\u002F237",[141,146,151,156,161,166,171,176,181,186,191,196,201],{"id":142,"version":143,"summary_zh":144,"released_at":145},101261,"v0.11.0","## What's Changed\r\n\r\n* Addition of `Unbatcher` node (#1416 )\r\n* `max_concurrent` instance check and error message fix (#1420 )\r\n* Out of order implementation in StatefulDataloader (#1423 )\r\n* Addition of 
`CYCLE_FOREVER` stop criterion for `MultiNodeWeightedSampler` node (#1424 )\r\n* `generator` initial seed setting in RandomSampler (#1441 )\r\n* `StatefulDataloader` restart behavior when reloaded from a state_dict saved at end of epoch with `num_workers=0` (#1439 )\r\n\r\n## New Contributors\r\n* @FightingZhen made their first contribution in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fpull\u002F1413\r\n* @kit1980 made their first contribution in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fpull\u002F1419\r\n* @michael-diggin made their first contribution in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fpull\u002F1423\r\n* @mirceamironenco made their first contribution in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fpull\u002F1420\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fcompare\u002Fv0.10.1...v0.11.0\r\n@andrewkho @divyanshk @scotts @atalman ","2025-02-21T22:41:46",{"id":147,"version":148,"summary_zh":149,"released_at":150},101262,"v0.10.1","## What's Changed\r\n\r\nThis release introduces 3 major changes:\r\n1) Introducing [`torchdata.nodes`](https:\u002F\u002Fpytorch.org\u002Fdata\u002Fmain\u002Fwhat_is_torchdata_nodes.html), a library of extensible and composable iterators that lets you chain together common dataloading and pre-proc operations! 
This initial release includes the following features, with more on the way:\r\n\r\n    * [Multi-threaded parallelism](https:\u002F\u002Fpytorch.org\u002Fdata\u002Fmain\u002Ftorchdata.nodes.html#torchdata.nodes.ParallelMapper), and experimental support for [Free-Threaded (No-GIL) Python](https:\u002F\u002Fpeps.python.org\u002Fpep-0703\u002F), in addition to the typical Multi-process parallelism.\r\n        * Note: FT Python support is experimental, requires Python 3.13t and [torch>=2.5.0](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fblob\u002Fmain\u002FRELEASE.md), and is currently only tested for Linux\r\n    * [Multi-dataset weighted sampling](https:\u002F\u002Fpytorch.org\u002Fdata\u002Fmain\u002Ftorchdata.nodes.html#torchdata.nodes.MultiNodeWeightedSampler) \r\n    * [State Management through state_dict\u002Fload_state_dict methods](https:\u002F\u002Fpytorch.org\u002Fdata\u002Fmain\u002Ftorchdata.nodes.html#torchdata.nodes.Loader)\r\n    * [Near-feature-parity](https:\u002F\u002Fpytorch.org\u002Fdata\u002Fmain\u002Fmigrate_to_nodes_from_utils.html) with torch.utils.data.DataLoader, with full support for existing torch.utils.data.Dataset (IterableDataset and persistent_workers coming soon!).\r\n    * Refer to the [`torchdata.nodes` docs](https:\u002F\u002Fpytorch.org\u002Fdata\u002Fmain\u002Fwhat_is_torchdata_nodes.html) for more details.\r\n\r\n2) This release drops support for DataPipes and DataLoader2. Release v0.9 was the last stable release that includes them. Please see [this issue](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fissues\u002F1196) for more details.\r\n\r\n3) PyTorch's [official conda channel is deprecated](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fissues\u002F138506). TorchData has removed its conda builds as well. 
TorchData will be available for installation through pip, on PyPI and download.pytorch.org.\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fcompare\u002Fv0.9.0...v0.10.1\r\n\r\n@divyanshk @ramanishsingh @andrewkho ","2024-12-13T18:22:41",{"id":152,"version":153,"summary_zh":154,"released_at":155},101263,"v0.9.0","## What's Changed\r\nThis was a relatively small release compared to previous ones. This will notably be the last stable release to feature DataPipes and DataLoader2!\r\n* Drop Python 3.8 support\r\n* Make DistributedSampler stateful by @ramanishsingh in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fpull\u002F1315\r\n\r\n## New Contributors\r\n* @jovianjaison made their first contribution in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fpull\u002F1314\r\n* @ramanishsingh made their first contribution in https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fpull\u002F1315\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fcommits\u002Fv0.9.0","2024-10-21T22:06:33",{"id":157,"version":158,"summary_zh":159,"released_at":160},101264,"v0.8.0","[TorchData 0.8.0](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Freleases\u002Ftag\u002Fv0.8.0) [Latest](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Freleases\u002Flatest)\r\n\r\n# Highlights\r\n\r\nWe are excited to announce the release of TorchData 0.8.0. This is the first release of [StatefulDataLoader](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Ftree\u002Frelease\u002F0.8\u002Ftorchdata\u002Fstateful_dataloader#statefuldataloader), a drop-in replacement for [torch.utils.data.DataLoader](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fblob\u002Fmain\u002Ftorch\u002Futils\u002Fdata\u002Fdataloader.py), offering state_dict\u002Fload_state_dict methods for handling mid-epoch checkpointing.  
\r\n\r\n\r\n# Deprecations\r\n\r\n⚠️ June 2024 Status Update: Removing DataPipes and DataLoader V2\r\n\r\nWe are re-focusing the torchdata repo to be an iterative enhancement of torch.utils.data.DataLoader. We do not plan on continuing development or maintaining the [DataPipes] and [DataLoaderV2] solutions, and they will be removed from the torchdata repo. We'll also be revisiting the DataPipes references in pytorch\u002Fpytorch. In release torchdata==0.8.0 (July 2024) they will be marked as deprecated, and in 0.9.0 (Oct 2024) they will be deleted. Existing users are advised to pin to torchdata==0.8.0 or an older version until they are able to migrate away. Subsequent releases will not include DataPipes or DataLoaderV2. The old version of this README is [available here](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fblob\u002Fv0.7.1\u002FREADME.md). Please reach out if you suggestions or comments (please use https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fissues\u002F1196 for feedback).\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fcommits\u002Fv0.8.0","2024-08-08T20:34:46",{"id":162,"version":163,"summary_zh":164,"released_at":165},101265,"v0.7.1","[TorchData 0.7.1](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Freleases\u002Ftag\u002Fv0.7.1) [Latest](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Freleases\u002Flatest)\r\n\r\nCurrent status\r\n⚠️ As of July 2023, we have paused active development on TorchData and have paused new releases. We have learnt a lot from building it and hearing from users, but also believe we need to re-evaluate the technical design and approach given how much the industry has changed since we began the project. During the rest of 2023 we will be re-evaluating our plans in this space. 
Please reach out if you have suggestions or comments (please use https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fissues\u002F1196 for feedback).\r\n\r\nThis is a patch release, which is compatible with [PyTorch 2.1.1](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Freleases\u002Ftag\u002Fv2.1.1). There are no new features added. ","2023-11-15T23:23:43",{"id":167,"version":168,"summary_zh":169,"released_at":170},101266,"v0.7.0","# Current status\r\n\r\n**:warning: As of July 2023, we have paused active development on TorchData and have paused new releases. We have learnt a lot from building it and hearing from users, but also believe we need to re-evaluate the technical design and approach given how much the industry has changed since we began the project. During the rest of 2023 we will be re-evaluating our plans in this space. Please reach out if you have suggestions or comments (please use [#1196](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fissues\u002F1196) for feedback).**\r\n\r\n# Bug Fixes\r\n\r\n- MPRS request\u002Fresponse cycle for workers (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fcommit\u002F40dd648bdd2b7b9c078ba3d2f47316b6dd4446d3)\r\n- Sequential reading service checkpointing (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fcommit\u002F8d452cf4d0688fdce478089fe77cba52fc27e1c3)\r\n- Cancel future object and always run callback in FullSync during shutdown (#1171)\r\n- DataPipe, Ensures Prefetcher shuts down properly (#1166)\r\n- DataPipe, Fix FullSync shutdown hanging issue while paused (#1153)\r\n- DataPipe, Fix a word in WebDS DataPipe (#1156)\r\n- DataPipe, Add handler argument to iopath DataPipes (#1154)\r\n- Prevent in_memory_cache from yielding from source_dp when it's fully cached (#1160)\r\n- Fix pin_memory to support single-element batch (#1158)\r\n- DataLoader2, Removing delegation for 'pause', 'limit', and 'resume' (#1067)\r\n- DataLoader2, Handle MapDataPipe by converting to IterDataPipe 
internally by default (#1146)\r\n\r\n# New Features\r\n\r\n- Implement InProcessReadingService  (#1139)\r\n- Enable miniepoch for MultiProcessingReadingService (#1170)\r\n- DataPipe, Implement pause\u002Fresume for FullSync (#1130)\r\n- DataLoader2, Saving and restoring initial seed generator (#998)\r\n- Add ThreadPoolMapper (#1052)\r\n","2023-10-12T20:24:32",{"id":172,"version":173,"summary_zh":174,"released_at":175},101267,"v0.6.1","# TorchData 0.6.1 Beta Release Notes\r\n\r\n\r\n# Highlights\r\n\r\nThis minor release is aligned with PyTorch 2.0.1 and primarily fixes bugs that are introduced in the 0.6.0 release. We sincerely thank our users and contributors for spotting various bugs and helping us to fix them.\r\n\r\n\r\n# Bug Fixes\r\n\r\n\r\n## DataLoader2\r\n\r\n\r\n\r\n* Properly clean up processes and queues for MPRS and Fix pause for prefetch (#1096)\r\n* Fix DataLoader2 `seed = 0` bug (#1098)\r\n    * Previously, if `seed = 0` was passed into `DataLoader2`, the `seed` value in `DataLoader2` would not be set and the seed would be unused. This change fixes that and allow `seed = 0` to be used normally.\r\n* Fix `worker_init_fn` to update DataPipe graph and move worker prefetch to the end of Worker pipeline (#1100)\r\n\r\n\r\n## DataPipe\r\n\r\n\r\n\r\n* Fix `pin_memory_fn` to support `namedtuple` (#1086)\r\n* Fix typo for `portalocker` at import time (#1099)\r\n\r\n\r\n# Improvements\r\n\r\n\r\n## DataPipe\r\n\r\n\r\n\r\n* Skip `FullSync` operation when `world_size == 1` (#1065)\r\n\r\n\r\n# Docs\r\n\r\n\r\n\r\n* Add long project description to `setup.py` for display on PyPI (#1094)\r\n\r\n\r\n# Beta Usage Note\r\n\r\nThis library is currently in the Beta stage and currently does not have a fully stable release. The API may change based on user feedback or performance. We are committed to bring this library to stable release, but future changes may not be completely backward compatible. 
If you install from source or use the nightly version of this library, use it along with the PyTorch nightly binaries. If you have suggestions on the API or use cases you'd like to be covered, please open a GitHub issue. We'd love to hear thoughts and feedback. As always, we welcome new contributors to our repo.\r\n","2023-05-08T20:15:18",{"id":177,"version":178,"summary_zh":179,"released_at":180},101268,"v0.6.0","# TorchData 0.6.0 Beta Release Notes\r\n\r\n\r\n# Highlights\r\n\r\nWe are excited to announce the release of TorchData 0.6.0. This release is composed of about 130 commits since 0.5.0, made by 27 contributors. We want to sincerely thank our community for continuously improving TorchData.\r\n\r\nTorchData 0.6.0 updates are primarily focused on `DataLoader2`. We graduate some of its APIs from the prototype stage and introduce additional features. Highlights include:\r\n\r\n\r\n\r\n* Graduation of `MultiProcessingReadingService` from prototype to beta\r\n    * This is the default `ReadingService` that we expect most users to use; it closely aligns with the functionalities of old `DataLoader` with improvements\r\n    * With this graduation, we expect the APIs and behaviors to be mostly stable going forward. We will continue to add new features as they become ready.\r\n* Introduction of Sequential ReadingService\r\n    * Enables the usage of multiple `ReadingService`s at the same time\r\n* Adding comprehensive tutorial of `DataLoader2` and its subcomponents\r\n\r\n\r\n# Backwards Incompatible Change\r\n\r\n\r\n## DataLoader2\r\n\r\n\r\n\r\n* Officially graduate PrototypeMultiProcessingReadingService to MultiProcessingReadingService ([#1009]([https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fpull\u002F1009](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fpull\u002F1009)))\r\n    * The APIs of `MultiProcessingReadingService` as well as the internal implementation have changed. 
Overall, this should provide a better user experience.\r\n    * Please refer to [our documentation](https:\u002F\u002Fpytorch.org\u002Fdata\u002F0.6\u002Fdataloader2.html#readingservice) for details.\r\n\r\n\u003Cp align=\"center\">\r\n  \u003Ctable align=\"center\">\r\n    \u003Ctr>\u003Cth>0.5.0\u003C\u002Fth>\u003Cth>0.6.0\u003C\u002Fth>\u003C\u002Ftr>\r\n    \u003Ctr valign=\"top\">\r\n      \u003Ctd>\u003Csub> It previously took the following arguments:\r\n\u003Cpre lang=\"python\">\r\nMultiProcessingReadingService(\r\n    num_workers: int = 0,\r\n    pin_memory: bool = False,\r\n    timeout: float = 0,\r\n    worker_init_fn: Optional[Callable[[int], None]] = None,\r\n    multiprocessing_context=None,\r\n    prefetch_factor: Optional[int] = None,\r\n    persistent_workers: bool = False,\r\n)\r\n      \u003C\u002Fpre>\u003C\u002Fsub>\u003C\u002Ftd>\r\n      \u003Ctd>\u003Csub> The new version takes these arguments: \u003Cpre lang=\"python\">\r\nMultiProcessingReadingService(\r\n    num_workers: int = 0,\r\n    multiprocessing_context: Optional[str] = None,\r\n    worker_prefetch_cnt: int = 10,\r\n    main_prefetch_cnt: int = 10,\r\n    worker_init_fn: Optional[Callable[[DataPipe, WorkerInfo], DataPipe]] = None,\r\n    worker_reset_fn: Optional[Callable[[DataPipe, WorkerInfo, SeedGenerator], DataPipe]] = None,\r\n)\r\n      \u003C\u002Fpre>\u003C\u002Fsub>\u003C\u002Ftd>\r\n    \u003C\u002Ftr>\r\n  \u003C\u002Ftable>\r\n\u003C\u002Fp>\r\n\r\n* Deep copy ReadingService during `DataLoader2` initialization ([#746](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fpull\u002F746))\r\n    * Within `DataLoader2`, a deep copy of the passed-in `ReadingService` object is created during initialization and will be subsequently used.\r\n    * This prevents multiple `DataLoader2`s from accidentally sharing 
states when the same `ReadingService` object is passed into them.\r\n\r\n\r\n\u003Cp align=\"center\">\r\n  \u003Ctable align=\"center\">\r\n    \u003Ctr>\u003Cth>0.5.0\u003C\u002Fth>\u003Cth>0.6.0\u003C\u002Fth>\u003C\u002Ftr>\r\n    \u003Ctr valign=\"top\">\r\n      \u003Ctd>\u003Csub> Previously, a ReadingService object used in multiple DataLoader2 instances shared state among them.\r\n\u003Cpre lang=\"python\">\r\n>>> dp = IterableWrapper([0, 1, 2, 3, 4])\r\n>>> rs = MultiProcessingReadingService(num_workers=2)\r\n>>> dl1 = DataLoader2(dp, reading_service=rs)\r\n>>> dl2 = DataLoader2(dp, reading_service=rs)\r\n>>> next(iter(dl1))\r\n>>> print(f\"Number of processes that exist in `dl1`'s RS after initializing `dl1`: {len(dl1.reading_service._worker_processes)}\")\r\n# Number of processes that exist in `dl1`'s RS after initializing `dl1`: 2\r\n>>> next(iter(dl2))\r\n# Note that we are still examining `dl1.reading_service` below\r\n>>> print(f\"Number of processes that exist in `dl1`'s RS after initializing `dl2`: {len(dl1.reading_service._worker_processes)}\")\r\n# Number of processes that exist in `dl1`'s RS after initializing `dl2`: 4\r\n      \u003C\u002Fpre>\u003C\u002Fsub>\u003C\u002Ftd>\r\n      \u003Ctd>\u003Csub> DataLoader2 now deep copies the ReadingService object during initialization and the ReadingService state is no longer shared.\r\n\u003Cpre lang=\"python\">\r\n>>> dp = IterableWrapper([0, 1, 2, 3, 4])\r\n>>> rs = MultiProcessingReadingService(num_workers=2)\r\n>>> dl1 = DataLoader2(dp, reading_service=rs)\r\n>>> dl2 = DataLoader2(dp, reading_service=rs)\r\n>>> next(iter(dl1))\r\n>>> print(f\"Number of processes that exist in `dl1`'s RS after initializing `dl1`: {len(dl1.reading_service._worker_processes)}\")\r\n# Number of processes that exist in `dl1`'s RS after initializing `dl1`: 2\r\n>>> next(iter(dl2))\r\n# Note that we are still examining `dl1.reading_service` below\r\n>>> print(f\"Number of processes that exist in `dl1`'s RS after initializing 
`dl2`: {len(dl1.reading_service._worker_processes)}\")\r\n# Number of processes that exist in `dl1`'s RS after initializing `dl2`: 2\r\n      \u003C\u002Fpre>\u003C\u002Fsub>\u003C\u002Ftd>\r\n    \u003C\u002Ftr>\r\n  \u003C\u002Ftable>\r\n\u003C","2023-03-15T19:38:45",{"id":182,"version":183,"summary_zh":184,"released_at":185},101269,"v0.5.1","This is a minor release to update PyTorch dependency from `1.13.0` to `1.13.1`. Please check the [release note](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Freleases\u002Ftag\u002Fv0.5.0) of TorchData `0.5.0` major release for more detail.","2022-12-16T15:19:31",{"id":187,"version":188,"summary_zh":189,"released_at":190},101270,"v0.5.0","# TorchData 0.5.0 Release Notes\r\n\r\n* Highlights\r\n* Backwards Incompatible Change\r\n* Deprecations\r\n* New Features\r\n* Improvements\r\n* Bug Fixes\r\n* Performance\r\n* Documentation\r\n* Future Plans\r\n* Beta Usage Note\r\n\r\n# Highlights\r\n\r\nWe are excited to announce the release of TorchData 0.5.0. This release is composed of about 236 commits since 0.4.1, including ones from PyTorch Core since 1.12.1, made by more than 35 contributors. We want to sincerely thank our community for continuously improving TorchData.\r\n\r\nTorchData 0.5.0 updates are focused on consolidating the `DataLoader2` and `ReadingService` APIs and benchmarking. Highlights include:\r\n* Added support to load data from more cloud storage providers, now covering AWS, Google Cloud Storage, and Azure. 
A detailed tutorial can be found [here](https:\u002F\u002Fpytorch.org\u002Fdata\u002F0.5\u002Ftutorial.html#working-with-cloud-storage-providers) \r\n  * AWS S3 Benchmarking [result](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fblob\u002Fmain\u002Fbenchmarks\u002Fcloud\u002Faws_s3_results.md)\r\n* Consolidated API for `DataLoader2` and provided a few `ReadingServices`, with detailed documentation now [available here](https:\u002F\u002Fpytorch.org\u002Fdata\u002F0.5\u002Fdataloader2.html) \r\n* Provided more comprehensive `DataPipe` operations, e.g., `random_split`, `repeat`, `set_length`, and `prefetch`.\r\n* Provided pre-compiled torchdata binaries for arm64 Apple Silicon\r\n\r\n# Backwards Incompatible Change\r\n\r\n## DataPipe\r\n\r\n### Changed the returned value of `MapDataPipe.shuffle` to an `IterDataPipe` (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fpull\u002F83202)\r\n`IterDataPipe` is used to preserve data order\r\n\r\n\u003Cp align=\"center\">\r\n  \u003Ctable align=\"center\">\r\n    \u003Cthead>\r\n      \u003Ctr>\r\n        \u003Cth colspan=\"2\">MapDataPipe.shuffle\u003C\u002Fth>\r\n      \u003C\u002Ftr>\r\n    \u003C\u002Fthead>\r\n    \u003Ctr>\u003Cth>0.4.1\u003C\u002Fth>\u003Cth>0.5.0\u003C\u002Fth>\u003C\u002Ftr>\r\n    \u003Ctr valign=\"top\">\r\n      \u003Ctd>\u003Csub>\u003Cpre lang=\"python\">\r\n>>> from torch.utils.data import IterDataPipe, MapDataPipe\r\n>>> from torch.utils.data.datapipes.map import SequenceWrapper\r\n>>> dp = SequenceWrapper(list(range(10))).shuffle()\r\n>>> isinstance(dp, MapDataPipe)\r\nTrue\r\n>>> isinstance(dp, IterDataPipe)\r\nFalse\r\n      \u003C\u002Fpre>\u003C\u002Fsub>\u003C\u002Ftd>\r\n      \u003Ctd>\u003Csub>\u003Cpre lang=\"python\">\r\n>>> from torch.utils.data import IterDataPipe, MapDataPipe\r\n>>> from torch.utils.data.datapipes.map import SequenceWrapper\r\n>>> dp = SequenceWrapper(list(range(10))).shuffle()\r\n>>> isinstance(dp, MapDataPipe)\r\nFalse\r\n>>> isinstance(dp, 
IterDataPipe)\r\nTrue\r\n      \u003C\u002Fpre>\u003C\u002Fsub>\u003C\u002Ftd>\r\n    \u003C\u002Ftr>\r\n  \u003C\u002Ftable>\r\n\u003C\u002Fp>\r\n\r\n### `on_disk_cache` now doesn’t accept generator functions for the argument of `filepath_fn` (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fpull\u002F810) \r\n\r\n\u003Cp align=\"center\">\r\n  \u003Ctable align=\"center\">\r\n    \u003Cthead>\r\n      \u003Ctr>\r\n        \u003Cth colspan=\"2\">on_disk_cache\u003C\u002Fth>\r\n      \u003C\u002Ftr>\r\n    \u003C\u002Fthead>\r\n    \u003Ctr>\u003Cth>0.4.1\u003C\u002Fth>\u003Cth>0.5.0\u003C\u002Fth>\u003C\u002Ftr>\r\n    \u003Ctr valign=\"top\">\r\n      \u003Ctd>\u003Csub>\u003Cpre lang=\"python\">\r\n>>> url_dp = IterableWrapper([\"https:\u002F\u002Fpath\u002Fto\u002Ffilename\", ])\r\n>>> def filepath_gen_fn(url):\r\n...     yield from [url + f\"\u002F{i}\" for i in range(3)]\r\n>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)\r\n      \u003C\u002Fpre>\u003C\u002Fsub>\u003C\u002Ftd>\r\n      \u003Ctd>\u003Csub>\u003Cpre lang=\"python\">\r\n>>> url_dp = IterableWrapper([\"https:\u002F\u002Fpath\u002Fto\u002Ffilename\", ])\r\n>>> def filepath_gen_fn(url):\r\n...     yield from [url + f\"\u002F{i}\" for i in range(3)]\r\n>>> cache_dp = url_dp.on_disk_cache(filepath_fn=filepath_gen_fn)\r\n# AssertionError\r\n      \u003C\u002Fpre>\u003C\u002Fsub>\u003C\u002Ftd>\r\n    \u003C\u002Ftr>\r\n  \u003C\u002Ftable>\r\n\r\n## DataLoader2\r\n\r\n### Imposed single iterator constraint on `DataLoader2` (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fpull\u002F700)\r\n\r\n\u003Cp align=\"center\">\r\n  \u003Ctable align=\"center\">\r\n    \u003Cthead>\r\n      \u003Ctr>\r\n        \u003Cth colspan=\"2\">DataLoader2 with a single iterator\u003C\u002Fth>\r\n      \u003C\u002Ftr>\r\n    \u003C\u002Fthead>\r\n    \u003Ctr>\u003Cth>0.4.1\u003C\u002Fth>\u003Cth>0.5.0\u003C\u002Fth>\u003C\u002Ftr>\r\n    \u003Ctr valign=\"top\">\r\n      
\u003Ctd>\u003Csub>\u003Cpre lang=\"python\">\r\n>>> dl = DataLoader2(IterableWrapper(range(10)))\r\n>>> it1 = iter(dl)\r\n>>> print(next(it1))\r\n0\r\n>>> it2 = iter(dl)  # No reset here\r\n>>> print(next(it2))\r\n1\r\n>>> print(next(it1))\r\n2\r\n      \u003C\u002Fpre>\u003C\u002Fsub>\u003C\u002Ftd>\r\n      \u003Ctd>\u003Csub>\u003Cpre lang=\"python\">\r\n>>> dl = DataLoader2(IterableWrapper(range(10)))\r\n>>> it1 = iter(dl)\r\n>>> print(next(it1))\r\n0\r\n>>> it2 = iter(dl)  # DataLoader2 resets with the creation of a new iterator\r\n>>> print(next(it2))\r\n0\r\n>>> print(next(it1))\r\n# Raises exception, since it1 is no longer valid\r\n      \u003C\u002Fpre>\u003C\u002Fsub>\u003C\u002Ftd>\r\n    \u003C\u002Ftr>\r\n  \u003C\u002Ftable>\r\n\u003C\u002Fp>\r\n\r\n### Deep copy `DataPipe` during `DataLoader2` initialization or restoration (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fpull\u002F786, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fpull\u002F833)\r\nPreviously, if a DataPipe was passed to multiple DataLoaders, its state could be altered by any of those DataLoaders. In some cases, that could raise an exception due to the single iterator constraint; in other cases, some behaviors could be changed due to the adapters (e.g. 
shuffling) of another DataLoader.\r\n\r\n\u003Cp align=\"center\">\r\n  \u003Ctable align=\"center\">\r\n    \u003Cthead>\r\n      \u003Ctr>\r\n        \u003Cth colspan=\"2\">Deep copy DataPipe during DataLoader2 constructor\u003C\u002Fth>\r\n      \u003C\u002Ftr>\r\n","2022-10-27T17:08:20",{"id":192,"version":193,"summary_zh":194,"released_at":195},101271,"v0.4.1","# TorchData 0.4.1 Release Notes\r\n\r\n# Bug fixes\r\n- Fixed `DataPipe` working with `DataLoader` in the distributed environment (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fpull\u002F80348, https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fpull\u002F81071)\r\n\r\n# Documentation\r\n- Updated TorchData tutorial (#675, #688, #715)\r\n\r\n# Releng\r\n- Provided pre-compiled `torchdata` binaries for arm64 Apple Silicon (#692)\r\n  - Python [3.8~3.10]","2022-08-05T20:47:37",{"id":197,"version":198,"summary_zh":199,"released_at":200},101272,"v0.4.0","# TorchData 0.4.0 Release Notes\r\n\r\n* Highlights\r\n* Backwards Incompatible Change\r\n* Deprecations\r\n* New Features\r\n* Improvements\r\n* Performance\r\n* Documentation\r\n* Future Plans\r\n* Beta Usage Note\r\n\r\n# Highlights\r\n\r\nWe are excited to announce the release of TorchData 0.4.0. This release is composed of about 120 commits since 0.3.0, made by 23 contributors. We want to sincerely thank our community for continuously improving TorchData.\r\n\r\nTorchData 0.4.0 updates are focused on consolidating the `DataPipe` APIs and supporting more remote file systems. Highlights include:\r\n\r\n* DataPipe graph is now backward compatible with `DataLoader` regarding dynamic sharding and shuffle determinism in single-process, multiprocessing, and distributed environments. 
Please check the tutorial [here](https:\u002F\u002Fpytorch.org\u002Fdata\u002F0.4.0\u002Ftutorial.html#working-with-dataloader).\r\n* [`AWSSDK`](https:\u002F\u002Fgithub.com\u002Faws\u002Faws-sdk-cpp) is integrated to support listing\u002Floading files from AWS S3.\r\n* Added support to read from `TFRecord` and Hugging Face Hub.\r\n* `DataLoader2` became available in prototype mode. For more details, please check our [future plans](#Future-Plans).\r\n\r\n# Backwards Incompatible Change\r\n\r\n## DataPipe\r\n\r\n### Updated `Multiplexer` (functional API `mux`) to stop merging multiple `DataPipes` whenever the shortest one is exhausted (https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fpull\u002F77145)\r\nPlease use `MultiplexerLongest` (functional API `mux_longest`) to achieve the previous functionality.\r\n\r\n\u003Cp align=\"center\">\r\n  \u003Ctable align=\"center\">\r\n    \u003Ctr>\u003Cth>0.3.0\u003C\u002Fth>\u003Cth>0.4.0\u003C\u002Fth>\u003C\u002Ftr>\r\n    \u003Ctr valign=\"top\">\r\n      \u003Ctd>\u003Csub>\u003Cpre lang=\"python\">\r\n>>> dp1 = IterableWrapper(range(3))\r\n>>> dp2 = IterableWrapper(range(10, 15))\r\n>>> dp3 = IterableWrapper(range(20, 25))\r\n>>> output_dp = dp1.mux(dp2, dp3)\r\n>>> list(output_dp)\r\n[0, 10, 20, 1, 11, 21, 2, 12, 22, 3, 13, 23, 4, 14, 24]\r\n>>> len(output_dp)\r\n13\r\n      \u003C\u002Fpre>\u003C\u002Fsub>\u003C\u002Ftd>\r\n      \u003Ctd>\u003Csub>\u003Cpre lang=\"python\">\r\n>>> dp1 = IterableWrapper(range(3))\r\n>>> dp2 = IterableWrapper(range(10, 15))\r\n>>> dp3 = IterableWrapper(range(20, 25))\r\n>>> output_dp = dp1.mux(dp2, dp3)\r\n>>> list(output_dp)\r\n[0, 10, 20, 1, 11, 21, 2, 12, 22]\r\n>>> len(output_dp)\r\n9\r\n      \u003C\u002Fpre>\u003C\u002Fsub>\u003C\u002Ftd>\r\n    \u003C\u002Ftr>\r\n  \u003C\u002Ftable>\r\n\u003C\u002Fp>\r\n\r\n### Enforcing single valid iterator for `IterDataPipes` w\u002Fwo multiple outputs https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fpull\u002F70479, 
(https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fpytorch\u002Fpull\u002F75995)\r\nIf you need to reference the same `IterDataPipe` multiple times, please apply `.fork()` on the `IterDataPipe` instance.\r\n\r\n\u003Cp align=\"center\">\r\n  \u003Ctable align=\"center\">\r\n    \u003Cthead>\r\n      \u003Ctr>\r\n        \u003Cth colspan=\"2\">IterDataPipe with a single output\u003C\u002Fth>\r\n      \u003C\u002Ftr>\r\n    \u003C\u002Fthead>\r\n    \u003Ctr>\u003Cth>0.3.0\u003C\u002Fth>\u003Cth>0.4.0\u003C\u002Fth>\u003C\u002Ftr>\r\n    \u003Ctr valign=\"top\">\r\n      \u003Ctd>\u003Csub>\u003Cpre lang=\"python\">\r\n>>> source_dp = IterableWrapper(range(10))\r\n>>> it1 = iter(source_dp)\r\n>>> list(it1)\r\n[0, 1, ..., 9]\r\n>>> it1 = iter(source_dp)\r\n>>> next(it1)\r\n0\r\n>>> it2 = iter(source_dp)\r\n>>> next(it2)\r\n0\r\n>>> next(it1)\r\n1\r\n# Multiple references of DataPipe\r\n>>> source_dp = IterableWrapper(range(10))\r\n>>> zip_dp = source_dp.zip(source_dp)\r\n>>> list(zip_dp)\r\n[(0, 0), ..., (9, 9)]\r\n      \u003C\u002Fpre>\u003C\u002Fsub>\u003C\u002Ftd>\r\n      \u003Ctd>\u003Csub>\u003Cpre lang=\"python\">\r\n>>> source_dp = IterableWrapper(range(10))\r\n>>> it1 = iter(source_dp)\r\n>>> list(it1)\r\n[0, 1, ..., 9]\r\n>>> it1 = iter(source_dp)  # This doesn't raise any warning or error\r\n>>> next(it1)\r\n0\r\n>>> it2 = iter(source_dp)\r\n>>> next(it2)  # Invalidates `it1`\r\n0\r\n>>> next(it1)\r\nRuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))\r\nThis may be caused multiple references to the same IterDataPipe. 
We recommend using `.fork()` if that is necessary.\r\nFor feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fissues\u002F45.\r\n# Multiple references of DataPipe\r\n>>> source_dp = IterableWrapper(range(10))\r\n>>> zip_dp = source_dp.zip(source_dp)\r\n>>> list(zip_dp)\r\nRuntimeError: This iterator has been invalidated because another iterator has been created from the same IterDataPipe: IterableWrapperIterDataPipe(deepcopy=True, iterable=range(0, 10))\r\nThis may be caused multiple references to the same IterDataPipe. We recommend using `.fork()` if that is necessary.\r\nFor feedback regarding this single iterator per IterDataPipe constraint, feel free to comment on this issue: https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fissues\u002F45.\r\n      \u003C\u002Fpre>\u003C\u002Fsub>\u003C\u002Ftd>\r\n    \u003C\u002Ftr>\r\n  \u003C\u002Ftable>\r\n\u003C\u002Fp>\r\n\r\n\u003Cp align=\"center\">\r\n  \u003Ctable align=\"center\">\r\n    \u003Cthead>\r\n      \u003Ctr>\r\n        \u003Cth colspan=\"2\">IterDataPipe with multiple outputs\u003C\u002Fth>\r\n      \u003C\u002Ftr>\r\n    \u003C\u002Fthead>\r\n    \u003Ctr>\u003Cth>0.3.0\u003C\u002Fth>\u003Cth>0.4.0\u003C\u002Fth>\u003C\u002Ftr>\r\n    \u003Ctr valign=\"top\">\r\n      \u003Ctd>\u003Csub>\u003Cpre lang=\"python\">\r\n>>> source_dp = IterableWrapper(range(10))\r\n>>> cdp1, cdp2 = source_dp.fork(num_instances=2)\r\n>>> it1, it2 = iter(cdp1), iter(","2022-06-28T18:30:02",{"id":202,"version":203,"summary_zh":204,"released_at":205},101273,"v0.3.0","# 0.3.0 Release Notes\r\n\r\nWe are delighted to present the Beta release of [TorchData](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata). This is a library of common modular data loading primitives for easily constructing flexible and performant data pipelines. 
Based on community feedback, we have found that the existing DataLoader bundled too many features together and can be difficult to extend. Moreover, different use cases often have to rewrite the same data loading utilities over and over again. The goal here is to enable composable data loading through Iterable-style and Map-style building blocks called [“DataPipes”](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata#what-are-datapipes) that work well out of the box with PyTorch’s [`DataLoader`](https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Fdata.html#torch.utils.data.DataLoader).\r\n\r\n* Highlights\r\n  * What are DataPipes?\r\n  * Usage Example\r\n* New Features\r\n* Documentation\r\n* Usage in Domain Libraries\r\n* Future Plans\r\n* Beta Usage Note\r\n\r\n## Highlights\r\n\r\nWe are releasing DataPipes - there are Iterable-style DataPipe ([`IterDataPipe`](https:\u002F\u002Fpytorch.org\u002Fdata\u002F0.3.0\u002Ftorchdata.datapipes.iter.html)) and Map-style DataPipe ([`MapDataPipe`](https:\u002F\u002Fpytorch.org\u002Fdata\u002F0.3.0\u002Ftorchdata.datapipes.map.html)). \r\n\r\n### What are DataPipes?\r\n\r\nEarly on, we observed widespread confusion between the PyTorch `DataSets` which represented reusable loading tooling (e.g. [TorchVision's `ImageFolder`](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fvision\u002Fblob\u002Fmain\u002Ftorchvision\u002Fdatasets\u002Ffolder.py#L272)), and those that represented pre-built iterators\u002Faccessors over actual data corpora (e.g. [TorchVision's `ImageNet`](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fvision\u002Fblob\u002Fmain\u002Ftorchvision\u002Fdatasets\u002Fimagenet.py#L21)). This led to an unfortunate pattern of siloed inheritance of data tooling rather than composition.\r\n\r\n`DataPipe` is simply a renaming and repurposing of the PyTorch `DataSet` for composed usage. 
A `DataPipe` takes in some access function over Python data structures, `__iter__` for `IterDataPipes` and `__getitem__` for `MapDataPipes`, and returns a new access function with a slight transformation applied. For example, take a look at this `JsonParser`, which accepts an IterDataPipe over file names and raw streams, and produces a new iterator over the filenames and deserialized data:\r\n\r\n```py\r\nimport json\r\n\r\nclass JsonParserIterDataPipe(IterDataPipe):\r\n    def __init__(self, source_datapipe, **kwargs) -> None:\r\n        self.source_datapipe = source_datapipe\r\n        self.kwargs = kwargs\r\n\r\n    def __iter__(self):\r\n        for file_name, stream in self.source_datapipe:\r\n            data = stream.read()\r\n            yield file_name, json.loads(data)\r\n\r\n    def __len__(self):\r\n        return len(self.source_datapipe)\r\n```\r\n\r\nYou can see in this example how DataPipes can be easily chained together to compose graphs of transformations that reproduce sophisticated data pipelines, with streamed operation as a first-class citizen.\r\n\r\nUnder this naming convention, `DataSet` simply refers to a graph of `DataPipes`, and a dataset module like `ImageNet` can be rebuilt as a factory function returning the requisite composed `DataPipes`.\r\n\r\n## Usage Example\r\n\r\nIn this example, we have a compressed TAR archive file stored in Google Drive and accessible via a URL. We demonstrate how you can use DataPipes to download the archive, cache the result, decompress the archive, filter for specific files, parse and return the CSV content. The full example with detailed explanation is [included in the example folder](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Fdata\u002Fblob\u002Frelease\u002F0.3.0\u002Fexamples\u002Ftext\u002Famazonreviewpolarity.py).\r\n\r\n```py\r\nurl_dp = IterableWrapper([URL])\r\ncache_compressed_dp = GDriveReader(url_dp)\r\n# cache_decompressed_dp = ... 
# See source file for full code example\r\n# Opens and loads the content of the TAR archive file.\r\ncache_decompressed_dp = FileOpener(cache_decompressed_dp, mode=\"b\").load_from_tar()\r\n# Filters for specific files based on the file name.\r\ncache_decompressed_dp = cache_decompressed_dp.filter(\r\n    lambda fname_and_stream: _EXTRACTED_FILES[split] in fname_and_stream[0]\r\n)\r\n# Saves the decompressed file onto disk.\r\ncache_decompressed_dp = cache_decompressed_dp.end_caching(mode=\"wb\", same_filepath_fn=True)\r\ndata_dp = FileOpener(cache_decompressed_dp, mode=\"b\")\r\n# Parses content of the decompressed CSV file and returns the result line by line.\r\nreturn data_dp.parse_csv().map(fn=lambda t: (int(t[0]), \" \".join(t[1:])))\r\n```\r\n\r\n## New Features\r\n\r\n[Beta] [IterDataPipe](https:\u002F\u002Fpytorch.org\u002Fdata\u002F0.3.0\u002Ftorchdata.datapipes.iter.html)\r\n\r\nWe have implemented over 50 Iterable-style DataPipes across 10 different categories. They cover different functionalities, such as opening files, parsing texts, transforming samples, caching, shuffling, and batching. For users who are interested in connecting to cloud providers (such as Google Drive or AWS S3), the [fsspec and iopath DataPipes](https:\u002F\u002Fpytorch.org\u002Fdata\u002F0.3.0\u002Ftorchdata.datapipes.iter.html#io-datapipes) will ","2022-03-10T18:43:54"]