[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-mosaicml--streaming":3,"tool-mosaicml--streaming":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",149489,2,"2026-04-10T11:32:46",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":77,"owner_website":78,"owner_url":79,"languages":80,"stars":93,"forks":94,"last_commit_at":95,"license":96,"difficulty_score":32,"env_os":97,"env_gpu":97,"env_ram":97,"env_deps":98,"category_tags":105,"github_topics":106,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":112,"updated_at":113,"faqs":114,"releases":135},6262,"mosaicml\u002Fstreaming","streaming","A Data Streaming Library for Efficient Neural Network Training","streaming 是一款专为高效神经网络训练打造的数据流式传输库。在人工智能大模型训练中，面对海量数据集，传统方式往往需要先将数据完整下载到本地存储，这不仅耗时耗力，还极大地增加了存储成本和管理复杂度。streaming 正是为了解决这一痛点而生，它允许用户直接从云端存储（如 AWS S3、Google Cloud、Azure 等）按需“流式”读取训练数据，无需预先下载全量数据。\n\n这款工具特别适合从事大模型研发的 AI 工程师、数据科学家以及学术研究人员。无论是处理图像、文本、视频还是多模态数据，streaming 都能轻松应对。其核心亮点在于专为多节点分布式训练设计，在确保数据读取正确性和一致性的前提下，最大化了训练性能。作为 PyTorch IterableDataset 
的无缝替代品，它能帮助团队在任何地点高效开展训练任务，彻底摆脱数据物理位置的束缚，让大规模模型训练变得更加快速、经济且易于扩展。","\u003Cbr \u002F>\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming#gh-light-mode-only\" class=\"only-light\">\n      \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmosaicml_streaming_readme_a1ce0ef06e72.png\" width=\"50%\"\u002F>\n    \u003C\u002Fa>\n    \u003C!--pypi website does not support dark mode and does not understand GitHub tag. Hence, it renders both the images.\n    The below tag is being used to remove the dark mode image on pypi website.-->\n    \u003C!-- SETUPTOOLS_LONG_DESCRIPTION_HIDE_BEGIN -->\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming#gh-dark-mode-only\" class=\"only-dark\">\n      \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmosaicml_streaming_readme_3dd555c3ef87.png\" width=\"50%\"\u002F>\n    \u003C\u002Fa>\n    \u003C!-- SETUPTOOLS_LONG_DESCRIPTION_HIDE_END -->\n\u003C\u002Fp>\n\n\u003Ch2>\u003Cp align=\"center\">Fast, accurate streaming of training data from cloud storage\u003C\u002Fp>\u003C\u002Fh2>\n\n\u003Ch4>\u003Cp align='center'>\n\u003Ca href=\"https:\u002F\u002Fwww.mosaicml.com\">[Website]\u003C\u002Fa>\n- \u003Ca href=\"https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fstreaming\u002Fen\u002Flatest\u002Fgetting_started\u002Fquick_start.html\">[Quick Start]\u003C\u002Fa>\n- \u003Ca href=\"https:\u002F\u002Fstreaming.docs.mosaicml.com\u002F\">[Docs]\n- \u003Ca href=\"https:\u002F\u002Fwww.databricks.com\u002Fcompany\u002Fcareers\u002Fopen-positions?department=Mosaic%20AI&location=all\">[We're Hiring!]\u003C\u002Fa>\n\u003C\u002Fp>\u003C\u002Fh4>\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fmosaicml-streaming\u002F\">\n        \u003Cimg alt=\"PyPi Version\" src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fmosaicml-streaming\">\n    \u003C\u002Fa>\n    \u003Ca 
href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fmosaicml-streaming\u002F\">\n        \u003Cimg alt=\"PyPi Package Version\" src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fmosaicml-streaming\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Factions?query=workflow%3ATest\">\n        \u003Cimg alt=\"Unit test\" src=\"https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Factions\u002Fworkflows\u002Fpytest.yaml\u002Fbadge.svg\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fpepy.tech\u002Fproject\u002Fmosaicml-streaming\u002F\">\n        \u003Cimg alt=\"PyPi Downloads\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmosaicml_streaming_readme_ae1d798aadb7.png\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fstreaming.docs.mosaicml.com\">\n        \u003Cimg alt=\"Documentation\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmosaicml_streaming_readme_13d664e1afd7.png\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fdub.sh\u002Fmcomm\">\n        \u003Cimg alt=\"Chat @ Slack\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fslack-chat-2eb67d.svg?logo=slack\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002FLICENSE\">\n        \u003Cimg alt=\"License\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-green.svg?logo=slack\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgurubase.io\u002Fg\u002Fstreaming\">\n        \u003Cimg alt=\"License\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGurubase-Ask%20Streaming%20Guru-006BFF\">\n    \u003C\u002Fa>\n\u003C\u002Fp>\n\u003Cbr \u002F>\n\n# 👋 Welcome\n\nWe built StreamingDataset to make training on large datasets from cloud storage as fast, cheap, and scalable as possible.\n\nIt’s specially designed for multi-node, distributed training for large 
models—maximizing correctness guarantees, performance, and ease of use. Now, you can efficiently train anywhere, independent of your training data location. Just stream in the data you need, when you need it. To learn more about why we built StreamingDataset, read our [announcement blog](https:\u002F\u002Fwww.mosaicml.com\u002Fblog\u002Fmosaicml-streamingdataset).\n\nStreamingDataset is compatible with any data type, including **images, text, video, and multimodal data**.\n\nWith support for major cloud storage providers ([AWS](https:\u002F\u002Faws.amazon.com\u002Fs3\u002F), [OCI](https:\u002F\u002Fwww.oracle.com\u002Fcloud\u002Fstorage\u002Fobject-storage\u002F), [GCS](https:\u002F\u002Fcloud.google.com\u002Fstorage), [Azure](https:\u002F\u002Fazure.microsoft.com\u002Fen-us\u002Fproducts\u002Fstorage\u002Fblobs), [Databricks](https:\u002F\u002Fdocs.databricks.com\u002Fen\u002Fstorage\u002Findex.html), and any S3 compatible object store such as [Cloudflare R2](https:\u002F\u002Fwww.cloudflare.com\u002Fproducts\u002Fr2\u002F), [Coreweave](https:\u002F\u002Fdocs.coreweave.com\u002Fstorage\u002Fobject-storage), [Backblaze b2](https:\u002F\u002Fwww.backblaze.com\u002Fb2\u002Fcloud-storage.html), etc. ) and designed as a drop-in replacement for your PyTorch [IterableDataset](https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Fdata.html#torch.utils.data.IterableDataset) class, StreamingDataset seamlessly integrates into your existing training workflows.\n\n![The flow of samples from shards in the cloud to devices in your cluster](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmosaicml_streaming_readme_49119deb8df3.gif)\n\n# 🚀 Getting Started\n\n## 💾 Installation\n\nStreaming can be installed with `pip`:\n\n\u003C!--pytest.mark.skip-->\n```bash\npip install mosaicml-streaming\n```\n\n## 🏁 Quick Start\n\n### 1. 
Prepare Your Data\n\nConvert your raw dataset into one of our supported streaming formats:\n\n- MDS (Mosaic Data Shard) format which can encode and decode any Python object\n- CSV \u002F TSV\n- JSONL\n\n\u003C!--pytest.mark.skip-->\n```python\nimport numpy as np\nfrom PIL import Image\nfrom streaming import MDSWriter\n\n# Local or remote directory in which to store the compressed output files\ndata_dir = 'path-to-dataset'\n\n# A dictionary mapping input fields to their data types\ncolumns = {\n    'image': 'jpeg',\n    'class': 'int'\n}\n\n# Shard compression, if any\ncompression = 'zstd'\n\n# Save the samples as shards using MDSWriter\nwith MDSWriter(out=data_dir, columns=columns, compression=compression) as out:\n    for i in range(10000):\n        sample = {\n            'image': Image.fromarray(np.random.randint(0, 256, (32, 32, 3), np.uint8)),\n            'class': np.random.randint(10),\n        }\n        out.write(sample)\n```\n\n### 2. Upload Your Data to Cloud Storage\n\nUpload your streaming dataset to the cloud storage of your choice ([AWS](https:\u002F\u002Faws.amazon.com\u002Fs3\u002F), [OCI](https:\u002F\u002Fwww.oracle.com\u002Fcloud\u002Fstorage\u002Fobject-storage\u002F), or [GCP](https:\u002F\u002Fcloud.google.com\u002Fstorage)). Below is one example of uploading a directory to an S3 bucket using the [AWS CLI](https:\u002F\u002Faws.amazon.com\u002Fcli\u002F).\n\n\u003C!--pytest.mark.skip-->\n```bash\n$ aws s3 cp --recursive path-to-dataset s3:\u002F\u002Fmy-bucket\u002Fpath-to-dataset\n```\n\n### 3. 
Build a StreamingDataset and DataLoader\n\n\u003C!--pytest.mark.skip-->\n```python\nfrom torch.utils.data import DataLoader\nfrom streaming import StreamingDataset\n\n# Remote path where full dataset is persistently stored\nremote = 's3:\u002F\u002Fmy-bucket\u002Fpath-to-dataset'\n\n# Local working dir where dataset is cached during operation\nlocal = '\u002Ftmp\u002Fpath-to-dataset'\n\n# Create streaming dataset\ndataset = StreamingDataset(local=local, remote=remote, shuffle=True)\n\n# Let's see what is in sample #1337...\nsample = dataset[1337]\nimg = sample['image']\ncls = sample['class']\n\n# Create PyTorch DataLoader\ndataloader = DataLoader(dataset)\n```\n\n### 📚 What next?\n\nGetting started guides, examples, API references, and other useful information can be found in our [docs](https:\u002F\u002Fstreaming.docs.mosaicml.com\u002F).\n\nWe have end-to-end tutorials for training a model on:\n\n- [CIFAR-10](https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fstreaming\u002Fen\u002Fstable\u002Fhow_to_guides\u002Fcifar10.html)\n- [FaceSynthetics](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fexamples\u002Ffacesynthetics.ipynb)\n- [SyntheticNLP](https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fstreaming\u002Fen\u002Fstable\u002Fhow_to_guides\u002Fsynthetic_nlp.html)\n\nWe also have starter code for the following popular datasets, which can be found in the `streaming` [directory](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Ftree\u002Fmain\u002Fstreaming):\n\n| Dataset | Task | Read | Write |\n| --- | --- | --- | --- |\n| LAION-400M | Text and image | [Read](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fdiffusion-benchmark\u002Fblob\u002Fmain\u002Fdata.py) | [Write](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Ftree\u002Fmain\u002Fstreaming\u002Fmultimodal\u002Fconvert\u002Flaion\u002Flaion400m) |\n| WebVid | Text and video | 
[Read](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fmultimodal\u002Fwebvid.py) | [Write](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fmultimodal\u002Fconvert\u002Fwebvid.py) |\n| C4 | Text | [Read](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Ftext\u002Fc4.py) | [Write](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Ftext\u002Fconvert\u002Fc4.py) |\n| EnWiki | Text | [Read](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Ftext\u002Fenwiki.py) | [Write](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Ftree\u002Fmain\u002Fstreaming\u002Ftext\u002Fconvert\u002Fenwiki) |\n| Pile | Text | [Read](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Ftext\u002Fpile.py) | [Write](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Ftext\u002Fconvert\u002Fpile.py)\n| ADE20K | Image segmentation | [Read](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fvision\u002Fade20k.py) | [Write](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fvision\u002Fconvert\u002Fade20k.py)\n| CIFAR10 | Image classification | [Read](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fvision\u002Fcifar10.py) | [Write](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fvision\u002Fconvert\u002Fcifar10.py) |\n| COCO | Image classification | [Read](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fvision\u002Fcoco.py) | 
[Write](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fvision\u002Fconvert\u002Fcoco.py) |\n| ImageNet | Image classification | [Read](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fvision\u002Fimagenet.py) | [Write](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fvision\u002Fconvert\u002Fimagenet.py) |\n\n**To start training on these datasets:**\n\n1. Convert raw data into .mds format using the corresponding script from the `convert` directory.\n\nFor example:\n\n\u003C!--pytest.mark.skip-->\n```bash\n$ python -m streaming.multimodal.convert.webvid --in \u003CCSV file> --out \u003CMDS output directory>\n```\n\n2. Import dataset class to start training the model.\n\n\u003C!--pytest.mark.skip-->\n```python\nfrom streaming.multimodal import StreamingInsideWebVid\ndataset = StreamingInsideWebVid(local=local, remote=remote, shuffle=True)\n```\n\n# **🔑** Key Features\n\n---\n\n## Seamless data mixing\n\nEasily experiment with dataset mixtures with [`Stream`](https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fstreaming\u002Fen\u002Flatest\u002Fapi_reference\u002Fgenerated\u002Fstreaming.Stream.html#stream). Dataset sampling can be controlled in relative (proportion) or absolute (repeat or samples terms). 
During streaming, the different datasets are streamed, shuffled, and mixed seamlessly just-in-time.\n\n\u003C!--pytest.mark.skip-->\n```python\n# mix C4, github code, and internal datasets\nstreams = [\n  Stream(remote='s3:\u002F\u002Fdatasets\u002Fc4', proportion=0.4),\n  Stream(remote='s3:\u002F\u002Fdatasets\u002Fgithub', proportion=0.1),\n  Stream(remote='gcs:\u002F\u002Fdatasets\u002Fmy_internal', proportion=0.5),\n]\n\ndataset = StreamingDataset(\n  streams=streams,\n  samples_per_epoch=1e8,\n)\n```\n\n## True Determinism\n\nA unique feature of our solution: samples are in the same order regardless of the number of GPUs, nodes, or CPU workers. This makes it easier to:\n\n- Reproduce and debug training runs and loss spikes\n- Load a checkpoint trained on 64 GPUs and debug on 8 GPUs with reproducibility\n\nSee the figure below — training a model on 1, 8, 16, 32, or 64 GPUs yields the **exact same loss curve** (up to the limitations of floating point math!)\n\n![Plot of elastic determinism](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmosaicml_streaming_readme_fb3a6a275df2.png)\n\n## Instant mid-epoch resumption\n\nIt can be expensive — and annoying — to wait for your job to resume while your dataloader spins after a hardware failure or loss spike. 
Thanks to our deterministic sample ordering, StreamingDataset lets you resume training in seconds, not hours, in the middle of a long training run.\n\nMinimizing resumption latency can save thousands of dollars in egress fees and idle GPU compute time compared to existing solutions.\n\n## High throughput\n\nOur MDS format cuts extraneous work to the bone, resulting in ultra-low sample latency and higher throughput compared to alternatives for workloads bottlenecked by the dataloader.\n\n| Tool | Throughput |\n| --- | --- |\n| StreamingDataset | ~19000 img\u002Fsec |\n| ImageFolder | ~18000 img\u002Fsec |\n| WebDataset | ~16000 img\u002Fsec |\n\n*Results shown are from ImageNet + ResNet-50 training, collected over 5 repetitions after the data is cached after the first epoch.*\n\n## Equal convergence\n\nModel convergence from using StreamingDataset is just as good as using local disk, thanks to our shuffling algorithm.\n\n![Plot of equal convergence](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmosaicml_streaming_readme_50d2644feb6f.png)\n\nBelow are results from ImageNet + ResNet-50 training, collected over 5 repetitions.\n\n| Tool | Top-1 Accuracy |\n| --- | --- |\n| StreamingDataset | 76.51% +\u002F- 0.09 |\n| ImageFolder | 76.57% +\u002F- 0.10 |\n| WebDataset | 76.23% +\u002F- 0.17 |\n\nStreamingDataset shuffles across all samples assigned to a node, whereas alternative solutions only shuffle samples in a smaller pool (within a single process). Shuffling across a wider pool spreads out adjacent samples more. In addition, our shuffling algorithm minimizes dropped samples. We have found both of these shuffling features advantageous for model convergence.\n\n## Random access\n\nAccess the data you need when you need it.\n\nEven if a sample isn’t downloaded yet, you can access `dataset[i]` to get sample `i`. 
The download will kick off immediately and the result will be returned when it’s done - similar to a map-style PyTorch dataset with samples numbered sequentially and accessible in any order.\n\n\u003C!--pytest.mark.skip-->\n```python\ndataset = StreamingDataset(...)\nsample = dataset[19543]\n```\n\n## No divisibility requirements\n\nStreamingDataset will happily iterate over any number of samples. You do not have to forever delete samples so that the dataset is divisible over a baked-in number of devices. Instead, each epoch a different selection of samples are repeated (none dropped) so that each device processes the same count.\n\n\u003C!--pytest.mark.skip-->\n```python\ndataset = StreamingDataset(...)\ndl = DataLoader(dataset, num_workers=...)\n```\n\n## Disk usage limits\n\nDynamically delete least recently used shards in order to keep disk usage under a specified limit. This is enabled by setting the StreamingDataset argument `cache_limit`. See the [shuffling](.\u002Fdocs\u002Fsource\u002Ffundamentals\u002Fshuffling.md) guide for more details.\n\n```\ndataset = StreamingDataset(\n    cache_limit='100gb',\n    ...\n)\n```\n\n# 🏆 Project Showcase\n\nHere are some projects and experiments that used StreamingDataset. Got something to add?  
Email [mcomm@databricks.com](mailto:mcomm@databricks.com) or join our [Community Slack](https:\u002F\u002Fdub.sh\u002Fmcomm).\n\n- [BioMedLM](https:\u002F\u002Fwww.mosaicml.com\u002Fblog\u002Fintroducing-pubmed-gpt): a Domain Specific Large Language Model for BioMedicine by MosaicML and Stanford CRFM\n- [Mosaic Diffusion Models](https:\u002F\u002Fwww.mosaicml.com\u002Fblog\u002Ftraining-stable-diffusion-from-scratch-costs-160k): Training Stable Diffusion from Scratch Costs \u003C$160k\n- [Mosaic LLMs](https:\u002F\u002Fwww.mosaicml.com\u002Fblog\u002Fgpt-3-quality-for-500k): GPT-3 quality for \u003C$500k\n- [Mosaic ResNet](https:\u002F\u002Fwww.mosaicml.com\u002Fblog\u002Fmosaic-resnet): Blazingly Fast Computer Vision Training with the Mosaic ResNet and Composer\n- [Mosaic DeepLabv3](https:\u002F\u002Fwww.mosaicml.com\u002Fblog\u002Fmosaic-image-segmentation): 5x Faster Image Segmentation Training with MosaicML Recipes\n- …more to come! Stay tuned!\n\n# 💫 Contributors\n\nWe welcome any contributions, pull requests, or issues.\n\nTo start contributing, see our [Contributing](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002FCONTRIBUTING.md) page.\n\nP.S.: [We're hiring](https:\u002F\u002Fmosaicml.com\u002Fjobs)!\n\nIf you like this project, give us a star **⭐** and check out our other projects:\n\n- **[Composer](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fcomposer) -** a modern PyTorch library that makes scalable, efficient neural network training easy\n- **[MosaicML Examples](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fexamples)** - reference examples for training ML models quickly and to high accuracy - featuring starter code for GPT \u002F Large Language Models, Stable Diffusion, BERT, ResNet-50, and DeepLabV3\n- **[MosaicML Cloud](https:\u002F\u002Fwww.mosaicml.com\u002Fcloud)** - our training platform built to minimize training costs for LLMs, Diffusion Models, and other large models - featuring multi-cloud 
orchestration, effortless multi-node scaling, and under-the-hood optimizations for speeding up training time\n\n# ✍️ Citation\n\n```\n@misc{mosaicml2022streaming,\n    author = {The Mosaic ML Team},\n    title = {streaming},\n    year = {2022},\n    howpublished = {\\url{\u003Chttps:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002F>}},\n}\n```\n","\u003Cbr \u002F>\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming#gh-light-mode-only\" class=\"only-light\">\n      \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmosaicml_streaming_readme_a1ce0ef06e72.png\" width=\"50%\"\u002F>\n    \u003C\u002Fa>\n    \u003C!--pypi网站不支持暗模式，也不识别GitHub标签。因此，它会同时渲染两张图片。\n    下面的标签用于在pypi网站上隐藏暗模式下的图片。-->\n    \u003C!-- SETUPTOOLS_LONG_DESCRIPTION_HIDE_BEGIN -->\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming#gh-dark-mode-only\" class=\"only-dark\">\n      \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmosaicml_streaming_readme_3dd555c3ef87.png\" width=\"50%\"\u002F>\n    \u003C\u002Fa>\n    \u003C!-- SETUPTOOLS_LONG_DESCRIPTION_HIDE_END -->\n\u003C\u002Fp>\n\n\u003Ch2>\u003Cp align=\"center\">从云存储中快速、准确地流式传输训练数据\u003C\u002Fp>\u003C\u002Fh2>\n\n\u003Ch4>\u003Cp align='center'>\n\u003Ca href=\"https:\u002F\u002Fwww.mosaicml.com\">[官网]\u003C\u002Fa>\n- \u003Ca href=\"https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fstreaming\u002Fen\u002Flatest\u002Fgetting_started\u002Fquick_start.html\">[快速入门]\u003C\u002Fa>\n- \u003Ca href=\"https:\u002F\u002Fstreaming.docs.mosaicml.com\u002F\">[文档]\n- \u003Ca href=\"https:\u002F\u002Fwww.databricks.com\u002Fcompany\u002Fcareers\u002Fopen-positions?department=Mosaic%20AI&location=all\">[招聘！]\u003C\u002Fa>\n\u003C\u002Fp>\u003C\u002Fh4>\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fmosaicml-streaming\u002F\">\n        \u003Cimg alt=\"PyPi版本\" 
src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fmosaicml-streaming\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fmosaicml-streaming\u002F\">\n        \u003Cimg alt=\"PyPi包版本\" src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fmosaicml-streaming\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Factions?query=workflow%3ATest\">\n        \u003Cimg alt=\"单元测试\" src=\"https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Factions\u002Fworkflows\u002Fpytest.yaml\u002Fbadge.svg\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fpepy.tech\u002Fproject\u002Fmosaicml-streaming\u002F\">\n        \u003Cimg alt=\"PyPi下载量\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmosaicml_streaming_readme_ae1d798aadb7.png\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fstreaming.docs.mosaicml.com\">\n        \u003Cimg alt=\"文档\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmosaicml_streaming_readme_13d664e1afd7.png\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fdub.sh\u002Fmcomm\">\n        \u003Cimg alt=\"Slack聊天\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fslack-chat-2eb67d.svg?logo=slack\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002FLICENSE\">\n        \u003Cimg alt=\"许可证\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-green.svg?logo=slack\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgurubase.io\u002Fg\u002Fstreaming\">\n        \u003Cimg alt=\"许可证\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FGurubase-Ask%20Streaming%20Guru-006BFF\">\n    \u003C\u002Fa>\n\u003C\u002Fp>\n\u003Cbr \u002F>\n\n# 👋 
欢迎\n\n我们构建了StreamingDataset，旨在使基于云存储的大型数据集训练尽可能快速、经济且可扩展。\n\n它专为大型模型的多节点分布式训练而设计，最大限度地保证正确性、性能和易用性。现在，您可以不受训练数据位置的限制，在任何地方高效地进行训练。只需在需要时按需流式加载所需数据即可。要了解我们为何构建StreamingDataset，请阅读我们的[公告博客](https:\u002F\u002Fwww.mosaicml.com\u002Fblog\u002Fmosaicml-streamingdataset)。\n\nStreamingDataset兼容任何数据类型，包括**图像、文本、视频和多模态数据**。\n\nStreamingDataset支持主要的云存储提供商（[AWS](https:\u002F\u002Faws.amazon.com\u002Fs3\u002F)、[OCI](https:\u002F\u002Fwww.oracle.com\u002Fcloud\u002Fstorage\u002Fobject-storage\u002F)、[GCS](https:\u002F\u002Fcloud.google.com\u002Fstorage)、[Azure](https:\u002F\u002Fazure.microsoft.com\u002Fen-us\u002Fproducts\u002Fstorage\u002Fblobs)、[Databricks](https:\u002F\u002Fdocs.databricks.com\u002Fen\u002Fstorage\u002Findex.html)，以及任何与S3兼容的对象存储，如[Cloudflare R2](https:\u002F\u002Fwww.cloudflare.com\u002Fproducts\u002Fr2\u002F)、[Coreweave](https:\u002F\u002Fdocs.coreweave.com\u002Fstorage\u002Fobject-storage)、[Backblaze b2](https:\u002F\u002Fwww.backblaze.com\u002Fb2\u002Fcloud-storage.html)等），并且被设计为PyTorch [IterableDataset](https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Fdata.html#torch.utils.data.IterableDataset) 类的即插即用替代品，能够无缝集成到您现有的训练工作流中。\n\n![样本从云端分片流向集群中设备的流程](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmosaicml_streaming_readme_49119deb8df3.gif)\n\n# 🚀 快速开始\n\n## 💾 安装\n\n可以通过`pip`安装Streaming：\n\n\u003C!--pytest.mark.skip-->\n```bash\npip install mosaicml-streaming\n```\n\n## 🏁 快速入门\n\n### 1. 
准备您的数据\n\n将原始数据集转换为我们的支持的流式格式之一：\n\n- MDS（Mosaic Data Shard）格式，可编码和解码任何Python对象\n- CSV \u002F TSV\n- JSONL\n\n\u003C!--pytest.mark.skip-->\n```python\nimport numpy as np\nfrom PIL import Image\nfrom streaming import MDSWriter\n\n# 用于存储压缩后输出文件的本地或远程目录\ndata_dir = '数据集路径'\n\n# 将输入字段映射到其数据类型的字典\ncolumns = {\n    'image': 'jpeg',\n    'class': 'int'\n}\n\n# 如果有分片压缩，则指定压缩算法\ncompression = 'zstd'\n\n# 使用MDSWriter将样本保存为分片\nwith MDSWriter(out=data_dir, columns=columns, compression=compression) as out:\n    for i in range(10000):\n        sample = {\n            'image': Image.fromarray(np.random.randint(0, 256, (32, 32, 3), np.uint8)),\n            'class': np.random.randint(10),\n        }\n        out.write(sample)\n```\n\n### 2. 将您的数据上传到云存储\n\n将您的流式数据集上传到您选择的云存储中（[AWS](https:\u002F\u002Faws.amazon.com\u002Fs3\u002F)、[OCI](https:\u002F\u002Fwww.oracle.com\u002Fcloud\u002Fstorage\u002Fobject-storage\u002F) 或 [GCP](https:\u002F\u002Fcloud.google.com\u002Fstorage)）。以下是使用[AWS CLI](https:\u002F\u002Faws.amazon.com\u002Fcli\u002F)将目录上传到S3存储桶的一个示例。\n\n\u003C!--pytest.mark.skip-->\n```bash\n$ aws s3 cp --recursive 数据集路径 s3:\u002F\u002Fmy-bucket\u002F数据集路径\n```\n\n### 3. 
构建StreamingDataset和DataLoader\n\n\u003C!--pytest.mark.skip-->\n```python\nfrom torch.utils.data import DataLoader\nfrom streaming import StreamingDataset\n\n# 存储完整数据集的远程路径\nremote = 's3:\u002F\u002Fmy-bucket\u002F数据集路径'\n\n# 数据集在运行期间缓存的本地工作目录\nlocal = '\u002Ftmp\u002F数据集路径'\n\n# 创建流式数据集\ndataset = StreamingDataset(local=local, remote=remote, shuffle=True)\n\n# 让我们看看第1337个样本里有什么...\nsample = dataset[1337]\nimg = sample['image']\ncls = sample['class']\n\n# 创建PyTorch DataLoader\ndataloader = DataLoader(dataset)\n```\n\n### 📚 下一步？\n\n入门指南、示例、API 参考以及其他有用的信息，都可以在我们的[文档](https:\u002F\u002Fstreaming.docs.mosaicml.com\u002F)中找到。\n\n我们提供了针对以下数据集训练模型的端到端教程：\n\n- [CIFAR-10](https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fstreaming\u002Fen\u002Fstable\u002Fhow_to_guides\u002Fcifar10.html)\n- [FaceSynthetics](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fexamples\u002Ffacesynthetics.ipynb)\n- [SyntheticNLP](https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fstreaming\u002Fen\u002Fstable\u002Fhow_to_guides\u002Fsynthetic_nlp.html)\n\n此外，我们还为以下常用数据集提供了入门代码，这些代码可以在 `streaming` [目录](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Ftree\u002Fmain\u002Fstreaming)中找到：\n\n| 数据集       | 任务         | 读取           | 写入             |\n| ------------ | ------------ | -------------- | ---------------- |\n| LAION-400M   | 文本和图像   | [读取](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fdiffusion-benchmark\u002Fblob\u002Fmain\u002Fdata.py) | [写入](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Ftree\u002Fmain\u002Fstreaming\u002Fmultimodal\u002Fconvert\u002Flaion\u002Flaion400m) |\n| WebVid       | 文本和视频   | [读取](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fmultimodal\u002Fwebvid.py) | [写入](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fmultimodal\u002Fconvert\u002Fwebvid.py) |\n| C4           | 文本         | 
[读取](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Ftext\u002Fc4.py) | [写入](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Ftext\u002Fconvert\u002Fc4.py) |\n| EnWiki       | 文本         | [读取](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Ftext\u002Fenwiki.py) | [写入](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Ftree\u002Fmain\u002Fstreaming\u002Ftext\u002Fconvert\u002Fenwiki) |\n| Pile         | 文本         | [读取](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Ftext\u002Fpile.py) | [写入](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Ftext\u002Fconvert\u002Fpile.py) |\n| ADE20K       | 图像分割     | [读取](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fvision\u002Fade20k.py) | [写入](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fvision\u002Fconvert\u002Fade20k.py) |\n| CIFAR10      | 图像分类     | [读取](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fvision\u002Fcifar10.py) | [写入](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fvision\u002Fconvert\u002Fcifar10.py) |\n| COCO         | 目标检测     | [读取](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fvision\u002Fcoco.py) | [写入](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fvision\u002Fconvert\u002Fcoco.py) |\n| ImageNet     | 图像分类     | [读取](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fvision\u002Fimagenet.py) | 
[写入](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fstreaming\u002Fvision\u002Fconvert\u002Fimagenet.py) |\n\n**要开始使用这些数据集进行训练：**\n\n1. 使用 `convert` 目录中的相应脚本将原始数据转换为 `.mds` 格式。\n\n例如：\n\n\u003C!--pytest.mark.skip-->\n```bash\n$ python -m streaming.multimodal.convert.webvid --in \u003CCSV文件> --out \u003CMDS输出目录>\n```\n\n2. 导入数据集类以开始训练模型。\n\n\u003C!--pytest.mark.skip-->\n```python\nfrom streaming.multimodal import StreamingInsideWebVid\ndataset = StreamingInsideWebVid(local=local, remote=remote, shuffle=True)\n```\n\n# **🔑** 关键特性\n\n---\n\n## 无缝的数据混合\n\n借助 [`Stream`](https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fstreaming\u002Fen\u002Flatest\u002Fapi_reference\u002Fgenerated\u002Fstreaming.Stream.html#stream)，您可以轻松地试验不同的数据集组合。数据采样既可以通过相对比例控制，也可以通过绝对数量（重复次数或样本数）来控制。在流式处理过程中，不同数据集会实时无缝地被加载、打乱并混合。\n\n\u003C!--pytest.mark.skip-->\n```python\n# 混合 C4、GitHub 代码库和内部数据集\nstreams = [\n  Stream(remote='s3:\u002F\u002Fdatasets\u002Fc4', proportion=0.4),\n  Stream(remote='s3:\u002F\u002Fdatasets\u002Fgithub', proportion=0.1),\n  Stream(remote='gcs:\u002F\u002Fdatasets\u002Fmy_internal', proportion=0.5),\n]\n\ndataset = StreamingDataset(\n  streams=streams,\n  samples_per_epoch=1e8,\n)\n```\n\n## 真正的确定性\n\n我们解决方案的一个独特优势：无论 GPU 数量、节点数或 CPU 工作进程如何，样本的顺序始终保持一致。这使得以下操作更加容易：\n\n- 复现和调试训练过程及损失突增问题\n- 加载在 64 个 GPU 上训练的检查点，并在 8 个 GPU 上进行可复现的调试\n\n请参见下图——无论使用 1、8、16、32 或 64 个 GPU 训练模型，都会得到**完全相同的损失曲线**（受限于浮点运算精度）！\n\n![弹性确定性的图表](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmosaicml_streaming_readme_fb3a6a275df2.png)\n\n## 无需等待，即可在训练中途恢复\n\n当硬件故障或损失突增导致数据加载器卡住时，等待作业恢复可能既昂贵又令人沮丧。得益于我们确定性的样本排序机制，StreamingDataset 允许您在长时间训练过程中随时快速恢复训练，而无需等待数小时。\n\n与现有方案相比，这种低延迟的恢复方式可以节省数千美元的流量费用和空闲 GPU 计算时间。\n\n## 高吞吐量\n\n我们的 MDS 格式最大限度地减少了冗余工作，从而实现了极低的样本延迟和更高的吞吐量，特别适合那些受数据加载瓶颈限制的工作负载。\n\n| 工具            | 吞吐量        |\n| --------------- | ------------- |\n| StreamingDataset | ~19000 张\u002F秒 |\n| ImageFolder      | ~18000 张\u002F秒 |\n| WebDataset       | 
~16000 张\u002F秒 |\n\n*以上结果来自 ImageNet + ResNet-50 的训练，数据在第一个 epoch 缓存后，经过 5 次重复实验得出。*\n\n## 同等的收敛性\n\n由于我们采用了独特的洗牌算法，使用 StreamingDataset 进行模型训练的收敛效果与直接使用本地磁盘无异。\n\n![同等收敛性的图表](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmosaicml_streaming_readme_50d2644feb6f.png)\n\n以下是 ImageNet + ResNet-50 训练的结果，共进行了 5 次重复实验。\n\n| 工具            | Top-1 准确率    |\n| --------------- | --------------- |\n| StreamingDataset | 76.51% ± 0.09  |\n| ImageFolder      | 76.57% ± 0.10  |\n| WebDataset       | 76.23% ± 0.17  |\n\nStreamingDataset 会在分配给某个节点的所有样本中进行洗牌，而其他方案则仅在一个较小的范围内（单个进程内）对样本进行洗牌。更广泛的洗牌范围能够更好地分散相邻样本。此外，我们的洗牌算法还能最大限度地减少样本丢失。我们发现，这两种洗牌特性都有助于提升模型的收敛性能。\n\n## 随机访问\n\n在需要时访问所需的数据。\n\n即使某个样本尚未下载，您仍然可以通过 `dataset[i]` 来获取第 i 个样本。系统会立即启动下载，并在完成后返回结果——这与按顺序编号且可任意访问的 PyTorch map-style 数据集类似。\n\n\u003C!--pytest.mark.skip-->\n```python\ndataset = StreamingDataset(...)\nsample = dataset[19543]\n```\n\n## 对数据集大小无整除要求\n\nStreamingDataset 可以愉快地遍历任意数量的样本。您无需为了让数据集大小能被预设的设备数量整除而永久删除部分样本。相反，在每个 epoch 中，都会重复不同的样本子集（不丢弃任何样本），从而确保每台设备处理的样本数量相同。\n\n\u003C!--pytest.mark.skip-->\n```python\ndataset = StreamingDataset(...)\ndl = DataLoader(dataset, num_workers=...)\n```\n\n## 磁盘使用限制\n\n为了将磁盘使用量控制在指定的上限内，系统会动态删除最近最少使用的分片。此功能可通过设置 StreamingDataset 的 `cache_limit` 参数来启用。更多详细信息请参阅 [洗牌](.\u002Fdocs\u002Fsource\u002Ffundamentals\u002Fshuffling.md) 指南。\n\n```\ndataset = StreamingDataset(\n    cache_limit='100gb',\n    ...\n)\n```\n\n# 🏆 项目展示\n\n以下是一些使用了 StreamingDataset 的项目和实验。您也有想分享的吗？请发送邮件至 [mcomm@databricks.com](mailto:mcomm@databricks.com) 或加入我们的 [Community Slack](https:\u002F\u002Fdub.sh\u002Fmcomm)。\n\n- [BioMedLM](https:\u002F\u002Fwww.mosaicml.com\u002Fblog\u002Fintroducing-pubmed-gpt)：由 MosaicML 和斯坦福 CRFM 共同推出的生物医学领域专用大型语言模型\n- [Mosaic 扩散模型](https:\u002F\u002Fwww.mosaicml.com\u002Fblog\u002Ftraining-stable-diffusion-from-scratch-costs-160k)：从零开始训练 Stable Diffusion 的成本低于 16 万美元\n- [Mosaic LLM](https:\u002F\u002Fwww.mosaicml.com\u002Fblog\u002Fgpt-3-quality-for-500k)：以不到 50 万美元的成本实现 GPT-3 
水平的性能\n- [Mosaic ResNet](https:\u002F\u002Fwww.mosaicml.com\u002Fblog\u002Fmosaic-resnet)：借助 Mosaic ResNet 和 Composer，实现超快速的计算机视觉训练\n- [Mosaic DeepLabv3](https:\u002F\u002Fwww.mosaicml.com\u002Fblog\u002Fmosaic-image-segmentation)：使用 MosaicML Recipes，图像分割训练速度提升 5 倍\n- …更多内容即将发布！敬请期待！\n\n# 💫 贡献者\n\n我们欢迎任何形式的贡献、拉取请求或问题反馈。\n\n如需开始贡献，请参阅我们的 [贡献指南](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002FCONTRIBUTING.md)。\n\n附注：[我们正在招聘](https:\u002F\u002Fmosaicml.com\u002Fjobs)！\n\n如果您喜欢这个项目，请为我们点亮星标 **⭐**，并查看我们的其他项目：\n\n- **[Composer](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fcomposer)** —— 一个现代化的 PyTorch 库，让大规模、高效的神经网络训练变得简单易行\n- **[MosaicML 示例](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fexamples)** —— 快速且高精度训练机器学习模型的参考示例，包含 GPT \u002F 大型语言模型、Stable Diffusion、BERT、ResNet-50 和 DeepLabV3 的入门代码\n- **[MosaicML Cloud](https:\u002F\u002Fwww.mosaicml.com\u002Fcloud)** —— 我们的训练平台专为降低大型语言模型、扩散模型及其他大型模型的训练成本而设计，提供多云编排、轻松的多节点扩展以及底层优化，以加速训练过程\n\n# ✍️ 引用\n\n```\n@misc{mosaicml2022streaming,\n    author = {The Mosaic ML Team},\n    title = {streaming},\n    year = {2022},\n    howpublished = {\\url{\u003Chttps:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002F>}},\n}\n```","# MosaicML Streaming 快速上手指南\n\nMosaicML Streaming 是一个专为大规模分布式训练设计的开源工具，支持从云存储（如 AWS S3、阿里云 OSS 等兼容接口）高速流式加载训练数据。它兼容 PyTorch `IterableDataset`，支持图像、文本、视频及多模态数据，能够显著降低数据加载延迟并节省存储成本。\n\n## 环境准备\n\n*   **系统要求**：Linux 或 macOS（推荐 Linux 环境用于生产训练）。\n*   **Python 版本**：Python 3.8 及以上。\n*   **核心依赖**：\n    *   PyTorch\n    *   Pillow (处理图像数据)\n    *   NumPy\n*   **云存储访问**：若使用云端数据，需配置相应的云厂商 CLI 工具（如 `awscli`）或确保运行环境具有对应的访问凭证（Access Key\u002FSecret Key）。\n\n> **提示**：国内开发者若访问 PyPI 较慢，建议使用清华或阿里镜像源进行安装。\n\n## 安装步骤\n\n使用 pip 直接安装官方发布包：\n\n```bash\npip install mosaicml-streaming\n```\n\n**推荐使用国内镜像源加速安装：**\n\n```bash\npip install mosaicml-streaming -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n如需安装额外依赖（如特定编解码器），可参考官方文档追加 extras，基础使用上述命令即可。\n\n## 
基本使用\n\n使用流程分为三步：数据转换、上传云端（可选）、构建 DataLoader。\n\n### 1. 准备数据 (转换为 MDS 格式)\n\n首先将原始数据转换为 Streaming 支持的 **MDS (Mosaic Data Shard)** 格式。以下示例演示如何生成包含图像和标签的随机数据：\n\n```python\nimport numpy as np\nfrom PIL import Image\nfrom streaming import MDSWriter\n\n# 本地或远程目录，用于存储压缩后的输出文件\ndata_dir = 'path-to-dataset'\n\n# 定义字段及其数据类型\ncolumns = {\n    'image': 'jpeg',\n    'class': 'int'\n}\n\n# 压缩算法 (可选，推荐 zstd)\ncompression = 'zstd'\n\n# 使用 MDSWriter 写入分片数据\nwith MDSWriter(out=data_dir, columns=columns, compression=compression) as out:\n    for i in range(10000):\n        sample = {\n            'image': Image.fromarray(np.random.randint(0, 256, (32, 32, 3), np.uint8)),\n            'class': np.random.randint(10),\n        }\n        out.write(sample)\n```\n\n### 2. 上传数据到云存储 (可选)\n\n如果需要在集群间共享或直接从云端训练，将生成的数据集目录上传至对象存储（如 AWS S3、阿里云 OSS、腾讯云 COS 等）。\n\n以 AWS S3 为例：\n\n```bash\naws s3 cp --recursive path-to-dataset s3:\u002F\u002Fmy-bucket\u002Fpath-to-dataset\n```\n\n> **注意**：国内云厂商（阿里云\u002F腾讯云）通常提供 S3 兼容接口，只需配置对应的 endpoint 即可使用类似的上传逻辑。\n\n### 3. 
构建 StreamingDataset 和 DataLoader\n\n在训练代码中，直接使用 `StreamingDataset` 替换原有的 Dataset 类。它会自动处理数据的下载、缓存和流式读取。\n\n```python\nfrom torch.utils.data import DataLoader\nfrom streaming import StreamingDataset\n\n# 云端数据存储的完整路径 (远程)\nremote = 's3:\u002F\u002Fmy-bucket\u002Fpath-to-dataset'\n\n# 本地缓存目录 (数据会被流式下载到此地)\nlocal = '\u002Ftmp\u002Fpath-to-dataset'\n\n# 创建流式数据集\n# shuffle=True 开启打乱，支持多卡确定性打乱\ndataset = StreamingDataset(local=local, remote=remote, shuffle=True)\n\n# 验证读取单个样本\nsample = dataset[1337]\nimg = sample['image']\ncls = sample['class']\n\n# 创建标准的 PyTorch DataLoader\ndataloader = DataLoader(dataset, batch_size=32, num_workers=4)\n```\n\n现在，`dataloader` 即可直接用于您的 PyTorch 训练循环中。该方案支持断点续训、多节点确定性采样以及无缝混合多个数据集。","某自动驾驶团队正在利用多节点 GPU 集群，基于存储在 AWS S3 上的 PB 级高清路况视频数据训练大型视觉模型。\n\n### 没有 streaming 时\n- **启动延迟极高**：训练前需花费数小时将 TB 级数据集从云端下载到本地高速存储，严重拖慢实验迭代节奏。\n- **存储成本高昂**：每个计算节点都必须配置昂贵的本地大容量硬盘来缓存全量数据，造成巨大的资源浪费。\n- **数据一致性风险**：在分布式环境下，不同节点间的数据同步容易出现版本不一致或损坏，导致模型训练收敛困难。\n- **扩展性受限**：增加训练节点时，必须重复繁琐的数据拷贝和校验流程，难以实现弹性伸缩。\n\n### 使用 streaming 后\n- **秒级启动训练**：streaming 支持直接从 S3 流式读取数据，无需预先下载，任务提交后即刻开始计算。\n- **大幅降低存储开销**：各节点仅需按需缓存当前批次数据，不再依赖昂贵的本地全量存储，成本降低 90% 以上。\n- **确保训练准确性**：内置的强一致性保证机制自动处理分片与去重，确保多节点读取的数据精确无误且无重复。\n- **无缝弹性扩展**：新增节点可直接加入训练队列并即时获取数据流，轻松实现大规模分布式训练的动态扩容。\n\nstreaming 通过消除数据搬运瓶颈，让大规模模型训练真正实现了“数据在哪，算力就在哪”的高效敏捷模式。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmosaicml_streaming_fb3a6a27.png","mosaicml","Databricks Mosaic Research","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fmosaicml_b4287c5e.png","We remove the barriers to state-of-the-art generative AI model development and make data + AI available to 
all",null,"DbrxMosaicAI","https:\u002F\u002Fwww.databricks.com\u002Fresearch\u002Fmosaic","https:\u002F\u002Fgithub.com\u002Fmosaicml",[81,85,89],{"name":82,"color":83,"percentage":84},"Python","#3572A5",99.7,{"name":86,"color":87,"percentage":88},"Shell","#89e051",0.2,{"name":90,"color":91,"percentage":92},"Makefile","#427819",0.1,1488,188,"2026-04-08T18:40:49","Apache-2.0","未说明",{"notes":99,"python":100,"dependencies":101},"该工具是一个数据流式加载库（StreamingDataset），旨在替代 PyTorch 的 IterableDataset，用于从云存储（如 AWS S3, GCS, Azure, OCI 等）高效流式传输训练数据。它本身不强制要求特定的 GPU 型号或显存大小，也不直接依赖 transformers 或 accelerate 等模型库，而是作为数据管道与任何深度学习框架配合使用。安装方式为 pip install mosaicml-streaming。支持多种数据格式（图像、文本、视频、多模态）及压缩格式（如 zstd）。其核心优势在于分布式训练中的确定性采样、秒级断点续训以及高吞吐量。","3.8+",[102,103,104],"torch","Pillow","numpy",[14,16],[107,108,109,110,111,64],"dataset","deep-learning","machine-learning","neural-network","pytorch","2026-03-27T02:49:30.150509","2026-04-10T20:34:10.784381",[115,120,125,130],{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},28330,"为什么在多个 epoch 之间 GPU 利用率会下降？","这是由于 epoch 之间的挂起问题导致的。维护者已在主分支（main）中修复了该问题，内部测试显示整体吞吐量提升了 10-40%。您可以直接从 main 分支安装 Streaming 来尝试修复：`pip install git+https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming.git`。此外，如果您使用的是 TorchDistributor 启动器，可以尝试切换到 Composer 启动器，这也能解决低利用率问题。对于使用 FUSE 挂载的情况，建议增加 DataLoader 的 prefetch factor 和 Dataset 的 predownload 参数，或者将 remote 设置为 FUSE 挂载路径，local 设置为本地磁盘路径。","https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fissues\u002F643",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},28331,"无法从自定义 S3 端点加载数据集怎么办？","如果您使用自定义 S3 端点 URL 并遇到“对象未找到”的错误，可能是因为旧版本的 Streaming 不支持导出 `S3_ENDPOINT_URL` 环境变量。请将 mosaicml-streaming 升级到 v0.5.0 或更高版本，该版本已支持此功能。升级后，确保在运行前正确导出环境变量，例如：`export S3_ENDPOINT_URL=\"MY_ENDPOINT_URL\"`。注意，Streaming 数据集会自动创建不存在的目录，直接使用 boto3 下载文件可能会报错，请使用 Streaming 
提供的接口进行加载。","https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fissues\u002F310",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},28332,"如何使用 dataframe_to_mds 保存 Spark Vector 或 ndarray 类型数据？","`dataframe_to_mds` 目前不直接支持 Spark 的 `Vector` 类型或 `ndarray` 类型。解决方案是先将数据转换为一维数组（1d array），然后在训练过程中进行预处理还原。例如，在 Spark 中将 Vector 列展平为普通数组列后再调用转换函数。不要尝试在 `columns` 参数中指定 `ndarray` 类型，这会引发 ValueError。","https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fissues\u002F736",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},28333,"如何将 MDS 数据集存储在 Hugging Face Datasets 上并进行流式加载？","Mosaic Streaming 现已支持将 MDS 数据集存储在 Hugging Face Hub 上。您可以使用 `hf:\u002F\u002F` 协议作为 remote 路径来流式加载数据。示例代码如下：\n```python\nfrom streaming import StreamingDataset\n\n# 创建流式数据集\ndataset = StreamingDataset(remote=\"hf:\u002F\u002Fdatasets\u002Forionweller\u002Fwikipedia_mds\u002F\", shuffle=False, split=\"train\")\n```\n请确保已配置好 Hugging Face 的认证凭证（如需访问私有数据集）。详细配置指南可参考官方文档中关于 Hugging Face Datasets 的部分。","https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fissues\u002F633",[136,141,146,151,156,161,166,171,176,181,186,191,196,201,206,211,216,221,226,231],{"id":137,"version":138,"summary_zh":139,"released_at":140},189225,"v0.13.0","## 变更内容\n* 由 @dakinggg 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F907 中修复的输入问题\n* 由 @erayinanc 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F904 中实现的 Azure 服务替代认证方式\n* 允许 Spark 的 BinaryType 映射到二进制编码的 MDS 类型（如 PNG），前提是用户明确指定，由 @srowen 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F913 中完成\n* 【Bug 修复】HfUploader - 使用正确的本地文件路径。由 @Abhinay1997 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F874 中完成\n\n## 新贡献者\n* @erayinanc 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F904 中完成了首次贡献\n* @Abhinay1997 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F874 中完成了首次贡献\n\n**完整变更日志**: 
https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fcompare\u002Fv0.12.0...v0.13.0","2025-07-15T22:04:37",{"id":142,"version":143,"summary_zh":144,"released_at":145},189226,"v0.12.0","## 新增功能\n### 1. 升级至 Python 3.12（https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F894）\n我们新增了对 Python 3.12 的支持，并弃用了对 Python 3.9 的支持。\n\n## 变更内容\n* 主分支版本号从 0.11.0.dev0 升级至 0.12.0.dev0，由 @es94129 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F862 中完成\n* 【修复 Issue 415】对于从字节流构建的 JPEG 图像，回退到内存编码方式，由 @XiaohanZhangCMU 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F878 中完成\n* 将 Jpeg 数组添加到 MDS 编码列表中，由 @XiaohanZhangCMU 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F881 中完成\n* 增加对图像列表的支持，由 @ethantang-db 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F882 中完成\n* 将 fastapi 版本从 0.115.6 升级至 0.115.12，由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F886 中完成\n* 将 pydantic 版本从 2.10.5 升级至 2.10.6，由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F866 中完成\n* 更新 setuptools 的依赖范围，由 \u003C76.0.0 调整为 \u003C79.0.0，由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F891 中完成\n* 将 databricks-sdk 版本从 0.29.0 升级至 0.49.0，由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F890 中完成\n* 将 yamllint 版本从 1.35.1 升级至 1.37.0，由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F888 中完成\n* 将 pytest 版本从 8.3.4 升级至 8.3.5，由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F887 中完成\n* 更新 huggingface-hub 的依赖范围，由 \u003C0.28,>=0.23.4 调整为 >=0.23.4,\u003C0.30，由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F889 中完成\n* 修复 Stream.get_shards() 中的悬空文件句柄问题，由 @aadyotb 在 
https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F892 中完成\n* 升级至 Python 3.12，由 @KuuCi 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F894 中完成\n\n## 新贡献者\n* @aadyotb 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F892 中完成了首次贡献\n* @KuuCi 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F894 中完成了首次贡献\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fcompare\u002Fv0.11.0...v0.12.0","2025-04-03T22:04:04",{"id":147,"version":148,"summary_zh":149,"released_at":150},189227,"v0.11.0","# 🚀 Streaming v0.11.0\n\nStreaming `v0.11.0` 已发布！可通过 `pip` 安装：\n\n```bash\npip install --upgrade mosaicml-streaming==0.11.0\n```\n\n## 新特性\n### 1. 引入可自定义组件的注册表（https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F858）\n`StreamingDataset` 现在可以通过注册表与自定义的 `Stream` 实现一起使用。有关示例用法，请参阅[文档页面](https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fstreaming\u002Fen\u002Fstable\u002Fdataset_configuration\u002Fmixing_data_sources.html)。\n\n## 🐛 Bug 修复\n* 修复 `simulation` 模块的导入路径问题（@srstevenson）\n* 修复 `S3Downloader` 的序列化问题（@wouterzwerink）\n\n## 变更内容\n* @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F849 中将 numpy 版本限制在 2.2.0 以下\n* @srstevenson 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F838 中修复了 `simulation` 模块的导入路径\n* @wouterzwerink 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F847 中防止 `_s3_client` 被序列化\n* @srstevenson 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F843 中修正了几处拼写错误\n* @srstevenson 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F841 中将损坏的用户指南链接替换为快速入门链接\n* @srstevenson 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F842 中移除了快速入门示例中的未使用导入\n* @srstevenson 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F839 中修改了模拟器 UI 的帮助文本，使其指向目录\n* @dependabot 
在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F845 中将 fastapi 从 0.115.5 升级到 0.115.6\n* @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F846 中将 pydantic 从 2.10.2 升级到 2.10.3\n* @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F850 中将 mosaicml-cli 的版本要求从 `\u003C0.7,>=0.5.25` 更新为 `>=0.5.25,\u003C0.8`\n* @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F855 中将 uvicorn 从 0.32.1 升级到 0.34.0\n* @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F856 中将 pydantic 从 2.10.3 升级到 2.10.4\n* @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F859 中将 huggingface-hub 的版本要求从 `\u003C0.27,>=0.23.4` 更新为 `>=0.23.4,\u003C0.28`\n* @srstevenson 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F840 中为 `SimulationDataset` 设置了 `epoch_seed_change` 属性\n* @es94129 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F858 中在创建 `StreamingDataset` 的 `Stream` 时使用了注册表\n* @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F861 中将 pydantic 从 2.10.4 升级到 2.10.5\n\n## 新贡献者\n* @srstevenson 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F838 中做出了首次贡献\n* @wouterzwerink 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F847 中做出了首次贡献\n* @es94129 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F858 中做出了首次贡献\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fcompare\u002Fv0.10.0...v0.11.0","2025-01-15T18:04:24",{"id":152,"version":153,"summary_zh":154,"released_at":155},189228,"v0.10.0","# 🚀 Streaming v0.10.0\n\nStreaming `v0.10.0` 已发布！可通过 `pip` 安装：\n\n```bash\npip install --upgrade mosaicml-streaming==0.10.0\n```\n\n## 改进\n\n### 1. 
可复用的云下载客户端（https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F817）\n* Streaming 现在在下载分片文件时会复用云下载客户端，而不是为每次下载都创建一个新的客户端。\n* 这样可以避免因打开的套接字过多或云认证请求过于频繁而导致的运行失败。\n\n### 2：`py1b` 洗牌算法已弃用（https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F837）\n* `py1b` 洗牌算法现已弃用。请改用改进后的 `py1e`（默认）或 `py1br` 洗牌算法。\n\n## 变更内容\n* 更新常见问题解答，指出包装功能不受支持，由 @milocress 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F822 中完成。\n* 由 @ethantang-db 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F817 中重构了下载模块，使其具备可复用的客户端。\n* 将 pytest-cov 的依赖范围从 `\u003C6,>=4` 更新为 `>=4,\u003C7`，由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F821 中完成。\n* 在批处理方法中对未使用的流统一抛出错误，由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F826 中完成。\n* 将 setuptools 的依赖范围从 `\u003C68.0.0` 更新为 `\u003C76.0.0`，由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F825 中完成。\n* 修复 f-string 格式化问题，由 @XiaohanZhangCMU 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F829 中完成。\n* 将 fastapi 从 0.115.4 升级到 0.115.5，由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F830 中完成。\n* 将 uvicorn 从 0.32.0 升级到 0.32.1，由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F834 中完成。\n* 将 pydantic 从 2.9.2 升级到 2.10.1，由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F833 中完成。\n* 将 pytest 从 8.3.3 升级到 8.3.4，由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F836 中完成。\n* 将 pydantic 从 2.10.1 升级到 2.10.2，由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F835 中完成。\n* 版本号更新至 0.11.0.dev0，并包含弃用内容，由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F837 中完成。\n\n## 新贡献者\n* @ethantang-db 在 
https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F817 中完成了首次贡献。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fcompare\u002Fv0.9.1...v0.10.0","2024-12-03T21:14:59",{"id":157,"version":158,"summary_zh":159,"released_at":160},189229,"v0.9.1","# 🚀 Streaming v0.9.1\n\nStreaming `v0.9.1` 已发布！可通过 `pip` 安装：\n\n```bash\npip install --upgrade mosaicml-streaming==0.9.1\n```\n\n## 新增功能\n### 1. Streaming 已加入 Gurubase（https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F805）\n* Streaming 现在配备了一位 AI 助手，可帮助用户解答问题！试用 Streaming Guru，它利用本仓库及 [文档](https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fstreaming\u002Fen\u002Fstable\u002F) 中的数据，借助大语言模型来回答用户的问题。\n\n## 改进\n### 1. 权限问题修复（https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F813）\n* 解决了在共享计算环境中创建共享内存文件时出现的读取权限问题。我们添加了重试机制，以便在遇到权限错误时能够重新创建共享内存文件。\n* 共享内存文件前缀完整性：在创建共享内存文件时，现在会同时验证 LOCALS 和 FILELOCKS，以确保不会与现有文件发生重叠，并且使用一致的前缀标识符进行匹配。\n* 处理非正常程序退出：增强了清理流程，以应对因非正常程序退出而导致部分共享内存文件未被清除的情况。现在会检查 SHM_TO_CLEAN 中的所有文件，以防止重复清理。\n这些改进提升了共享环境下的共享内存管理和可靠性。\n\n### 2. 
修复分片驱逐卡死问题（https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F795）\n* 修改了对最冷分片的查找方式，避免遍历远程分片，仅将本地分片作为可能的驱逐候选。\n\n## 变更内容\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F785 中将 pydantic 从 2.9.1 升级至 2.9.2\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F786 中将 fastapi 从 0.114.2 升级至 0.115.0\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F793 中将 uvicorn 从 0.30.6 升级至 0.31.0\n* 由 @LukaszSztukiewicz 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F794 中修复了 README.md 中的失效链接\n* 由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F795 中修复了分片驱逐问题\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F787 中将 huggingface-hub 的依赖范围由 \u003C0.25,>=0.23.4 更新为 >=0.23.4,\u003C0.26\n* 由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F798 中修正了文档中 dataset.size() 的拼写错误\n* 由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F799 中将 v0.7.0 版本中的默认设置相关警告调整为信息提示\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F803 中将 uvicorn 从 0.31.0 升级至 0.31.1\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F804 中将 fastapi 从 0.115.0 升级至 0.115.2\n* 由 @kursataktas 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F805 中引入了 Gurubase.io 上的 Streaming Guru\n* 由 @XiaohanZhangCMU 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F806 中添加了关于共享前缀的更友好的错误提示\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaic","2024-11-04T20:50:54",{"id":162,"version":163,"summary_zh":164,"released_at":165},189230,"v0.9.0","# 🚀 Streaming v0.9.0\n\nStreaming `v0.9.0` 已发布！可通过 `pip` 安装：\n\n```bash\npip install --upgrade mosaicml-streaming==0.9.0\n```\n\n## 新特性\n### 1. 
改进了 ndarray 和 json 类型的兼容性 (#776, #777)\n现在，如果将列的类型指定为 `json`，包含 map 类型的列可以成功转换为 MDS 文件中的 JSON 格式，并且 JSON 编码器能够处理 ndarray 类型。\n\n## 变更内容\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F768 中将 fastapi 从 0.112.1 升级到 0.112.2\n* 由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F770 中升级了 CI 测试\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F772 中将 jupyter 从 1.0.0 升级到 1.1.1\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F779 中将 fastapi 从 0.112.2 升级到 0.114.0\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F778 中将 pydantic 从 2.8.2 升级到 2.9.1\n* 由 @srowen 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F777 中允许 JSON 编码器处理 ndarray 类型\n* 由 @srowen 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F776 中添加了与 JSON 兼容的 MapType\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F783 中将 fastapi 从 0.114.0 升级到 0.114.2\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F784 中将 datasets 的依赖版本从 `\u003C3,>=2.4.0` 更新为 `>=2.4.0,\u003C4`\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F782 中将 pytest 从 8.3.2 升级到 8.3.3\n* 由 @dakinggg 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F790 中将主分支版本号更新为 0.10.0.dev0\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fcompare\u002Fv0.8.1...v0.9.0","2024-09-25T02:34:58",{"id":167,"version":168,"summary_zh":169,"released_at":170},189231,"v0.8.1","# 🚀 Streaming v0.8.1\n\nStreaming `v0.8.1` 已发布！可通过 `pip` 安装：\n\n```bash\npip install --upgrade mosaicml-streaming==0.8.1\n```\n\n## 🔧 改进\n**现已解决数据加载器在 epoch 之间卡住的问题！** 我们观察到，对于一些需要训练多个 epoch 的任务，训练时间最多可提升 40%。如果这个问题曾影响您的运行并已得到修复，请告知我们！\n\n* @XiaohanZhangCMU 在 
https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F741 中修复了数据加载器在 epoch 结束时卡住的问题。\n* @srowen 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F748 中为 `dataframe_to_mds` 添加了默认压缩，并增加了对本地路径的警告。\n* @srowen 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F754 中实现了在调用 `write()` 后检查 `event.is_set()` 并抛出异常的功能。\n\n## 🐛 错误修复\n* @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F750 中确保了当 `shuffle=False` 时，不同 epoch 之间的样本顺序保持确定性。\n\n## 变更内容\n* @eitanturok 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F739 中使 GitHub Actions 中的 Pytest 日志以彩色显示。\n* @jaehwana2z 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F733 中修复了 `download_from_azure` 函数中 Azure 容器名和 Blob 名称的错误。\n* @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F743 中将 Uvicorn 从 0.30.3 升级至 0.30.5。\n* @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F729 中将 `huggingface-hub` 的依赖版本从 `\u003C0.24,>=0.23.4` 更新为 `>=0.23.4,\u003C0.25`。\n* @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F744 中将 FastAPI 从 0.111.1 升级至 0.112.0。\n* @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F745 中将 `ci-testing` 升级至 v0.1.0。\n* @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F746 中针对 Sphinx 废弃配置操作的情况对 `conf.py` 进行了修补。\n* @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F747 中将 `ci-testing` 再次升级至 v0.1.2。\n* @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F752 中使类型注解符合 PEP 585 标准。\n* @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F756 中应用 Ruff 规则移除了未使用的导入。\n* @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F764 中修复了 NumPy 2.1.0 的代码风格问题。\n* @dependabot 在 
https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F760 中将 FastAPI 从 0.112.0 升级至 0.112.1。\n* @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F762 中将 Uvicorn 从 0.30.5 升级至 0.30.6。\n* @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F766 中发布了版本 0.8.1。\n\n## 新贡献者\n* @eitanturok 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F739 中完成了首次贡献。\n* @jaehwana2z 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F733 中完成了首次贡献。\n* @srowen 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F748 中完成了首次贡献。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fcompare\u002Fv0.8.0...v0.8.1","2024-08-23T20:26:04",{"id":172,"version":173,"summary_zh":174,"released_at":175},189232,"v0.8.0","## ✨ 新增功能 ✨ \n\n### **1. HF 文件系统流式传输 (#711)**\n\nStreaming 现在支持直接从 HF 文件系统流式读取数据！这为数据托管新增了一个常用的后端选项。\n\n## 变更内容\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F670 中将 fastapi 从 0.110.2 升级到 0.111.0\n* 修复：在将 Spark DataFrame 转换为保存在 dbfs:\u002FVolumes 上的 MDS 后出现零字节文件的问题，由 @XiaohanZhangCMU 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F668 中完成\n* 确保分片大小不超过 4GB，由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F672 中实现\n* 为编写不当的数据集提供有用的 `py1e` 错误提示，由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F673 中完成\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F680 中将 pytest 从 8.2.0 升级到 8.2.1\n* 更新平台引用，由 @aspfohl 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F675 中完成\n* 更新 CODEOWNERS 文件，由 @karan6181 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F681 中完成\n* 修复文档中 `Stream` 对象的 `batch_size` 拼写错误，由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F682 
中完成\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F679 中将 databricks-sdk 从 0.27.0 升级到 0.27.1\n* 改进仅指定 `remote` 时本地临时目录的错误提示，由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F683 中完成\n* 修复 `World` 对象中 `replication` 的节点计算问题，由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F685 中完成\n* 修改序列并行性的警告条件，由 @XiaohanZhangCMU 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F688 中完成\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F692 中将 pydantic 从 2.7.1 升级到 2.7.2\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F691 中将 uvicorn 从 0.29.0 升级到 0.30.1\n* 确保 epoch_size 为整数，由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F693 中完成\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F687 中将 databricks-sdk 从 0.27.1 升级到 0.28.0\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F697 中将 pytest 从 8.2.1 升级到 8.2.2\n* 修复：扩展 Writer 输出目录的用户路径。由 @huxuan 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F694 中完成\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F696 中将 pydantic 从 2.7.2 升级到 2.7.3\n* 修复标量或空 NumPy 数组编码的边缘情况，由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F702 中完成\n* 将 `Spanner` 对象中的异常类型由 `ValueError` 改为 `IndexError`，由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F701 中完成\n* 修复与 NumPy 2 兼容时的代码风格问题，由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F705 中完成\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F704 中将 pydantic 从 2.7.3 升级到 2.7.4\n* 实现从 epoch 结束处正确恢复的功能，由 @snarayan21 在 
https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F700 中完成\n* 修复分区中的 `drop_first` 检查逻辑，以考虑 `world_size` 是否能整除","2024-07-30T17:00:39",{"id":177,"version":178,"summary_zh":179,"released_at":180},189233,"v0.7.6","# 🚀 Streaming v0.7.6\n\nStreaming `v0.7.6` 已发布！可通过 `pip` 安装：\n\n```bash\npip install --upgrade mosaicml-streaming==0.7.6\n```\n\n## :gem: 新特性\n\n### 1. `device_per_stream` 批次化方法\n用户现在可以构建批次，使得每个设备仅看到来自单个流的样本。这在不同数据源包含不同大小的样本或张量，但模型仍需在每个优化器步骤中同时处理这些不同数据源样本的情况下非常有用。\n* 由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F661 中添加了 `device_per_stream` 批次化方法。\n\n### 2. 为 Spark 数据框添加 `ndarray` 类型。\n在将 Spark 数据框转换为 MDS 时，支持解析 Spark 的 ArrayType（包括 ShortType、LongType、IntegerType、FloatType 和 DoubleType）。\n* 由 @XiaohanZhangCMU 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F623 中添加了 ndarray 类型。\n\n### 3. 对阿里盘存储的支持\n新增对阿里云存储服务——阿里盘的支持。\n* 由 @PeterDing 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F651 中添加了对阿里盘存储后端的支持。\n\n## 变更内容\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F660 中将 fastapi 从 0.110.0 升级至 0.110.2。\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F653 中将 pydantic 从 2.6.4 升级至 2.7.0。\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F666 中将 pydantic 从 2.7.0 升级至 2.7.1。\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F664 中将 pytest 从 8.1.1 升级至 8.2.0。\n* 由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F667 中将 databricks-sdk 从 0.23.0 升级至 0.27.0。\n* 由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F669 中将版本号更新至 v0.7.6。\n\n## 新贡献者\n* @PeterDing 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F651 
中完成了首次贡献。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fcompare\u002Fv0.7.5...v0.7.6","2024-05-10T22:22:21",{"id":182,"version":183,"summary_zh":184,"released_at":185},189234,"v0.7.5","# 🚀 Streaming v0.7.5\n\nStreaming `v0.7.5` 已发布！可通过 `pip` 安装：\n\n```bash\npip install --upgrade mosaicml-streaming==0.7.5\n```\n\n## :gem: 新特性\n\n### 1. 张量\u002F序列并行支持\n通过 `replication` 参数，可以轻松地在多个进程间共享数据样本，从而实现序列并行或张量并行。\n* @knighton 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F597 中实现了跨设备复制样本的功能（支持 SP \u002F TP）。\n* @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F607 中扩展了复制测试并完善了文档。\n* @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F619 中修复了在使用 SP\u002FTP 时 Streaming 正确计算唯一样本数量的问题。\n\n### 2. 全面升级的 Streaming 文档\n全新且优化的 Streaming 文档可在此处查看：[https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fstreaming\u002Fen\u002Fstable\u002F#](https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fstreaming\u002Fen\u002Fstable\u002F#) —— 欢迎提交反馈和问题。\n* @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F636 中对 Streaming 文档进行了重大改版。\n\n### 3. `batch_size` 现在是 StreamingDataset 的必填参数\n由于我们发现许多用户未设置 `batch_size` 参数，导致错误和性能下降，因此现将其设为迭代数据集时的必填项。\n* @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F624 中明确指出必须设置 `batch_size`，别无选择。\n\n### 4. 
支持 Python 3.11，弃用 Python 3.8\n* @karan6181 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F586 中增加了对 Python 3.11 的支持，并正式弃用了 Python 3.8。\n\n## 🐛 Bug 修复\n* [简单拼写修正] 修复 f-string 错误，由 @bigning 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F596 中完成。\n* 将分区比较逻辑修改为包含等于条件，由 @JAEarly 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F587 中完成。\n* 在初始化 SharedMemory 大小时使用整型，由 @bchiang2 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F604 中修复。\n* COCO 数据集修复 — 避免使用 `allow_unsafe_types=True`，由 @snarayan21 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F647 中完成。\n\n## 🔧 改进\n* 允许写入者覆盖现有数据，由 @JAEarly 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F594 中实现。\n* 更新招聘信息链接，由 @milocress 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F611 中完成。\n* 更新许可证信息，由 @b-chu 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F568 中完成。\n* 更新 S3 兼容对象存储的文档，由 @AIproj 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F592 中完成。\n* 使 yamllint 格式与 Composer 保持一致，由 @b-chu 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F583 中完成。\n* 将代码检查工作流切换至 ci-testing 仓库，由 @b-chu 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F616 中完成。\n\n## 变更内容\n* 将 uvicorn 从 0.26.0 升级至 0.27.1，由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F599 中完成。\n* 将 pytest-split 从 0.8.1 升级至 0.8.2，由 @dependabot 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F581 中完成。\n* 将 ruff 更新至 0.2.2，由 @Skylion007 在 https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F608 中完成。\n* 将 fastapi 从 0.109.0 升级至 0.1","2024-04-09T00:35:40",{"id":187,"version":188,"summary_zh":189,"released_at":190},189235,"v0.7.4","# 🚀 Streaming v0.7.4\r\n\r\nStreaming `v0.7.4` is released! 
Install via `pip`:\r\n\r\n```\r\npip install --upgrade mosaicml-streaming==0.7.4\r\n```\r\n\r\n## 🐛 Bug Fixes\r\n* Download to temporary path from azure by @philipnrmn in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F566\r\n* fix(merge_index): scheme was not well formatted by @fwertel in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F576\r\n* Update misplaced params of _format_remote_index_files by @lsongx in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F584\r\n* Modifications to resumption shared memory allowing `load_state_dict` multiple times. by @snarayan21 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F593\r\n\r\n## What's Changed\r\n* Bump fastapi from 0.108.0 to 0.109.0 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F564\r\n* Bump gitpython from 3.1.40 to 3.1.41 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F565\r\n* Download to temporary path from azure by @philipnrmn in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F566\r\n* Use `tempfile.gettempdir()` instead of a hardcoded temp root. 
by @knighton in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F570\r\n* fix(merge_index): scheme was not well formatted by @fwertel in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F576\r\n* Bump uvicorn from 0.25.0 to 0.26.0 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F572\r\n* Bump sphinx-tabs from 3.4.4 to 3.4.5 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F571\r\n* Update misplaced params of _format_remote_index_files by @lsongx in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F584\r\n* Remove .ci folder and move FILE_HEADER and CODEOWNERS by @irenedea in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F588\r\n* Modifications to resumption shared memory allowing `load_state_dict` multiple times. by @snarayan21 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F593\r\n* Bump version to 0.7.4 by @snarayan21 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F595\r\n\r\n## New Contributors\r\n* @philipnrmn made their first contribution in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F566\r\n* @fwertel made their first contribution in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F576\r\n* @lsongx made their first contribution in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F584\r\n* @irenedea made their first contribution in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F588\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fcompare\u002Fv0.7.3...v0.7.4","2024-02-08T22:00:23",{"id":192,"version":193,"summary_zh":194,"released_at":195},189236,"v0.7.3","# 🚀 Streaming v0.7.3\r\n\r\nStreaming `v0.7.3` is released! 
Install via `pip`:\r\n\r\n```\r\npip install --upgrade mosaicml-streaming==0.7.3\r\n```\r\n\r\n## 🐛 Bug Fixes\r\n- Logging messages for new defaults only show once per rank. (#543)\r\n- Fixed padding calculation for repeat samples in the partition. (#544)\r\n\r\n## 🔧 Other improvements\r\n- Update copyright license year from 2023 -> 2022-2024. (#560)\r\n\r\n## What's Changed\r\n* Logging messages from new defaults only show once per rank. by @snarayan21 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F543\r\n* Fixed condition for warning when partitioning over tiny datasets. by @snarayan21 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F544\r\n* Removing stray print statement by @snarayan21 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F553\r\n* Bump pydantic from 2.5.2 to 2.5.3 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F548\r\n* Bump uvicorn from 0.24.0.post1 to 0.25.0 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F549\r\n* Bump fastapi from 0.104.1 to 0.108.0 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F557\r\n* Bump pytest from 7.4.3 to 7.4.4 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F558\r\n* Update copyright: 2023 -> 2022-2024. by @knighton in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F560\r\n* Bump version to 0.7.3 by @karan6181 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F562\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fcompare\u002Fv0.7.2...v0.7.3","2024-01-12T18:12:30",{"id":197,"version":198,"summary_zh":199,"released_at":200},189237,"v0.7.2","# 🚀 Streaming v0.7.2\r\n\r\nStreaming `v0.7.2` is released! 
Install via `pip`:\r\n\r\n```\r\npip install --upgrade mosaicml-streaming==0.7.2\r\n```\r\n\r\n## :gem: New Features\r\n### 1. Canned ACL Support (#512)\r\nAdd support for the Canned ACL using the environment variable `S3_CANNED_ACL` for AWS S3. Check out the [Canned ACL](https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fstreaming\u002Fen\u002Fstable\u002Fhow_to_guides\u002Fconfigure_cloud_storage_credentials.html#canned-acl) document for how to use it.\r\n\r\n### 2. Allow\u002Freject datasets containing unsafe types (#519)\r\nThe pickle serialization format, one of the available MDS encodings, is a potential security vulnerability. We added a boolean flag `allow_unsafe_types` in the `StreamingDataset` class to allow or reject datasets containing Pickle.\r\n\r\n## 🐛 Bug Fixes\r\n- Retrieve batch size correctly from vision yamls for the streaming simulator (#501)\r\n- Fix for CVE-2023-47248 (#504)\r\n- Streaming simulator bug fixes (proportion, repeat, yaml ingestion) (#514)\r\n  - Proportion of None instead of a string 'None' is now handled correctly.\r\n  - Repeat of None instead of a string 'None' is now handled correctly.\r\n  - Added warning for StreamingDataset subclass defaults\r\n- Fix sample partitioning algorithm bug for tiny datasets (#517)\r\n\r\n## 🔧 Improvements\r\n- Added warning messages for new streaming dataset defaults to inform users about the old and new values. 
(#502)\r\n\r\n## What's Changed\r\n* Migrate pydocstyle to ruff by @Skylion007 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F500\r\n* Bump fastapi from 0.104.0 to 0.104.1 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F496\r\n* Bump uvicorn from 0.23.2 to 0.24.0.post1 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F497\r\n* Retrieve batch size correctly from vision yamls for simulator by @snarayan21 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F501\r\n* Adding warning messages for new defaults by @snarayan21 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F502\r\n* Fix for CVE-2023-47248 by @bandish-shah in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F504\r\n* Bump pydantic from 2.4.2 to 2.5.2 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F513\r\n* Bump yamllint from 1.32.0 to 1.33.0 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F506\r\n* Fixed comments and update dataframe_to_MDS API signature by @karan6181 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F515\r\n* Simulator bug fixes (proportion, repeat, yaml ingestion) by @snarayan21 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F514\r\n* Add support for the Canned ACL environment variable for AWS S3 by @karan6181 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F512\r\n* Fixed bugs when trying to use very small datasets by @snarayan21 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F517\r\n* Bump databricks-sdk from 0.8.0 to 0.14.0 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F518\r\n* Add flag to allow or reject datasets containing unsafe types (i.e., Pickle) by @knighton in 
https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F519\r\n* improve exception error messages for downloading by @Skylion007 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F525\r\n* doc: add NDArray format by @OrenLeung in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F527\r\n* Offload exception to mds_write.  by @XiaohanZhangCMU in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F528\r\n* Add allow_unsafe_types parameter to the streaming regression tests by @karan6181 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F531\r\n* Bump version to 0.7.2 by @karan6181 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F532\r\n\r\n## New Contributors\r\n* @OrenLeung made their first contribution in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F527\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fcompare\u002Fv0.7.1...v0.7.2","2023-12-14T17:26:01",{"id":202,"version":203,"summary_zh":204,"released_at":205},189238,"v0.7.1","# 🚀 Streaming v0.7.1\r\n\r\nStreaming `v0.7.1` is released! Install via `pip`:\r\n\r\n```\r\npip install --upgrade mosaicml-streaming==0.7.1\r\n```\r\n\r\n## 🐛 Bug Fixes\r\n\r\n- Simulation from command line with `simulator` is fixed (#499)\r\n\r\n## What's Changed\r\n* Fixing simulator command with simulation directories being included in package by @snarayan21 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F499\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fcompare\u002Fv0.7.0...v0.7.1","2023-11-06T23:03:26",{"id":207,"version":208,"summary_zh":209,"released_at":210},189239,"v0.7.0","# 🚀 Streaming v0.7.0\r\n\r\nStreaming `v0.7.0` is released! 
Install via `pip`:\r\n\r\n```\r\npip install --upgrade mosaicml-streaming==0.7.0\r\n```\r\n\r\n## 📈 Better Defaults for `StreamingDataset` (#479)\r\n- The default values for `StreamingDataset` have been updated to be more performant and are applicable for most use cases, detailed below:\r\n\r\n| Parameter             | Old Value                                                  | New Value                                                              | Benefit                                                        |\r\n|-----------------------|------------------------------------------------------------|------------------------------------------------------------------------|----------------------------------------------------------------|\r\n| `shuffle_algo`        | `py1s`                                                     | `py1e`                                                                 | Better shuffle and balanced downloading                        |\r\n| `num_canonical_nodes` | `64 * physical nodes`                                      | if `py1s` or `py2s`, `64 * physical_nodes`, otherwise `physical_nodes` | Consistently good shuffle for all shuffle algos                |\r\n| `shuffle_block_size`  | `262,144`                                                  | `4,000,000 \u002F num_canonical_nodes`                                      | Consistently good shuffle for all `num_canonical_nodes` values |\r\n| `predownload`         | `max(batch_size, 256 * batch_size \u002F\u002F num_canonical_nodes)` | `8 * batch_size`                                                       | Better balanced downloading                                    |\r\n| `partition_algo`      | `orig`                                                     | `relaxed`                                                              | More flexible deterministic resumptions on nodes               |\r\n\r\n## :gem: New Features\r\n\r\n### 🤖 Streaming Simulator: Easily simulate the performance 
of training configurations. (#385)\r\n- After installing this version of streaming, simply run the command `simulator` in your terminal to open the simulation interface.\r\n- Simulate throughput, network downloads, shuffle quality, and cache limit requirements for configurations. \r\n- Easily de-risk runs and find performant parameter settings.\r\n- Check out the [docs](https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fstreaming\u002Fen\u002Fstable\u002Ffundamentals\u002Fsimulator.html) for more information!\r\n\r\n### 🔢 More flexible deterministic training and resumption (#476)\r\n- Deterministic training and resumptions are now possible on more numbers of nodes!\r\n- Previously, the `num_canonical_nodes` parameter had to divide or be a multiple of the number of physical nodes for determinism.\r\n- Now, deterministic training is possible on any number of nodes that also evenly divides your run's global batch size.\r\n\r\n## 🐛 Bug Fixes\r\n\r\n- Check for invalid hash algorithm names (#486)\r\n\r\n## What's Changed\r\n* Bump fastapi from 0.103.2 to 0.104.0 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F480\r\n* Bump gitpython from 3.1.37 to 3.1.40 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F481\r\n* Bump sphinx-tabs from 3.4.1 to 3.4.4 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F482\r\n* do not remove local directory when out is local by @XiaohanZhangCMU in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F477\r\n* Update __init__.py by @XiaohanZhangCMU in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F484\r\n* Check for invalid hash algorithm name by @karan6181 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F486\r\n* Relaxing divisibility constraints on num_canonical_nodes and num_physical_nodes by @snarayan21 in 
https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F476\r\n* Better default values for StreamingDataset args by @snarayan21 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F479\r\n* Update release yaml to not write anything to GitHub by @karan6181 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F487\r\n* Bump pypandoc from 1.11 to 1.12 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F490\r\n* Bump pytest from 7.4.2 to 7.4.3 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F491\r\n* Bumping version for streaming v0.7.0 by @snarayan21 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F495\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fcompare\u002Fv0.6.1...v0.7.0","2023-11-06T01:23:37",{"id":212,"version":213,"summary_zh":214,"released_at":215},189240,"v0.6.1","# 🚀 Streaming v0.6.1\r\n\r\nStreaming `v0.6.1` is released! Install via `pip`:\r\n\r\n```\r\npip install --upgrade mosaicml-streaming==0.6.1\r\n\r\n```\r\n\r\n## :gem: New Features\r\n\r\n### :railway_car: Merge meta-data information from sub-directories dataset to form one unified dataset. (#449)\r\n- Addition of the `merge_index()` utility method to merge subdirectories index files from an MDS dataset. The subdirectories can be local or any supported cloud provider URL path.\r\n- Checkout [dataset conversion](https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fstreaming\u002Fen\u002Fstable\u002Fexamples\u002Fmultiprocess_dataset_conversion.html) and [Spark Dataframe to MDS](https:\u002F\u002Fdocs.mosaicml.com\u002Fprojects\u002Fstreaming\u002Fen\u002Fstable\u002Fexamples\u002Fspark_dataframe_to_MDS.html) jupyter notebook for an example in action.\r\n\r\n### :repeat: Retry uploading a file to a cloud provider path. 
(#448)\r\n- Added upload retry logic with backoff and jitter during dataset conversion as part of parameter `retry` in [Writer](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fv0.6.1\u002Fstreaming\u002Fbase\u002Fformat\u002Fbase\u002Fwriter.py#L65).\r\n```python\r\nfrom streaming import MDSWriter\r\n\r\nwith MDSWriter(\r\n               ...,\r\n               retry=3) as out:\r\n    for sample in dataset:\r\n        out.write(sample)\r\n```\r\n\r\n## 🐛 Bug Fixes\r\n\r\n- Validate [Writer](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fv0.6.1\u002Fstreaming\u002Fbase\u002Fformat\u002Fbase\u002Fwriter.py#L32) arguments and raise a ValueError exception if argument(s) is\u002Fare invalid. (#434)\r\n- Terminate the main process if one of the upload threads receives an Exception during dataset conversion. (#448)\r\n\r\n## 🔧 Improvements\r\n\r\n- More balanced inter-node downloading for the `py1e` shuffling algorithm by varying shard sample ranges, helping to reduce throughput drops at scale. 
(#442)\r\n\r\n## What's Changed\r\n* Validate writer arguments by @karan6181 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F434\r\n* Bump pytest from 7.4.1 to 7.4.2 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F428\r\n* Bump gitpython from 3.1.34 to 3.1.36 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F435\r\n* Fix stylistic issues (mostly 100col, docstring conventions) by @knighton in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F439\r\n* Bump pytest-codeblocks from 0.16.1 to 0.17.0 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F436\r\n* py1e randomized by @snarayan21 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F442\r\n* Bump gitpython from 3.1.36 to 3.1.37 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F446\r\n* Fix BatchFeature of Transformers not handled by StreamingDataloader by @Hubert-Bonisseur in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F450\r\n* Add a retry logic with backoff and jitter by @karan6181 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F448\r\n* Fix broken bibtext by @Skylion007 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F452\r\n* Update integration test to include sample order comparison by @karan6181 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F456\r\n* Bump pydantic from 2.3.0 to 2.4.2 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F455\r\n* Update MCLI credential page for Databricks by @karan6181 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F466\r\n* Add merge index file utility by @XiaohanZhangCMU in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F449\r\n* Add py1e 
warning when Shuffle block size is smaller than shard size by @snarayan21 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F463\r\n* Fix doc strings by @XiaohanZhangCMU in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F469\r\n* Bump fastapi from 0.103.1 to 0.103.2 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F454\r\n* Maintain order for merge_index_from_list by @XiaohanZhangCMU in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F472\r\n* Fixed codeql out of disk space issue by @karan6181 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F473\r\n* Bump version to 0.6.1 by @karan6181 in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F474\r\n\r\n## New Contributors\r\n* @Hubert-Bonisseur made their first contribution in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F450\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fcompare\u002Fv0.6.0...v0.6.1","2023-10-18T21:28:58",{"id":217,"version":218,"summary_zh":219,"released_at":220},189241,"v0.6.0","# 🚀 Streaming v0.6.0\r\n\r\nStreaming `v0.6.0` is released! Install via `pip`:\r\n\r\n```\r\npip install --upgrade mosaicml-streaming==0.6.0\r\n\r\n```\r\n\r\n## New Features\r\n\r\n### **🆕**  Databricks File System and Databricks Unity Catalog  (#362)\r\n\r\nSupport for reading and writing data from and to the Databricks File System (DBFS) and Unity Catalog (UC) Volumes. This means that you can now use DBFS and UC Volumes as a source or sink for your streaming data pipelines or model training. Below is the path structure:\r\n\r\n**Databricks File System (DBFS)** \r\n\r\nDBFS path structure is a hierarchical namespace that is organized into directories and files. 
The DBFS prefix must start with `dbfs:\u002F`.\r\n\r\n**UC Volumes**\r\n\r\nThe path structure for UC Volumes is similar to the path structure for DBFS, but with a few key differences.\r\n\r\nThe root of the UC Volumes namespace is `dbfs:\u002FVolumes\u002F\u003Ccatalog>\u002F\u003Cschema>\u002F\u003Cvolume>`, where:\r\n\r\n- `\u003Ccatalog>` is the name of the catalog where the volume is created.\r\n- `\u003Cschema>` is the name of the schema where the volume is created.\r\n- `\u003Cvolume>` is the name of the volume.\r\n\r\nHence, use a `dbfs:\u002FVolumes` prefix to specify a UC Volumes path.\r\n\r\n### 💽 Spark Dataframe to MDS converter (#363)\r\n\r\nIntroducing the new `DataFrameToMDS` API, empowering users to effortlessly leverage Spark's capabilities for handling diverse datasets in various formats. This API enables seamless conversion of Spark DataFrames into MDS datasets, with the flexibility to specify output locations to both local and cloud storage. Index files are optionally merged. Additionally, users can add data preprocessing steps by defining custom iterator functions and arguments. All these features are seamlessly bundled into a single Spark job, ensuring an efficient and streamlined workflow for data transformation. An example [notebook](https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fblob\u002Fmain\u002Fexamples\u002Fspark_dataframe_to_MDS.ipynb) is provided to help users get started.\r\n\r\n### 🔀 Randomize and offset shuffle blocks algorithm (#373)\r\n\r\nThe new `py1br` shuffle algorithm helps mitigate download spikes that occur when using the `py1b` algorithm. With `py1b`, shuffle blocks are all the same size, so when progressing through training, nodes will have to download many shards at the same time. In contrast, with `py1br`, shuffle blocks are offset from each other and are variably sized. This results in more balanced downloads over time. 
The `py1br` algorithm is a replacement for the `py1b` algorithm, which will be deprecated soon.\r\n\r\n```python\r\nfrom streaming import StreamingDataset\r\n\r\ndataset = StreamingDataset(\r\n    shuffle_algo='py1br',\r\n    ...\r\n)\r\n```\r\n\r\n### 🔀 Expanded range shuffle algorithm (#394)\r\n\r\nThe new `py1e` shuffle algorithm helps reduce the minimum cache limit needed for training, and results in much smoother downloads than both `py1b` and `py1br`. However, its shuffle quality is slightly lower. Rather than shuffling all samples in blocks of size `shuffle_block_size`, it instead spreads the samples of each shard over a range of maximum size `shuffle_block_size`, retaining most of the shuffle quality from `py1b` and `py1br` while reducing download spikes across the duration of training.\r\n\r\n```python\r\nfrom streaming import StreamingDataset\r\n\r\ndataset = StreamingDataset(\r\n    shuffle_algo='py1e',\r\n    ...\r\n)\r\n```\r\n\r\n### 🔥 Per-Stream Batching (#407)\r\n\r\nUsers are now able to ensure that each batch has samples from only a single stream. You can now set the new parameter `batching_method` to `per_stream` to access this functionality. Per-stream batching will still take into account upsampling and downsampling of streams, set by `proportion`, `repeat`, or `choose`. To make batches contain only samples from a group of streams, merge streams’ `index.json` files to create a single one for each group.\r\n\r\n```python\r\nfrom streaming import StreamingDataset\r\n\r\ndataset = StreamingDataset(\r\n    batching_method='per_stream',\r\n    ...\r\n)\r\n```\r\n\r\n### 🔥 Stratified Batching (#408)\r\n\r\nUsers are now able to ensure that each batch has a consistent number of samples from every stream. Previously, stream proportions were satisfied in the aggregate but not at the batch level. You can now set the new parameter `batching_method` to `stratified` to access this functionality. 
Stratified batching will still take into account upsampling and downsampling of streams, set by `proportion`, `repeat`, or `choose`.\r\n\r\n```python\r\nfrom streaming import StreamingDataset\r\n\r\ndataset = StreamingDataset(\r\n    batching_method='stratified',\r\n    ...\r\n)\r\n```\r\n\r\n### 💪 Download-Efficient Sparse Sampling (#391)\r\n\r\nPrevious versions of StreamingDataset implement downsampling\u002Fupsampling by giving each sample equal probability of being selected (plus or minus one due when sampling is fractional), without regard to what shard a sample is on. This means that no matter how small your desired downsampling is, StreamingDataset will still use each shard at as equal a rate as possible. This is problematic for do","2023-09-13T20:11:20",{"id":222,"version":223,"summary_zh":224,"released_at":225},189242,"v0.5.2","# 🚀 Streaming v0.5.2\r\n\r\nStreaming `v0.5.2` is released! Install via `pip`:\r\n\r\n```\r\npip install --upgrade mosaicml-streaming==0.5.2\r\n```\r\n\r\n## New features\r\n- Allow authentication with GCS for service accounts #315 \r\n- human-readable suffixes for size_limit and epoch_size #333 \r\n- static sampling #348 \r\n\r\n## Documentation changes\r\n- Update contribution guide and improved unittest logic #343 \r\n- static sampling #348 \r\n\r\n## Testing\r\n- Add a regression test for StreamingDataset instantiation and iteration #318 \r\n- Fixed accidental shard delete test #341 \r\n- Add a regression test for StreamingDataset using cloud providers #319 \r\n- Add iteration time test as part of regression testing #358 \r\n\r\n## Bug fix\r\n- Fix init local dir zip-only shard handling #330\r\n- added default behavior if no streams and epoch_size specified #348 \r\n\r\n## What's Changed\r\n* Bump myst-parser from 1.0.0 to 2.0.0 by @dependabot in https:\u002F\u002Fgithub.com\u002Fmosaicml\u002Fstreaming\u002Fpull\u002F309\r\n* Added files to support azure datalake storage by @shivshandilya in 
https://github.com/mosaicml/streaming/pull/311
* Add secrets check as part of pre-commit by @karan6181 in https://github.com/mosaicml/streaming/pull/312
* Bump pytest from 7.3.2 to 7.4.0 by @dependabot in https://github.com/mosaicml/streaming/pull/313
* Bump fastapi from 0.97.0 to 0.98.0 by @dependabot in https://github.com/mosaicml/streaming/pull/314
* Add GCS authentication for service accounts by @b-chu in https://github.com/mosaicml/streaming/pull/315
* Bump fastapi from 0.98.0 to 0.100.0 by @dependabot in https://github.com/mosaicml/streaming/pull/322
* Bump uvicorn from 0.22.0 to 0.23.0 by @dependabot in https://github.com/mosaicml/streaming/pull/327
* Bump gitpython from 3.1.31 to 3.1.32 by @dependabot in https://github.com/mosaicml/streaming/pull/329
* Bump pydantic from 1.10.9 to 1.10.11 by @dependabot in https://github.com/mosaicml/streaming/pull/328
* Sync tmp directory by @b-chu in https://github.com/mosaicml/streaming/pull/316
* Add a regression test for StreamingDataset instantiation and iteration by @b-chu in https://github.com/mosaicml/streaming/pull/318
* human-readable suffixes for size_limit and epoch_size by @snarayan21 in https://github.com/mosaicml/streaming/pull/333
* Updated pre commit packages by @snarayan21 in https://github.com/mosaicml/streaming/pull/340
* Fix init local dir zip-only shard handling by @knighton in https://github.com/mosaicml/streaming/pull/330
* Fixed accidental shard delete test by @karan6181 in
https://github.com/mosaicml/streaming/pull/341
* Bump uvicorn from 0.23.0 to 0.23.1 by @dependabot in https://github.com/mosaicml/streaming/pull/338
* Download the index.json file as tmp extension until it finishes by @karan6181 in https://github.com/mosaicml/streaming/pull/346
* Update contribution guide and improved unittest logic by @karan6181 in https://github.com/mosaicml/streaming/pull/343
* Bump fastapi from 0.100.0 to 0.100.1 by @dependabot in https://github.com/mosaicml/streaming/pull/351
* Bump uvicorn from 0.23.1 to 0.23.2 by @dependabot in https://github.com/mosaicml/streaming/pull/352
* Bump furo from 2023.5.20 to 2023.7.26 by @dependabot in https://github.com/mosaicml/streaming/pull/354
* Bump pydantic from 1.10.11 to 2.1.1 by @dependabot in https://github.com/mosaicml/streaming/pull/353
* added default behavior if no streams and epoch_size specified by @snarayan21 in https://github.com/mosaicml/streaming/pull/348
* Add a regression test for StreamingDataset using cloud providers by @b-chu in https://github.com/mosaicml/streaming/pull/319
* Fixed sampling by @snarayan21 in https://github.com/mosaicml/streaming/pull/356
* mds ndarray int conversion by @snarayan21 in https://github.com/mosaicml/streaming/pull/357
* Add iteration time test as part of regression testing by @karan6181 in https://github.com/mosaicml/streaming/pull/358
* Bump pydantic from 1.10.11 to 2.1.1 by @dependabot in https://github.com/mosaicml/streaming/pull/366
* Fixed CI test to perform proper directory cleanup by @karan6181 in
https://github.com/mosaicml/streaming/pull/369
* version bump to 0.5.2 by @snarayan21 in https://github.com/mosaicml/streaming/pull/370

## New Contributors
* @shivshandilya made their first contribution in https://github.com/mosaicml/streaming/pull/311
* @b-chu made their first contribution in https://github.com/mosaicml/streaming/pull/315
* @snarayan21 made their first contribution in https://github.com/mosaicml/streaming/pull/333

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.5.1...v0.5.2

# 🚀 Streaming v0.5.1

## What's Changed
* Improved shard eviction test execution time by @karan6181 in https://github.com/mosaicml/streaming/pull/291
* Bump fastapi from 0.96.0 to 0.97.0 by @dependabot in https://github.com/mosaicml/streaming/pull/294
* Bump pytest from 7.3.1 to 7.3.2 by @dependabot in https://github.com/mosaicml/streaming/pull/295
* Bump pydantic from 1.10.8 to 1.10.9 by @dependabot in https://github.com/mosaicml/streaming/pull/296
* Terminate the main process if thread died unexpectedly by @karan6181 in https://github.com/mosaicml/streaming/pull/297
* Improved existing exception and exception messages by @karan6181 in https://github.com/mosaicml/streaming/pull/298
* Round drop_first to be divisible by num_physical_nodes.
by @knighton in https://github.com/mosaicml/streaming/pull/301
* Added a utility method to clean stale shared memory by @karan6181 in https://github.com/mosaicml/streaming/pull/299
* Propagate exception between threads and processes and improved error message by @karan6181 in https://github.com/mosaicml/streaming/pull/304
* Fix LocalDataset (property size for fancy `__getitem__`). by @knighton in https://github.com/mosaicml/streaming/pull/305
* Natively support encoding and decoding ndarrays in MDS by @knighton in https://github.com/mosaicml/streaming/pull/82
* Bump version to 0.5.1 by @karan6181 in https://github.com/mosaicml/streaming/pull/308

**Full Changelog**: https://github.com/mosaicml/streaming/compare/v0.5.0...v0.5.1

# 🚀 Streaming v0.5.0

Streaming `v0.5.0` is released! Install via `pip`:

```
pip install --upgrade mosaicml-streaming==0.5.0
```

## New Features

### 🆕 Cold Shard Eviction. ( #219 )

Dynamically delete the least recently used shards in order to keep disk usage under a specified limit. This is enabled by setting the StreamingDataset argument `cache_limit`. See the [shuffling](https://github.com/mosaicml/streaming/blob/main/docs/source/fundamentals/shuffling.md) guide for more details.

```python
from streaming import StreamingDataset

dataset = StreamingDataset(
    cache_limit='100gb',
    ...
)
```

### 🤙 Fetch sample using NumPy style indexing. ( #120 )

Users can now randomly access samples using NumPy-style indexing with `StreamingDataset`.
For example,

```python
import numpy as np
from streaming import StreamingDataset

dataset = StreamingDataset(local=local, remote=remote)

dataset[0]  # Fetch sample 0
dataset[-1]  # Fetch the last sample
dataset[[10, 20]]  # Fetch samples 10 and 20
dataset[slice(1, 10, 2)]  # Fetch samples 1, 3, 5, 7, and 9
dataset[5:0:-1]  # Fetch samples 5, 4, 3, 2, 1
dataset[np.array([4, 7])]  # Fetch samples 4 and 7
```

### 🦾 Any S3-compatible object store. ( #265 )

Support for any S3-compatible object store, meaning an object store that uses the S3 API to communicate with any connected device or system. Some S3-compatible object stores are [Cloudflare R2](https://www.cloudflare.com/products/r2/), [Coreweave](https://docs.coreweave.com/storage/object-storage), [Backblaze b2](https://www.backblaze.com/b2/cloud-storage.html), etc. Users need to provide the environment variable `S3_ENDPOINT_URL` for the object store they are using. Details on how to configure credentials can be found [here](https://github.com/mosaicml/streaming/blob/main/docs/source/how_to_guides/configure_cloud_storage_cred.md#any-s3-compatible-object-store).

### 🦾 Azure cloud blob storage. ( #256 )

Support for Azure cloud blob storage. Details on how to configure credentials can be found [here](https://github.com/mosaicml/streaming/blob/main/docs/source/how_to_guides/configure_cloud_storage_cred.md#azure-blob-storage).

## Bug Fixes

- Wait for the download and ready threads to finish before terminating the job. ( #286 )
- Fixed length calculation to use the resampled epoch size, not the underlying number of samples. ( #278 )
- Fixed mypy errors by adding a `py.typed` marker file. ( #245 )
- Create a new boto3 session per thread to avoid sharing resources.
( #241 )

## 🔧 **API changes**

- The argument `samples_per_epoch` has been renamed to `epoch_size` in `StreamingDataset` to better distinguish the actual number of underlying samples as serialized from the number of observed samples when iterating (which may differ due to weighting of sub-datasets).
- The argument `samples` has been renamed to `choose` in `Stream` to better distinguish underlying samples from resampled data.
- The argument `keep_raw` has been removed from `StreamingDataset` in the process of finalizing the design for shard eviction (see the newly added `cache_limit` parameter).
- The default value of `predownload` in `StreamingDataset` was updated; it is now derived from the batch size and number of canonical nodes instead of the previous constant value of `100_000`. This is to prevent predownloaded shards from getting evicted before ever being used.
- The default value of `num_canonical_nodes` in `StreamingDataset` was updated to 64 times the number of nodes of the initial run, instead of the number of nodes of the initial run, to increase data source diversity and improve convergence.
- The default value of `shuffle_algo` in `StreamingDataset` was changed from `py1b` to `py1s`, as it requires fewer shards to be downloaded during iteration.
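Intuitively, a shard-aware shuffle like `py1s` permutes the order of shards and then shuffles samples only within each shard, so far fewer shards need to be resident at once than with block-based shuffling across shard boundaries. A toy sketch of the idea (an illustration only, not the library's implementation):

```python
import random

def shard_aware_shuffle(shards, seed=0):
    """Shuffle shard order, then shuffle samples within each shard."""
    rng = random.Random(seed)
    order = list(shards)
    rng.shuffle(order)            # permute which shard is consumed next
    out = []
    for shard in order:
        samples = list(shard)
        rng.shuffle(samples)      # shuffle only inside this shard
        out.extend(samples)
    return out

# Two shards of four samples each: any run of four consecutive outputs
# comes from a single shard, so only one shard is needed at a time.
print(shard_aware_shuffle([[0, 1, 2, 3], [4, 5, 6, 7]]))
```

A block-based algorithm like `py1b` instead shuffles across shard boundaries, which improves shuffle quality but forces more shards to be downloaded concurrently.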
More details about the different shuffling algorithms can be found [here](https://github.com/mosaicml/streaming/blob/main/docs/source/fundamentals/shuffling.md).

## What's Changed
* Redesign shard index by @knighton in https://github.com/mosaicml/streaming/pull/236
* Propagate an exception raised by a thread to its caller by @karan6181 in https://github.com/mosaicml/streaming/pull/241
* Raise descriptive error message when index.json is corrupted by @karan6181 in https://github.com/mosaicml/streaming/pull/242
* Rename "samples" to "choose" (distinguish underlying vs resampled) by @knighton in https://github.com/mosaicml/streaming/pull/243
* Added py.typed to indicate that the repository has typing annotations by @karan6181 in https://github.com/mosaicml/streaming/pull/245
* Add "Array" base class, which provides numpy-style indexing. by @knighton in https://github.com/mosaicml/streaming/pull/120
* Better organize code by @knighton in https://github.com/mosaicml/streaming/pull/246
* Update readthedocs python version to 3.9 by @karan6181 in https://github.com/mosaicml/streaming/pull/249
* Create a new boto3 session per thread by @karan6181 in https://mosa