[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-uber--petastorm":3,"tool-uber--petastorm":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",156033,2,"2026-04-14T23:32:00",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 
万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":77,"owner_email":77,"owner_twitter":77,"owner_website":78,"owner_url":79,"languages":80,"stars":97,"forks":98,"last_commit_at":99,"license":100,"difficulty_score":10,"env_os":101,"env_gpu":102,"env_ram":103,"env_deps":104,"category_tags":113,"github_topics":114,"view_count":32,"oss_zip_url":77,"oss_zip_packed_at":77,"status":17,"created_at":122,"updated_at":123,"faqs":124,"releases":154},7596,"uber\u002Fpetastorm","petastorm","Petastorm library enables single machine or distributed training and evaluation of deep learning models from datasets in Apache Parquet format. It supports ML frameworks such as Tensorflow, Pytorch, and PySpark and can be used from pure Python code.","Petastorm 是一款由 Uber ATG 开源的数据访问库，旨在让开发者能够直接从 Apache Parquet 格式的数据集中进行深度学习模型的训练与评估。无论是单机环境还是分布式集群，它都能流畅运行，完美适配 TensorFlow、PyTorch 和 PySpark 等主流机器学习框架。\n\n在传统工作流中，将大规模数据集转换为模型可读的格式往往繁琐且低效，尤其是处理图像或多维数组时。Petastorm 巧妙解决了这一痛点，它无需将数据预先转换为 TFRecord 等特定格式，而是通过扩展 Parquet  schema，原生支持多维数组和自定义数据编码（如 JPEG、PNG 压缩），让数据读取变得简单高效。借助 PySpark 强大的处理能力，用户可以轻松在本地或 Spark 集群上生成标准化的数据集。\n\n这款工具特别适合从事深度学习研发的工程师、数据科学家以及需要处理海量结构化或非结构化数据的研究人员。如果你正在寻找一种既能利用 Parquet 存储优势，又能无缝对接现代深度学习框架的解决方案，Petastorm 能让你的数据管道更加简洁、灵活且高性能。只需几行 Pytho","Petastorm 是一款由 Uber ATG 开源的数据访问库，旨在让开发者能够直接从 Apache Parquet 格式的数据集中进行深度学习模型的训练与评估。无论是单机环境还是分布式集群，它都能流畅运行，完美适配 TensorFlow、PyTorch 和 PySpark 等主流机器学习框架。\n\n在传统工作流中，将大规模数据集转换为模型可读的格式往往繁琐且低效，尤其是处理图像或多维数组时。Petastorm 巧妙解决了这一痛点，它无需将数据预先转换为 TFRecord 等特定格式，而是通过扩展 Parquet  schema，原生支持多维数组和自定义数据编码（如 JPEG、PNG 压缩），让数据读取变得简单高效。借助 PySpark 强大的处理能力，用户可以轻松在本地或 Spark 集群上生成标准化的数据集。\n\n这款工具特别适合从事深度学习研发的工程师、数据科学家以及需要处理海量结构化或非结构化数据的研究人员。如果你正在寻找一种既能利用 Parquet 存储优势，又能无缝对接现代深度学习框架的解决方案，Petastorm 能让你的数据管道更加简洁、灵活且高性能。只需几行 Python 代码，即可开启高效的数据加载体验。","\nPetastorm\n=========\n\n.. image:: https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Factions\u002Fworkflows\u002Funittest.yml\u002Fbadge.svg?branch=master\n   :target: https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Factions\u002Fworkflows\u002Funittest.yml\n   :alt: Build Status\n\n.. image:: https:\u002F\u002Fcodecov.io\u002Fgh\u002Fuber\u002Fpetastorm\u002Fbranch\u002Fmaster\u002Fgraph\u002Fbadge.svg\n   :target: https:\u002F\u002Fcodecov.io\u002Fgh\u002Fuber\u002Fpetastorm\u002Fbranch\u002Fmaster\n   :alt: Code coverage\n\n.. 
image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-blue.svg\n   :target: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-blue.svg\n   :alt: License\n\n.. image:: https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fpetastorm.svg\n   :target: https:\u002F\u002Fpypi.org\u002Fproject\u002Fpetastorm\n   :alt: Latest Version\n\n.. inclusion-marker-start-do-not-remove\n\n.. contents::\n\n\nPetastorm is an open source data access library developed at Uber ATG. This library enables single machine or\ndistributed training and evaluation of deep learning models directly from datasets in Apache Parquet\nformat. Petastorm supports popular Python-based machine learning (ML) frameworks such as\n`Tensorflow \u003Chttp:\u002F\u002Fwww.tensorflow.org\u002F>`_, `PyTorch \u003Chttps:\u002F\u002Fpytorch.org\u002F>`_, and\n`PySpark \u003Chttp:\u002F\u002Fspark.apache.org\u002Fdocs\u002Flatest\u002Fapi\u002Fpython\u002Fpyspark.html>`_. It can also be used from pure Python code.\n\nDocumentation web site: `\u003Chttps:\u002F\u002Fpetastorm.readthedocs.io>`_\n\n\n\nInstallation\n------------\n\n.. code-block:: bash\n\n    pip install petastorm\n\n\nThere are several extra dependencies that are defined by the ``petastorm`` package that are not installed automatically.\nThe extras are: ``tf``, ``tf_gpu``, ``torch``, ``opencv``, ``docs``, ``test``.\n\nFor example to trigger installation of GPU version of tensorflow and opencv, use the following pip command:\n\n.. code-block:: bash\n\n    pip install petastorm[opencv,tf_gpu]\n\n\n\nGenerating a dataset\n--------------------\n\nA dataset created using Petastorm is stored in `Apache Parquet \u003Chttps:\u002F\u002Fparquet.apache.org\u002F>`_ format.\nOn top of a Parquet schema, petastorm also stores higher-level schema information that makes multidimensional arrays into a native part of a petastorm dataset. \n\nPetastorm supports extensible data codecs. These enable a user to use one of the standard data compressions (jpeg, png) or implement her own.\n\nGenerating a dataset is done using PySpark.\nPySpark natively supports Parquet format, making it easy to run on a single machine or on a Spark compute cluster.\nHere is a minimalistic example writing out a table with some random data.\n\n\n.. code-block:: python\n\n   import numpy as np\n   from pyspark.sql import SparkSession\n   from pyspark.sql.types import IntegerType\n\n   from petastorm.codecs import ScalarCodec, CompressedImageCodec, NdarrayCodec\n   from petastorm.etl.dataset_metadata import materialize_dataset\n   from petastorm.unischema import dict_to_spark_row, Unischema, UnischemaField\n\n   # The schema defines how the dataset schema looks like\n   HelloWorldSchema = Unischema('HelloWorldSchema', [\n       UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),\n       UnischemaField('image1', np.uint8, (128, 256, 3), CompressedImageCodec('png'), False),\n       UnischemaField('array_4d', np.uint8, (None, 128, 30, None), NdarrayCodec(), False),\n   ])\n\n\n   def row_generator(x):\n       \"\"\"Returns a single entry in the generated dataset. 
Return a bunch of random values as an example.\"\"\"\n       return {'id': x,\n               'image1': np.random.randint(0, 255, dtype=np.uint8, size=(128, 256, 3)),\n               'array_4d': np.random.randint(0, 255, dtype=np.uint8, size=(4, 128, 30, 3))}\n\n\n   def generate_petastorm_dataset(output_url='file:\u002F\u002F\u002Ftmp\u002Fhello_world_dataset'):\n       rowgroup_size_mb = 256\n\n       spark = SparkSession.builder.config('spark.driver.memory', '2g').master('local[2]').getOrCreate()\n       sc = spark.sparkContext\n\n       # Wrap dataset materialization portion. Will take care of setting up spark environment variables as\n       # well as save petastorm specific metadata\n       rows_count = 10\n       with materialize_dataset(spark, output_url, HelloWorldSchema, rowgroup_size_mb):\n\n           rows_rdd = sc.parallelize(range(rows_count))\\\n               .map(row_generator)\\\n               .map(lambda x: dict_to_spark_row(HelloWorldSchema, x))\n\n           spark.createDataFrame(rows_rdd, HelloWorldSchema.as_spark_schema()) \\\n               .coalesce(10) \\\n               .write \\\n               .mode('overwrite') \\\n               .parquet(output_url)\n\n\n- ``HelloWorldSchema`` is an instance of a ``Unischema`` object.\n  ``Unischema`` is capable of rendering types of its fields into different\n  framework specific formats, such as: Spark ``StructType``, Tensorflow\n  ``tf.DType`` and numpy ``numpy.dtype``.\n- To define a dataset field, you need to specify a ``type``, ``shape``, a\n  ``codec`` instance and whether the field is nullable for each field of the\n  ``Unischema``.\n- We use PySpark for writing output Parquet files. In this example, we launch\n  PySpark on a local box (``.master('local[2]')``). Of course for a larger\n  scale dataset generation we would need a real compute cluster.\n- We wrap spark dataset generation code with the ``materialize_dataset``\n  context manager.  The context manager is responsible for configuring row\n  group size at the beginning and write out petastorm specific metadata at the\n  end.\n- The row generating code is expected to return a Python dictionary indexed by\n  a field name. We use ``row_generator`` function for that. \n- ``dict_to_spark_row`` converts the dictionary into a ``pyspark.Row``\n  object while ensuring schema ``HelloWorldSchema`` compliance (shape,\n  type and is-nullable condition are tested).\n- Once we have a ``pyspark.DataFrame`` we write it out to a parquet\n  storage. The parquet schema is automatically derived from\n  ``HelloWorldSchema``.\n\nPlain Python API\n----------------\nThe ``petastorm.reader.Reader`` class is the main entry point for user\ncode that accesses the data from an ML framework such as Tensorflow or Pytorch.\nThe reader has multiple features such as:\n\n- Selective column readout\n- Multiple parallelism strategies: thread, process, single-threaded (for debug)\n- N-grams readout support\n- Row filtering (row predicates)\n- Shuffling\n- Partitioning for multi-GPU training\n- Local caching\n\nReading a dataset is simple using the ``petastorm.reader.Reader`` class which can be created using the\n``petastorm.make_reader`` factory method:\n\n.. 
code-block:: python\n\n    from petastorm import make_reader\n\n    with make_reader('hdfs:\u002F\u002Fmyhadoop\u002Fsome_dataset') as reader:\n        for row in reader:\n            print(row)\n\n``hdfs:\u002F\u002F...`` and ``file:\u002F\u002F...`` are supported URL protocols.\n\nOnce a ``Reader`` is instantiated, you can use it as an iterator.\n\nTensorflow API\n--------------\n\nTo hook the reader into a tensorflow graph, you can use the ``tf_tensors``\nfunction:\n\n.. code-block:: python\n\n    from petastorm.tf_utils import tf_tensors\n\n    with make_reader('file:\u002F\u002F\u002Fsome\u002Flocalpath\u002Fa_dataset') as reader:\n        row_tensors = tf_tensors(reader)\n        with tf.Session() as session:\n            for _ in range(3):\n                print(session.run(row_tensors))\n\nAlternatively, you can use the new ``tf.data.Dataset`` API:\n\n.. code-block:: python\n\n    from petastorm.tf_utils import make_petastorm_dataset\n\n    with make_reader('file:\u002F\u002F\u002Fsome\u002Flocalpath\u002Fa_dataset') as reader:\n        dataset = make_petastorm_dataset(reader)\n        iterator = dataset.make_one_shot_iterator()\n        tensor = iterator.get_next()\n        with tf.Session() as sess:\n            sample = sess.run(tensor)\n            print(sample.id)\n\nPytorch API\n-----------\n\nAs illustrated in\n`pytorch_example.py \u003Chttps:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fblob\u002Fmaster\u002Fexamples\u002Fmnist\u002Fpytorch_example.py>`_,\nreading a petastorm dataset from pytorch\ncan be done via the adapter class ``petastorm.pytorch.DataLoader``,\nwhich allows a custom pytorch collating function and transforms to be supplied.\n\nBe sure you have ``torch`` and ``torchvision`` installed:\n\n.. code-block:: bash\n\n    pip install torchvision\n\nThe minimalist example below assumes the definition of a ``Net`` class and\n``train`` and ``test`` functions, included in ``pytorch_example``:\n\n.. code-block:: python\n\n    import torch\n    from torchvision import transforms\n\n    from petastorm import make_reader\n    from petastorm.pytorch import DataLoader\n    from petastorm.transform import TransformSpec\n\n    torch.manual_seed(1)\n    device = torch.device('cpu')\n    model = Net().to(device)\n    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)\n\n    def _transform_row(mnist_row):\n        transform = transforms.Compose([\n            transforms.ToTensor(),\n            transforms.Normalize((0.1307,), (0.3081,))\n        ])\n        return (transform(mnist_row['image']), mnist_row['digit'])\n\n\n    transform = TransformSpec(_transform_row, removed_fields=['idx'])\n\n    with DataLoader(make_reader('file:\u002F\u002F\u002Flocalpath\u002Fmnist\u002Ftrain', num_epochs=10,\n                                transform_spec=transform, seed=1, shuffle_rows=True), batch_size=64) as train_loader:\n        train(model, device, train_loader, 10, optimizer, 1)\n    with DataLoader(make_reader('file:\u002F\u002F\u002Flocalpath\u002Fmnist\u002Ftest', num_epochs=10,\n                                transform_spec=transform), batch_size=1000) as test_loader:\n        test(model, device, test_loader)\n\nIf you are working with very large batch sizes and do not need support for Decimal\u002Fstrings, we provide a ``petastorm.pytorch.BatchedDataLoader`` that can buffer using Torch tensors (``cpu`` or ``cuda``) with a significantly higher throughput.\n\nIf your dataset fits into system memory, you can use the in-memory dataloader ``petastorm.pytorch.InMemBatchedDataLoader``. 
This dataloader only reads the dataset once and caches data in memory to avoid additional I\u002FO for multiple epochs.\n\nSpark Dataset Converter API\n---------------------------\n\nThe Spark converter API simplifies the data conversion from Spark to TensorFlow or PyTorch.\nThe input Spark DataFrame is first materialized in Parquet format and then loaded as\na ``tf.data.Dataset`` or ``torch.utils.data.DataLoader``.\n\nThe minimalist example below assumes the definition of a compiled ``tf.keras`` model and a\nSpark DataFrame containing a feature column followed by a label column.\n\n.. code-block:: python\n\n    from petastorm.spark import SparkDatasetConverter, make_spark_converter\n    import tensorflow.compat.v1 as tf  # pylint: disable=import-error\n\n    # specify a cache dir first.\n    # the dir is used to save materialized spark dataframe files\n    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'hdfs:\u002F...')\n\n    df = ... # `df` is a spark dataframe\n\n    # create a converter from `df`\n    # it will materialize `df` to the cache dir.\n    converter = make_spark_converter(df)\n\n    # make a tensorflow dataset from `converter`\n    with converter.make_tf_dataset() as dataset:\n        # the `dataset` is a `tf.data.Dataset` object\n        # dataset transformation can be done if needed\n        dataset = dataset.map(...)\n        # we can train\u002Fevaluate the model on the `dataset`\n        model.fit(dataset)\n        # when exiting the context, the reader of the dataset will be closed\n\n    # delete the cached files of the dataframe.\n    converter.delete()\n\nThe minimalist example below assumes the definition of a ``Net`` class and\n``train`` and ``test`` functions, included in\n`pytorch_example.py \u003Chttps:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fblob\u002Fmaster\u002Fexamples\u002Fmnist\u002Fpytorch_example.py>`_,\nand a Spark DataFrame containing a feature column followed by a label column.\n\n.. code-block:: python\n\n    from petastorm.spark import SparkDatasetConverter, make_spark_converter\n\n    # specify a cache dir first.\n    # the dir is used to save materialized spark dataframe files\n    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'hdfs:\u002F...')\n\n    df_train, df_test = ... # `df_train` and `df_test` are spark dataframes\n    model = Net()\n\n    # create a converter_train from `df_train`\n    # it will materialize `df_train` to the cache dir. (the same for df_test)\n    converter_train = make_spark_converter(df_train)\n    converter_test = make_spark_converter(df_test)\n\n    # make a pytorch dataloader from `converter_train`\n    with converter_train.make_torch_dataloader() as dataloader_train:\n        # the `dataloader_train` is a `torch.utils.data.DataLoader` object\n        # we can train the model using `dataloader_train`\n        train(model, dataloader_train, ...)\n        # when exiting the context, the reader of the dataset will be closed\n\n    # the same for `converter_test`\n    with converter_test.make_torch_dataloader() as dataloader_test:\n        test(model, dataloader_test, ...)\n\n    # delete the cached files of the dataframes.\n    converter_train.delete()\n    converter_test.delete()\n\n\nAnalyzing petastorm datasets using PySpark and SQL\n--------------------------------------------------\n\nA Petastorm dataset can be read into a Spark DataFrame using PySpark, where you can\nuse a wide range of Spark tools to analyze and manipulate the dataset.\n\n.. 
code-block:: python\n\n   # Create a dataframe object from a parquet file\n   dataframe = spark.read.parquet(dataset_url)\n\n   # Show a schema\n   dataframe.printSchema()\n\n   # Count all\n   dataframe.count()\n\n   # Show a single column\n   dataframe.select('id').show()\n\nSQL can be used to query a Petastorm dataset:\n\n.. code-block:: python\n\n   spark.sql(\n      'SELECT count(id) '\n      'from parquet.`file:\u002F\u002F\u002Ftmp\u002Fhello_world_dataset`').collect()\n\nYou can find a full code sample here: `pyspark_hello_world.py \u003Chttps:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fblob\u002Fmaster\u002Fexamples\u002Fhello_world\u002Fpetastorm_dataset\u002Fpyspark_hello_world.py>`_.\n\nNon Petastorm Parquet Stores\n----------------------------\nPetastorm can also be used to read data directly from Apache Parquet stores. To achieve that, use\n``make_batch_reader`` (and not ``make_reader``). The following table summarizes the differences between the\n``make_batch_reader`` and ``make_reader`` functions.\n\n\n==================================================================  =====================================================\n``make_reader``                                                     ``make_batch_reader``\n==================================================================  =====================================================\nOnly Petastorm datasets (created using materialize_dataset)         Any Parquet store (some native Parquet column types\n                                                                    are not supported yet).\n------------------------------------------------------------------  -----------------------------------------------------\nThe reader returns one record at a time.                            The reader returns batches of records. The size of the\n                                                                    batch is not fixed and is defined by the Parquet row-group\n                                                                    size.\n------------------------------------------------------------------  -----------------------------------------------------\nPredicates passed to ``make_reader`` are evaluated per single row.  Predicates passed to ``make_batch_reader`` are evaluated per batch.\n------------------------------------------------------------------  -----------------------------------------------------\nCan filter parquet file based on the ``filters`` argument.          Can filter parquet file based on the ``filters`` argument.\n==================================================================  =====================================================\n\n\nTroubleshooting\n---------------\n\nSee the Troubleshooting_ page and please submit a ticket_ if you can't find an\nanswer.\n\n\nSee also\n--------\n\n1. Gruener, R., Cheng, O., and Litvin, Y. (2018) *Introducing Petastorm: Uber ATG's Data Access Library for Deep Learning*. URL: https:\u002F\u002Feng.uber.com\u002Fpetastorm\u002F\n2. QCon.ai 2019: `\"Petastorm: A Light-Weight Approach to Building ML Pipelines\" \u003Chttps:\u002F\u002Fwww.infoq.com\u002Fpresentations\u002Fpetastorm-ml-pipelines\u002F>`_.\n\n\n.. _Troubleshooting: docs\u002Ftroubleshoot.rst\n.. _ticket: https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fissues\u002Fnew\n.. _Development: docs\u002Fdevelopment.rst\n\nHow to Contribute\n=================\n\nWe prefer to receive contributions in the form of GitHub pull requests. 
Please send pull requests against the ``github.com\u002Fuber\u002Fpetastorm`` repository.\n\n- If you are looking for some ideas on what to contribute, check out `github issues \u003Chttps:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fissues>`_ and comment on the issue.\n- If you have an idea for an improvement, or you'd like to report a bug but don't have time to fix it, please `create a github issue \u003Chttps:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fissues\u002Fnew>`_.\n\nTo contribute a patch:\n\n- Break your work into small, single-purpose patches if possible. It's much harder to merge in a large change with a lot of disjoint features.\n- Submit the patch as a GitHub pull request against the master branch. For a tutorial, see the GitHub guides on forking a repo and sending a pull request.\n- Include a detailed description of the proposed change in the pull request.\n- Make sure that your code passes the unit tests. You can find instructions on how to run the unit tests `here \u003Chttps:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fblob\u002Fmaster\u002Fdocs\u002Fdevelopment.rst>`_.\n- Add new unit tests for your code.\n\nThank you in advance for your contributions!\n\n\nSee the Development_ page for development-related information.\n\n\n.. inclusion-marker-end-do-not-remove\n   Place contents above here if they should also appear in read-the-docs.\n   Contents below are already part of the read-the-docs table of contents.\n\n","Petastorm\n=========\n\n.. image:: https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Factions\u002Fworkflows\u002Funittest.yml\u002Fbadge.svg?branch=master\n   :target: https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Factions\u002Fworkflows\u002Funittest.yml\n   :alt: 构建状态\n\n.. image:: https:\u002F\u002Fcodecov.io\u002Fgh\u002Fuber\u002Fpetastorm\u002Fbranch\u002Fmaster\u002Fgraph\u002Fbadge.svg\n   :target: https:\u002F\u002Fcodecov.io\u002Fgh\u002Fuber\u002Fpetastorm\u002Fbranch\u002Fmaster\n   :alt: 代码覆盖率\n\n.. image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-blue.svg\n   :target: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-blue.svg\n   :alt: 许可证\n\n.. image:: https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fpetastorm.svg\n   :target: https:\u002F\u002Fpypi.org\u002Fproject\u002Fpetastorm\n   :alt: 最新版本\n\n.. inclusion-marker-start-do-not-remove\n\n.. contents::\n\n\nPetastorm 是 Uber ATG 开发的一款开源数据访问库。该库支持直接从 Apache Parquet 格式的数据集中进行单机或分布式深度学习模型训练与评估。Petastorm 支持流行的基于 Python 的机器学习框架，如 `Tensorflow \u003Chttp:\u002F\u002Fwww.tensorflow.org\u002F>`_、`PyTorch \u003Chttps:\u002F\u002Fpytorch.org\u002F>`_ 和 `PySpark \u003Chttp:\u002F\u002Fspark.apache.org\u002Fdocs\u002Flatest\u002Fapi\u002Fpython\u002Fpyspark.html>`_，同时也可在纯 Python 代码中使用。\n\n文档网站： `\u003Chttps:\u002F\u002Fpetastorm.readthedocs.io>`_\n\n\n\n安装\n------------\n\n.. code-block:: bash\n\n    pip install petastorm\n\n\n``petastorm`` 包定义了几项不会自动安装的额外依赖。这些附加组件包括：``tf``、``tf_gpu``、``torch``、``opencv``、``docs`` 和 ``test``。\n\n例如，要安装 GPU 版本的 TensorFlow 和 OpenCV，可以使用以下 pip 命令：\n\n.. 
code-block:: bash\n\n    pip install petastorm[opencv,tf_gpu]\n\n\n\n生成数据集\n--------------------\n\n使用 Petastorm 创建的数据集以 `Apache Parquet \u003Chttps:\u002F\u002Fparquet.apache.org\u002F>`_ 格式存储。除了 Parquet 的 schema 外，Petastorm 还存储更高层次的 schema 信息，将多维数组原生地集成到 Petastorm 数据集中。\n\nPetastorm 支持可扩展的数据编解码器。这些编解码器允许用户使用标准的数据压缩格式（如 jpeg、png），也可以实现自定义的编解码器。\n\n数据集的生成通过 PySpark 完成。PySpark 原生支持 Parquet 格式，因此可以在单机或 Spark 计算集群上轻松运行。以下是一个将一些随机数据写入表中的极简示例。\n\n\n.. code-block:: python\n\n   import numpy as np\n   from pyspark.sql import SparkSession\n   from pyspark.sql.types import IntegerType\n\n   from petastorm.codecs import ScalarCodec, CompressedImageCodec, NdarrayCodec\n   from petastorm.etl.dataset_metadata import materialize_dataset\n   from petastorm.unischema import dict_to_spark_row, Unischema, UnischemaField\n\n   # Schema 定义了数据集的结构\n   HelloWorldSchema = Unischema('HelloWorldSchema', [\n       UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),\n       UnischemaField('image1', np.uint8, (128, 256, 3), CompressedImageCodec('png'), False),\n       UnischemaField('array_4d', np.uint8, (None, 128, 30, None), NdarrayCodec(), False),\n   ])\n\n\n   def row_generator(x):\n       \"\"\"返回生成数据集中的单条记录。这里以返回一组随机值为例。\"\"\"\n       return {'id': x,\n               'image1': np.random.randint(0, 255, dtype=np.uint8, size=(128, 256, 3)),\n               'array_4d': np.random.randint(0, 255, dtype=np.uint8, size=(4, 128, 30, 3))}\n\n\n   def generate_petastorm_dataset(output_url='file:\u002F\u002F\u002Ftmp\u002Fhello_world_dataset'):\n       rowgroup_size_mb = 256\n\n       spark = SparkSession.builder.config('spark.driver.memory', '2g').master('local[2]').getOrCreate()\n       sc = spark.sparkContext\n\n       # 包装数据集物化部分。它会负责设置 Spark 环境变量，并保存 Petastorm 特有的元数据。\n       rows_count = 10\n       with materialize_dataset(spark, output_url, HelloWorldSchema, rowgroup_size_mb):\n\n           rows_rdd = sc.parallelize(range(rows_count))\\\n               .map(row_generator)\\\n               .map(lambda x: dict_to_spark_row(HelloWorldSchema, x))\n\n           spark.createDataFrame(rows_rdd, HelloWorldSchema.as_spark_schema()) \\\n               .coalesce(10) \\\n               .write \\\n               .mode('overwrite') \\\n               .parquet(output_url)\n\n\n- ``HelloWorldSchema`` 是一个 ``Unischema`` 对象的实例。\n  ``Unischema`` 能够将其字段类型渲染为不同框架特定的格式，例如：Spark 的 ``StructType``、TensorFlow 的 ``tf.DType`` 以及 NumPy 的 ``numpy.dtype``。\n- 要定义数据集字段，需要为每个 ``Unischema`` 字段指定类型、形状、编解码器实例，以及该字段是否可为空。\n- 我们使用 PySpark 来写入输出的 Parquet 文件。在这个例子中，我们是在本地机器上启动 PySpark（``.master('local[2]')``）。当然，对于更大规模的数据集生成，我们需要一个真正的计算集群。\n- 我们用 ``materialize_dataset`` 上下文管理器来包裹 Spark 数据集生成代码。该上下文管理器负责在开始时配置行组大小，并在结束时写入 Petastorm 特有的元数据。\n- 行生成代码应返回一个以字段名索引的 Python 字典。我们使用 ``row_generator`` 函数来实现这一点。\n- ``dict_to_spark_row`` 将字典转换为 ``pyspark.Row`` 对象，同时确保符合 ``HelloWorldSchema`` 的 schema 要求（会检查形状、类型和是否可空）。\n- 一旦有了 ``pyspark.DataFrame``，我们就将其写入 Parquet 存储中。Parquet 的 schema 会自动从 ``HelloWorldSchema`` 中推导出来。\n\n纯 Python API\n----------------\n``petastorm.reader.Reader`` 类是用户代码访问来自 Tensorflow 或 Pytorch 等机器学习框架数据的主要入口。Reader 具有多种功能，例如：\n\n- 选择性列读取\n- 多种并行策略：线程、进程、单线程（用于调试）\n- N-gram 读取支持\n- 行过滤（行谓词）\n- 打乱顺序\n- 用于多 GPU 训练的分区\n- 本地缓存\n\n使用 ``petastorm.reader.Reader`` 类读取数据集非常简单，可以通过 ``petastorm.make_reader`` 工厂方法创建：\n\n.. 
code-block:: python\n\n    from petastorm import make_reader\n\n    with make_reader('hdfs:\u002F\u002Fmyhadoop\u002Fsome_dataset') as reader:\n        for row in reader:\n            print(row)\n\n支持的 URL 协议包括 ``hdfs:\u002F\u002F...`` 和 ``file:\u002F\u002F...``。\n\n一旦 ``Reader`` 实例化后，就可以将其作为迭代器使用。\n\nTensorFlow API\n--------------\n\n要将 Reader 连接到 TensorFlow 图中，可以使用 ``tf_tensors`` 函数：\n\n.. code-block:: python\n\n    from petastorm.tf_utils import tf_tensors\n\n    with make_reader('file:\u002F\u002F\u002Fsome\u002Flocalpath\u002Fa_dataset') as reader:\n        row_tensors = tf_tensors(reader)\n        with tf.Session() as session:\n            for _ in range(3):\n                print(session.run(row_tensors))\n\n或者，您也可以使用新的 `tf.data.Dataset` API：\n\n.. code-block:: python\n\n    from petastorm.tf_utils import make_petastorm_dataset\n\n    with make_reader('file:\u002F\u002F\u002Fsome\u002Flocalpath\u002Fa_dataset') as reader:\n        dataset = make_petastorm_dataset(reader)\n        iterator = dataset.make_one_shot_iterator()\n        tensor = iterator.get_next()\n        with tf.Session() as sess:\n            sample = sess.run(tensor)\n            print(sample.id)\n\nPyTorch API\n-----------\n\n如 `pytorch_example.py \u003Chttps:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fblob\u002Fmaster\u002Fexamples\u002Fmnist\u002Fpytorch_example.py>`_ 中所示，\n通过 PyTorch 读取 Petastorm 数据集可以通过适配器类 `petastorm.pytorch.DataLoader` 来完成，\n该类允许提供自定义的 PyTorch 聚合函数和转换。\n\n请确保已安装 `torch` 和 `torchvision`：\n\n.. code-block:: bash\n\n    pip install torchvision\n\n下面的极简示例假设已定义了 `Net` 类以及 `train` 和 `test` 函数，这些内容包含在 `pytorch_example` 中：\n\n.. code-block:: python\n\n    import torch\n    from torchvision import transforms\n\n    from petastorm import make_reader\n    from petastorm.pytorch import DataLoader\n    from petastorm.transform import TransformSpec\n\n    torch.manual_seed(1)\n    device = torch.device('cpu')\n    model = Net().to(device)\n    optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.5)\n\n    def _transform_row(mnist_row):\n        transform = transforms.Compose([\n            transforms.ToTensor(),\n            transforms.Normalize((0.1307,), (0.3081,))\n        ])\n        return (transform(mnist_row['image']), mnist_row['digit'])\n\n\n    transform = TransformSpec(_transform_row, removed_fields=['idx'])\n\n    with DataLoader(make_reader('file:\u002F\u002F\u002Flocalpath\u002Fmnist\u002Ftrain', num_epochs=10,\n                                transform_spec=transform, seed=1, shuffle_rows=True), batch_size=64) as train_loader:\n        train(model, device, train_loader, 10, optimizer, 1)\n    with DataLoader(make_reader('file:\u002F\u002F\u002Flocalpath\u002Fmnist\u002Ftest', num_epochs=10,\n                                transform_spec=transform), batch_size=1000) as test_loader:\n        test(model, device, test_loader)\n\n如果您正在处理非常大的批量大小，并且不需要支持 Decimal\u002F字符串类型，我们提供了 `petastorm.pytorch.BatchedDataLoader`，它可以使用 Torch 张量（CPU 或 CUDA）进行缓冲，从而显著提高吞吐量。\n\n如果您的数据集大小可以容纳在系统内存中，您可以使用内存中的数据加载器 `petastorm.pytorch.InMemBatchedDataLoader`。此数据加载器只需读取一次数据集，并将数据缓存在内存中，以避免在多个 epoch 中进行额外的 I\u002FO 操作。\n\nSpark 数据集转换器 API\n---------------------------\n\nSpark 转换器 API 简化了从 Spark 到 TensorFlow 或 PyTorch 的数据转换过程。\n输入的 Spark DataFrame 首先被物化为 Parquet 格式，然后作为 `tf.data.Dataset` 或 `torch.utils.data.DataLoader` 加载。\n\n下面的极简示例假设已定义了一个编译好的 `tf.keras` 模型，以及一个包含特征列和标签列的 Spark DataFrame。\n\n.. code-block:: python\n\n    from petastorm.spark import SparkDatasetConverter, make_spark_converter\n    import tensorflow.compat.v1 as tf  # pylint: disable=import-error\n\n    # 首先指定一个缓存目录。\n    # 该目录用于保存物化的 Spark 数据帧文件\n    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'hdfs:\u002F...')\n\n    df = ... 
# `df` 是一个 Spark 数据帧\n\n    # 从 `df` 创建一个转换器\n    # 它会将 `df` 物化到缓存目录。\n    converter = make_spark_converter(df)\n\n    # 从 `converter` 创建一个 TensorFlow 数据集\n    with converter.make_tf_dataset() as dataset:\n        # `dataset` 是 `tf.data.Dataset` 对象\n        # 如果需要，可以对数据集进行转换\n        dataset = dataset.map(...)\n        # 我们可以在该数据集上训练\u002F评估模型\n        model.fit(dataset)\n        # 当退出上下文时，数据集的读取器会被关闭\n\n    # 删除数据帧的缓存文件。\n    converter.delete()\n\n下面的极简示例假设已定义了 `Net` 类以及 `train` 和 `test` 函数，这些内容包含在\n`pytorch_example.py \u003Chttps:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fblob\u002Fmaster\u002Fexamples\u002Fmnist\u002Fpytorch_example.py>`_ 中，\n以及一个包含特征列和标签列的 Spark DataFrame。\n\n.. code-block:: python\n\n    from petastorm.spark import SparkDatasetConverter, make_spark_converter\n\n    # 首先指定一个缓存目录。\n    # 该目录用于保存物化的 Spark 数据帧文件\n    spark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'hdfs:\u002F...')\n\n    df_train, df_test = ... # `df_train` 和 `df_test` 是 Spark 数据帧\n    model = Net()\n\n    # 从 `df_train` 创建 converter_train\n    # 它会将 `df_train` 物化到缓存目录。（`df_test` 同理）\n    converter_train = make_spark_converter(df_train)\n    converter_test = make_spark_converter(df_test)\n\n    # 从 `converter_train` 创建一个 PyTorch 数据加载器\n    with converter_train.make_torch_dataloader() as dataloader_train:\n        # `dataloader_train` 是 `torch.utils.data.DataLoader` 对象\n        # 我们可以使用该数据加载器训练模型\n        train(model, dataloader_train, ...)\n        # 当退出上下文时，数据集的读取器会被关闭\n\n    # 对于 `converter_test` 也执行相同的操作\n    with converter_test.make_torch_dataloader() as dataloader_test:\n        test(model, dataloader_test, ...)\n\n    # 删除数据帧的缓存文件。\n    converter_train.delete()\n    converter_test.delete()\n\n\n使用 PySpark 和 SQL 分析 Petastorm 数据集\n----------------------------------------------\n\nPetastorm 数据集可以使用 PySpark 读入 Spark DataFrame，之后即可使用各种 Spark 工具来分析和操作数据集。\n\n.. code-block:: python\n\n   # 从 Parquet 文件创建一个数据帧对象\n   dataframe = spark.read.parquet(dataset_url)\n\n   # 显示模式\n   dataframe.printSchema()\n\n   # 统计总数\n   dataframe.count()\n\n   # 显示单列\n   dataframe.select('id').show()\n\nSQL 可以用来查询 Petastorm 数据集：\n\n.. 
code-block:: python\n\n   spark.sql(\n      'SELECT count(id) '\n      'from parquet.`file:\u002F\u002F\u002Ftmp\u002Fhello_world_dataset`').collect()\n\n完整的代码示例可在以下位置找到：`pyspark_hello_world.py \u003Chttps:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fblob\u002Fmaster\u002Fexamples\u002Fhello_world\u002Fpetastorm_dataset\u002Fpyspark_hello_world.py>`_。\n\n非 Petastorm Parquet 存储\n----------------------------\nPetastorm 也可以直接从 Apache Parquet 存储中读取数据。为此，请使用 `make_batch_reader`（而不是 `make_reader`）。下表总结了 `make_batch_reader` 和 `make_reader` 函数之间的区别。\n\n==================================================================  =====================================================\n``make_reader``                                                     ``make_batch_reader``\n==================================================================  =====================================================\n仅限 Petastorm 数据集（使用 materialize_dataset 创建）         任何 Parquet 存储（某些原生 Parquet 列类型\n                                                                    尚不支持）。\n------------------------------------------------------------------  -----------------------------------------------------\n读取器每次返回一条记录。                            读取器返回一批记录。批次大小不固定，由 Parquet 行组\n                                                                    大小决定。\n------------------------------------------------------------------  -----------------------------------------------------\n传递给 ``make_reader`` 的谓词按单行评估。          传递给 ``make_batch_reader`` 的谓词按批次评估。\n------------------------------------------------------------------  -----------------------------------------------------\n可根据 ``filters`` 参数过滤 Parquet 文件。          可根据 ``filters`` 参数过滤 Parquet 文件。\n==================================================================  =====================================================\n\n\n故障排除\n---------------\n\n请参阅 故障排除_ 页面；如果您找不到答案，请提交工单_。\n\n\n另请参阅\n--------\n\n1. Gruener, R., Cheng, O., 和 Litvin, Y. (2018) *介绍 Petastorm：Uber ATG 的深度学习数据访问库*。网址：https:\u002F\u002Feng.uber.com\u002Fpetastorm\u002F\n2. QCon.ai 2019：`“Petastorm：构建机器学习流水线的轻量级方法” \u003Chttps:\u002F\u002Fwww.infoq.com\u002Fpresentations\u002Fpetastorm-ml-pipelines\u002F>`_。\n\n\n.. _Troubleshooting: docs\u002Ftroubleshoot.rst\n.. _ticket: https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fissues\u002Fnew\n.. _Development: docs\u002Fdevelopment.rst\n\n如何贡献\n=================\n\n我们更倾向于以 GitHub 拉取请求的形式接收贡献。请将拉取请求发送到 ``github.com\u002Fuber\u002Fpetastorm`` 仓库。\n\n- 如果您正在寻找可以贡献的内容，请查看 `GitHub 问题 \u003Chttps:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fissues>`_ 并在相关问题下留言。\n- 如果您有改进建议，或想报告一个错误但没有时间修复它，请 `创建一个 GitHub 问题 \u003Chttps:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fissues\u002Fnew>`_。\n\n要贡献补丁：\n\n- 尽可能将您的工作拆分为小型、单一功能的补丁。合并包含大量不相关功能的大规模更改会困难得多。\n- 将补丁作为针对主分支的 GitHub 拉取请求提交。有关教程，请参阅 GitHub 关于分叉仓库和发送拉取请求的指南。\n- 在拉取请求中包含对拟议更改的详细描述。\n- 确保您的代码通过单元测试。您可以在此处找到运行单元测试的说明：`这里 \u003Chttps:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fblob\u002Fmaster\u002Fdocs\u002Fdevelopment.rst>`_。\n- 为您的代码添加新的单元测试。\n\n提前感谢您的贡献！\n\n\n有关开发相关信息，请参阅 开发_。\n\n\n.. 
inclusion-marker-end-do-not-remove\n   Place contents above here if they should also appear in read-the-docs.\n   Contents below are already part of the read-the-docs table of contents.","# Petastorm 快速上手指南\n\nPetastorm 是 Uber ATG 开发的开源数据访问库，支持直接从 Apache Parquet 格式的数据集中进行单机或分布式的深度学习模型训练与评估。它完美适配 TensorFlow、PyTorch 和 PySpark 等主流框架。\n\n## 环境准备\n\n*   **操作系统**: Linux, macOS, Windows\n*   **Python 版本**: 3.6+\n*   **核心依赖**:\n    *   Apache Spark (PySpark): 用于生成数据集（可本地运行或集群运行）。\n    *   深度学习框架：根据需求选择安装 TensorFlow 或 PyTorch。\n*   **存储支持**: 支持本地文件系统 (`file:\u002F\u002F`) 和 HDFS (`hdfs:\u002F\u002F`)。\n\n> **国内加速建议**：\n> 在安装 Python 依赖时，推荐使用清华或阿里镜像源以提升下载速度：\n> `pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple ...`\n\n## 安装步骤\n\n### 1. 基础安装\n仅安装核心功能：\n```bash\npip install petastorm\n```\n\n### 2. 按需安装额外依赖\nPetastorm 将特定框架的依赖作为“extras”提供，需手动指定。\n\n*   **安装 PyTorch 支持**:\n    ```bash\n    pip install petastorm[torch]\n    ```\n*   **安装 TensorFlow GPU 版及 OpenCV 支持**:\n    ```bash\n    pip install petastorm[opencv,tf_gpu]\n    ```\n*   **其他可选组件**: `tf` (CPU 版 TF), `docs`, `test` 等。\n\n## 基本使用\n\nPetastorm 的工作流分为两步：**生成数据集**（使用 PySpark）和 **读取数据**（用于训练）。\n\n### 第一步：生成 Parquet 数据集\n\n使用 PySpark 将数据转换为 Petastorm 格式的 Parquet 文件。以下示例生成包含随机图像和数组的数据集：\n\n```python\nimport numpy as np\nfrom pyspark.sql import SparkSession\nfrom pyspark.sql.types import IntegerType\n\nfrom petastorm.codecs import ScalarCodec, CompressedImageCodec, NdarrayCodec\nfrom petastorm.etl.dataset_metadata import materialize_dataset\nfrom petastorm.unischema import dict_to_spark_row, Unischema, UnischemaField\n\n# 1. 定义数据架构 (Unischema)\nHelloWorldSchema = Unischema('HelloWorldSchema', [\n    UnischemaField('id', np.int32, (), ScalarCodec(IntegerType()), False),\n    UnischemaField('image1', np.uint8, (128, 256, 3), CompressedImageCodec('png'), False),\n    UnischemaField('array_4d', np.uint8, (None, 128, 30, None), NdarrayCodec(), False),\n])\n\ndef row_generator(x):\n    \"\"\"生成单条数据记录\"\"\"\n    return {'id': x,\n            'image1': np.random.randint(0, 255, dtype=np.uint8, size=(128, 256, 3)),\n            'array_4d': np.random.randint(0, 255, dtype=np.uint8, size=(4, 128, 30, 3))}\n\ndef generate_petastorm_dataset(output_url='file:\u002F\u002F\u002Ftmp\u002Fhello_world_dataset'):\n    rowgroup_size_mb = 256\n\n    # 初始化 Spark (本地模式)\n    spark = SparkSession.builder.config('spark.driver.memory', '2g').master('local[2]').getOrCreate()\n    sc = spark.sparkContext\n\n    rows_count = 10\n    \n    # 使用 materialize_dataset 上下文管理器处理元数据\n    with materialize_dataset(spark, output_url, HelloWorldSchema, rowgroup_size_mb):\n        rows_rdd = sc.parallelize(range(rows_count))\\\n            .map(row_generator)\\\n            .map(lambda x: dict_to_spark_row(HelloWorldSchema, x))\n\n        spark.createDataFrame(rows_rdd, HelloWorldSchema.as_spark_schema()) \\\n            .coalesce(10) \\\n            .write \\\n            .mode('overwrite') \\\n            .parquet(output_url)\n\n# 执行生成\nif __name__ == '__main__':\n    generate_petastorm_dataset()\n```\n\n### 第二步：读取数据\n\n#### 方式 A：纯 Python \u002F 通用迭代器\n最基础的读取方式，适用于调试或自定义数据处理逻辑。\n\n```python\nfrom petastorm import make_reader\n\n# 支持 file:\u002F\u002F 和 hdfs:\u002F\u002F 协议\nwith make_reader('file:\u002F\u002F\u002Ftmp\u002Fhello_world_dataset') as reader:\n    for row in reader:\n        print(row.id, row.image1.shape)\n```\n\n#### 方式 B：PyTorch 集成\n直接使用 `DataLoader` 对接 PyTorch 训练循环，支持数据变换（Transform）和洗牌（Shuffle）。\n\n```python\nimport torch\nfrom petastorm import 
make_reader\nfrom petastorm.pytorch import DataLoader\nfrom torchvision import transforms\n\n# 定义数据变换逻辑\ndef _transform_row(row):\n    transform = transforms.Compose([\n        transforms.ToTensor(),\n        transforms.Normalize((0.5,), (0.5,))\n    ])\n    # 返回 (输入数据，标签)\n    return (transform(row['image1']), row['id'])\n\nfrom petastorm.transform import TransformSpec\ntransform_spec = TransformSpec(_transform_row, removed_fields=['array_4d'])\n\n# 创建 DataLoader\nwith DataLoader(make_reader('file:\u002F\u002F\u002Ftmp\u002Fhello_world_dataset', \n                            num_epochs=1, \n                            shuffle_rows=True,\n                            transform_spec=transform_spec), \n                batch_size=4) as train_loader:\n    \n    for data, target in train_loader:\n        # 此处接入你的 PyTorch 训练代码\n        print(f\"Batch shape: {data.shape}\")\n```\n\n#### 方式 C：TensorFlow 集成\n支持 `tf.data.Dataset` API，可无缝嵌入 TensorFlow 计算图。\n\n```python\nfrom petastorm import make_reader\nfrom petastorm.tf_utils import make_petastorm_dataset\nimport tensorflow as tf\n\nwith make_reader('file:\u002F\u002F\u002Ftmp\u002Fhello_world_dataset') as reader:\n    # 转换为 tf.data.Dataset\n    dataset = make_petastorm_dataset(reader)\n    \n    # 应用常见的 TF 数据预处理\n    dataset = dataset.map(lambda x: (x.image1, x.id)).batch(4)\n    \n    iterator = iter(dataset)\n    sample = next(iterator)\n    print(sample)\n```\n\n#### 方式 D：Spark DataFrame 直接转换\n如果你已经有了 Spark DataFrame，可以使用转换器直接生成训练数据，无需手动编写 Parquet 生成代码。\n\n```python\nfrom petastorm.spark import SparkDatasetConverter, make_spark_converter\nimport tensorflow.compat.v1 as tf\n\n# 设置缓存目录 (必须配置)\nspark.conf.set(SparkDatasetConverter.PARENT_CACHE_DIR_URL_CONF, 'file:\u002F\u002F\u002Ftmp\u002Fpetastorm_cache')\n\n# 假设 df 是一个已有的 Spark DataFrame\n# converter = make_spark_converter(df) \n\n# 生成 TF 数据集\n# with converter.make_tf_dataset() as dataset:\n#     model.fit(dataset)\n    \n# 使用完毕后清理缓存\n# converter.delete()\n```","某自动驾驶团队需要利用存储在数据湖中的海量行车记录（包含高分辨率图像和传感器数组）训练深度学习感知模型。\n\n### 没有 petastorm 时\n- **数据加载瓶颈**：工程师需编写复杂的自定义代码将 Parquet 文件转换为 TFRecord 或大量小图片文件，预处理耗时数天且容易出错。\n- **内存溢出风险**：直接读取多维数组数据时，常因无法高效流式加载导致单机内存爆满，被迫缩小批量大小（Batch Size）。\n- **框架适配困难**：在 PyTorch 和 TensorFlow 之间切换实验时，必须重写整套数据输入管道（Input Pipeline），维护成本极高。\n- **分布式训练复杂**：在 Spark 集群上进行分布式训练时，缺乏原生支持，需手动协调数据分片与 Worker 节点的对齐逻辑。\n\n### 使用 petastorm 后\n- **直通式训练**：petastorm 支持直接从 Apache Parquet 格式读取数据，无需任何中间格式转换，数据准备时间从几天缩短至几小时。\n- **高效流式读取**：内置的多维数组编解码器实现了按需解压和流式传输，轻松处理 GB 级图像张量，彻底解决内存溢出问题。\n- **框架无缝切换**：提供统一的 Python 接口同时兼容 TensorFlow、PyTorch 和 PySpark，同一份数据集配置可直接用于不同框架的实验。\n- **原生分布式支持**：完美集成 Spark 生态，自动处理数据分片与并行加载，让单机调试到集群大规模训练的扩展变得平滑自然。\n\npetastorm 通过打通大数据存储与深度学习训练之间的“最后一公里”，让算法团队能专注于模型优化而非数据工程。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fuber_petastorm_fbe9d848.png","uber","Uber Open Source","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fuber_3a532a8c.png","Open Source at Uber",null,"http:\u002F\u002Fwww.uber.com","https:\u002F\u002Fgithub.com\u002Fuber",[81,85,89,93],{"name":82,"color":83,"percentage":84},"Python","#3572A5",99.6,{"name":86,"color":87,"percentage":88},"Dockerfile","#384d54",0.3,{"name":90,"color":91,"percentage":92},"Makefile","#427819",0.1,{"name":94,"color":95,"percentage":96},"Shell","#89e051",0,1884,285,"2026-04-14T14:32:07","Apache-2.0","未说明","非必需。支持通过安装额外依赖 'tf_gpu' 使用 GPU 版本的 TensorFlow，但未指定具体显卡型号、显存大小或 CUDA 版本要求。","未说明（示例代码中建议 Spark 驱动内存配置为 2g，实际需求取决于数据集规模）",{"notes":105,"python":101,"dependencies":106},"该工具主要用于从 Apache Parquet 
格式数据集中读取数据以进行深度学习训练。核心依赖 PySpark 生成数据集。GPU 支持是可选的，需通过 pip 安装特定额外依赖（如 petastorm[tf_gpu]）。支持本地文件系统 (file:\u002F\u002F) 和 HDFS (hdfs:\u002F\u002F) 存储后端。",[107,108,109,110,111,112],"pyspark","pyarrow","numpy","tensorflow (可选，通过 tf 或 tf_gpu  extras)","pytorch (可选，通过 torch extra)","opencv-python (可选，通过 opencv extra)",[14],[115,116,117,118,119,107,108,120,121],"tensorflow","pytorch","deep-learning","machine-learning","sysml","parquet","parquet-files","2026-03-27T02:49:30.150509","2026-04-15T08:09:34.746749",[125,130,135,140,145,149],{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},34028,"如何读取由 AWS Athena 生成的 Parquet 文件时遇到 'requested metadata for row group' 错误？","这是一个已确认的 Bug。如果代码中使用了 `dataset.piece[0]` 这样的索引访问，尝试将其修改为直接使用 `piece` 对象通常可以解决问题。维护者已在后续版本中修复了此问题，建议升级到最新版本。","https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fissues\u002F447",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},34029,"运行 HelloWorld 示例时遇到 'Could not find _common_metadata file' 错误怎么办？","该错误通常是因为缺少 Petastorm 特有的元数据文件。解决方法有两种：1. 在生成数据集的 ETL 代码中使用 `materialize_dataset(..)` 上下文管理器（位于 `petastorm.etl.dataset_metadata.py`）来自动生成该文件；2. 对于已存在的数据集，可以使用命令行工具 `petastorm-generate-metadata.py` 来补全元数据。此外，确保在 Conda 环境中正确安装了 `libhdfs3` 也可能解决相关依赖问题。","https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fissues\u002F475",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},34030,"Petastorm 是否支持将数据集划分为训练集和测试集（Train-Test Split）？","Petastorm 本身没有直接的拆分函数，但可以通过在 Schema 中定义一个分区字段（如 'splitName'），并在读取时使用谓词过滤（Predicate Filtering）来实现。具体步骤：1. 在 `UnischemaField` 中使用 `np.unicode_` 类型定义分区字段；2. 创建自定义谓词类继承自 `PredicateBase`，在 `do_include` 方法中判断字段值（注意 Python 3 中可能需要对字节串进行 `.decode()` 处理）；3. 将该谓词传递给 `make_reader` 以仅加载特定分区（如 'train' 或 'test'）的数据。","https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fissues\u002F391",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},34031,"在使用 PyArrow 进行谓词过滤时，如何处理 `filesystem` 等参数冲突的问题？","Petastorm 计划利用 PyArrow ParquetDataset 原生的谓词过滤功能。如果用户在 `pa_dataset_args` 中提供了 `filesystem`、`validate_schema` 或 `metadata_nthreads` 等参数，系统会发出警告提示这些参数将被覆盖。为了保持 API 兼容性，目前仍保留这些参数，但在底层实现上会优先使用 PyArrow 的原生能力。","https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fissues\u002F20",{"id":146,"question_zh":147,"answer_zh":148,"source_url":139},34032,"如何在定义 Unischema 字段时正确处理字符串类型以避免 Unicode 错误？","建议在定义 `UnischemaField` 时使用 `np.unicode_` 作为 numpy 数据类型，这样可以透明地兼容 Python 2 和 Python 3。避免使用 `np.string_`，因为它可能导致编码问题。示例代码：`UnischemaField('partition_key', np.unicode_, (), ScalarCodec(StringType()), False)`。如果在读取时发现需要解码（decode），这通常是 Python 版本差异导致的，使用 `np.unicode_` 可从根本上避免此类麻烦。",{"id":150,"question_zh":151,"answer_zh":152,"source_url":153},34033,"如何优化 ProcessWorker 中的 ZeroMQ 性能？","可以通过禁用 ZeroMQ 的复制缓冲区（copy buffers）来优化 `ProcessWorker` 的性能。虽然这一选项最初未暴露在顶层工厂方法（如 `make_reader` 和 
`make_batch_reader`）中，但社区已确认其有效性并推动将其暴露给用户。对于高级用户，可以通过配置相关参数或直接修改源码来启用此优化，从而减少内存拷贝开销，提升数据传输效率。","https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fissues\u002F595",[155,160,165,170,175,179,183,188,192,196,200,205,209,214,218,223,227,232,236,241],{"id":156,"version":157,"summary_zh":158,"released_at":159},263937,"v0.13.1","https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fblob\u002Fmaster\u002Fdocs\u002Frelease-notes.rst#release-0131\n","2026-01-02T23:20:39",{"id":161,"version":162,"summary_zh":163,"released_at":164},263938,"v0.13.0rc0","https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fblob\u002Fmaster\u002Fdocs\u002Frelease-notes.rst#release-0130\n","2025-08-10T07:30:07",{"id":166,"version":167,"summary_zh":168,"released_at":169},263939,"v0.12.2rc0","https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fblob\u002Fmaster\u002Fdocs\u002Frelease-notes.rst#release-0122\n","2023-02-03T00:30:46",{"id":171,"version":172,"summary_zh":173,"released_at":174},263940,"v0.12.1","https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fblob\u002Fmaster\u002Fdocs\u002Frelease-notes.rst#release-0121\n","2022-12-16T20:51:38",{"id":176,"version":177,"summary_zh":173,"released_at":178},263941,"v0.12.1rc1","2022-12-16T06:17:07",{"id":180,"version":181,"summary_zh":173,"released_at":182},263942,"v0.12.1rc0","2022-12-14T05:20:45",{"id":184,"version":185,"summary_zh":186,"released_at":187},263943,"v0.12.0","https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fblob\u002Fmaster\u002Fdocs\u002Frelease-notes.rst#release-0120\n","2022-08-25T00:20:43",{"id":189,"version":190,"summary_zh":186,"released_at":191},263944,"v0.12.0rc3","2022-08-24T20:23:23",{"id":193,"version":194,"summary_zh":186,"released_at":195},263945,"v0.12.0rc2","2022-08-24T20:18:35",{"id":197,"version":198,"summary_zh":186,"released_at":199},263946,"v0.12.0rc1","2022-08-24T00:06:28",{"id":201,"version":202,"summary_zh":203,"released_at":204},263947,"v0.11.5","https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fblob\u002Fmaster\u002Fdocs\u002Frelease-notes.rst#release-0115\n","2022-07-28T07:16:45",{"id":206,"version":207,"summary_zh":203,"released_at":208},263948,"v0.11.5rc0","2022-07-27T20:29:39",{"id":210,"version":211,"summary_zh":212,"released_at":213},263949,"v0.11.4","https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fblob\u002Fmaster\u002Fdocs\u002Frelease-notes.rst#release-0114\n","2022-02-19T02:46:44",{"id":215,"version":216,"summary_zh":212,"released_at":217},263950,"v0.11.4rc0","2022-02-15T17:28:02",{"id":219,"version":220,"summary_zh":221,"released_at":222},263951,"v0.11.3","https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fblob\u002Fmaster\u002Fdocs\u002Frelease-notes.rst#release-0113\n","2021-09-04T05:46:37",{"id":224,"version":225,"summary_zh":221,"released_at":226},263952,"v0.11.3rc1","2021-09-03T16:29:24",{"id":228,"version":229,"summary_zh":230,"released_at":231},263953,"v0.11.2","https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fblob\u002Fmaster\u002Fdocs\u002Frelease-notes.rst#release-0112\n","2021-08-03T05:30:19",{"id":233,"version":234,"summary_zh":230,"released_at":235},263954,"v0.11.2rc1","2021-07-30T14:07:27",{"id":237,"version":238,"summary_zh":239,"released_at":240},263955,"v0.11.1","PR 687 (resolves issue #684 ): Fix a failure when reading data from a parquet file (and not a parquet directory).\r\nPR 686 (resolves issue #685 ): Silenty omit fields that have unsupported types. 
Previously were failing loudly making parquet stores with such fields unusable with Petastorm.","2021-06-24T15:50:55",{"id":242,"version":243,"summary_zh":244,"released_at":245},263956,"v0.9.3","Thanks to our new contributors: Travis Addair and Ryan (rb-determined-ai).\r\n\r\n- Retire support for Python 2.\r\n- [PR 568](https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fpull\u002F568): Added additional kwargs for Spark Dataset Converter.\r\n- [PR 564](https:\u002F\u002Fgithub.com\u002Fuber\u002Fpetastorm\u002Fpull\u002F564): Expose filters (PyArrow filters) argument in make_reader and make_batch_reader\r\n\r\n","2020-07-24T21:22:17"]