[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-basveeling--pcam":3,"tool-basveeling--pcam":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",147882,2,"2026-04-09T11:32:47",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108111,"2026-04-08T11:23:26",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 
助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":77,"owner_email":75,"owner_twitter":72,"owner_website":75,"owner_url":78,"languages":79,"stars":84,"forks":85,"last_commit_at":86,"license":87,"difficulty_score":32,"env_os":88,"env_gpu":89,"env_ram":90,"env_deps":91,"category_tags":97,"github_topics":99,"view_count":32,"oss_zip_url":75,"oss_zip_packed_at":75,"status":17,"created_at":105,"updated_at":106,"faqs":107,"releases":138},5911,"basveeling\u002Fpcam","pcam","The PatchCamelyon (PCam) deep learning classification benchmark.","pcam 是一个专为深度学习设计的图像分类基准数据集，旨在推动机器学习在医疗影像领域的应用。它包含超过 32 万张从淋巴结组织病理扫描中提取的彩色小图（96x96 像素），每张图像都标注了是否含有转移性肿瘤组织的二元标签。\n\n传统机器学习进展多依赖 MNIST 或 CIFAR 等自然图像数据集，而 pcam 填补了医疗领域缺乏标准化、易上手基准数据的空白。它将临床上复杂的转移灶检测任务，简化为类似 CIFAR-10 的直接图像分类问题，既保留了临床相关性，又降低了研究门槛。研究人员可以在单块 GPU 上仅用数小时完成模型训练，并在肿瘤检测任务中获得具有竞争力的结果。\n\npcam 特别适合人工智能研究人员、算法开发者以及关注医疗影像分析的学者使用。其独特的价值在于平衡了任务难度与可执行性：规模大于 CIFAR-10 但小于 
ImageNet，非常适合作为主动学习、模型不确定性分析及可解释性研究等前沿方向的实验平台。通过提供标准化的数据划分和便捷的下载方式，pcam 帮助社区更高效地验证新算法，促进有益于医疗诊断的技术发展。","# PatchCamelyon (PCam)\n_That which is measured, improves._ - Karl Pearson\n\nThe PatchCamelyon benchmark is a new and challenging image classification dataset. It consists of 327.680 color images (96 x 96px) extracted from histopathologic scans of lymph node sections. Each image is annoted with a binary label indicating presence of metastatic tissue. PCam provides a new benchmark for machine learning models: bigger than CIFAR10, smaller than imagenet, trainable on a single GPU.\n\n![PCam example images. Green boxes indicate positive labels.](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbasveeling_pcam_readme_2cdf5ae357e0.jpg)\n*Example images from PCam. Green boxes indicate tumor tissue in center region, which dictates a positive label.*\n\n\u003Cdetails>\u003Csummary>Table of Contents\u003C\u002Fsummary>\u003Cp>\n\n* [Why PCam](#why-pcam)\n* [Download](#download)\n* [Details](#details)\n* [Usage and Tips](#usage-and-tips)\n* [Benchmark](#benchmark)\n* [Visualization](#visualization)\n* [Contributing](#contributing)\n* [Contact](#contact)\n* [Citing PCam](#citing-pcam)\n* [License](#license)\n\u003C\u002Fp>\u003C\u002Fdetails>\u003Cp>\u003C\u002Fp>\n\n## Why PCam\nFundamental machine learning advancements are predominantly evaluated on straight-forward natural-image classification datasets. Think MNIST, CIFAR, SVHN. Medical imaging is becoming one of the major applications of ML and we believe it deserves a spot on the list of _go-to_ ML datasets. Both to challenge future work, and to steer developments into directions that are beneficial for this domain.\n\nWe think PCam can play a role in this. It packs the clinically-relevant task of metastasis detection into a straight-forward binary image classification task, akin to CIFAR-10 and MNIST. 
Models can easily be trained on a single GPU in a couple hours, and achieve competitive scores in the Camelyon16 tasks of tumor detection and WSI diagnosis. Furthermore, the balance between task-difficulty and tractability makes it a prime suspect for fundamental machine learning research on topics as active learning, model uncertainty and explainability.\n\n\n## Download\nThe data is stored in gzipped HDF5 files and can be downloaded using the following links. Each set consist of a data and target file. An additional meta csv file is provided which describes from which Camelyon16 slide the patches were extracted from, but this information is not used in training for or evaluating the benchmark. Please report any downloading problems via a github issue.\n\nDownload all at once from [Google Drive](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1gHou49cA1s5vua2V5L98Lt8TiWA3FrKB?usp=sharing).\n\n| Name  | Content | Size | Link | MD5 Checksum|\n| --- | --- |--- | --- |--- |\n| `camelyonpatch_level_2_split_train_x.h5.gz` | training images | 6.1 GB | [Download](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=1Ka0XfEMiwgCYPdTI-vv6eUElOBnKFKQ2)|`1571f514728f59376b705fc836ff4b63`|\n| `camelyonpatch_level_2_split_train_y.h5.gz` | training labels | 21 KB | [Download](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=1269yhu3pZDP8UYFQs-NYs3FPwuK-nGSG)|`35c2d7259d906cfc8143347bb8e05be7`|\n| `camelyonpatch_level_2_split_valid_x.h5.gz` | valid images | 0.8 GB | [Download](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=1hgshYGWK8V-eGRy8LToWJJgDU_rXWVJ3)|`d8c2d60d490dbd479f8199bdfa0cf6ec`|\n| `camelyonpatch_level_2_split_valid_y.h5.gz` | valid labels | 3.0 KB | [Download](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=1bH8ZRbhSVAhScTS0p9-ZzGnX91cHT3uO)|`60a7035772fbdb7f34eb86d4420cf66a`|\n| `camelyonpatch_level_2_split_test_x.h5.gz`  | test images  | 0.8 GB | 
[Download](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=1qV65ZqZvWzuIVthK8eVDhIwrbnsJdbg_)|`d5b63470df7cfa627aeec8b9dc0c066e`|\n| `camelyonpatch_level_2_split_test_y.h5.gz`  | test labels  | 3.0 KB | [Download](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=17BHrSrwWKjYsOgTMmoqrIjDy6Fa2o_gP)|`2b85f58b927af9964a4c15b8f7e8f179`|\n| `camelyonpatch_level_2_split_train_meta.csv` | training meta |  | [Download](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=1XoaGG3ek26YLFvGzmkKeOz54INW0fruR)|`5a3dd671e465cfd74b5b822125e65b0a`|\n| `camelyonpatch_level_2_split_valid_meta.csv` | valid meta | | [Download](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=16hJfGFCZEcvR3lr38v3XCaD5iH1Bnclg)|`3455fd69135b66734e1008f3af684566`|\n| `camelyonpatch_level_2_split_test_meta.csv`  | test meta |  | [Download](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=19tj7fBlQQrd4DapCjhZrom_fA4QlHqN4)|`67589e00a4a37ec317f2d1932c7502ca`|\n\n#### Mirror Zenodo:\nhttps:\u002F\u002Fzenodo.org\u002Frecord\u002F2546921\n\n#### Baidu AI Studio:\nhttps:\u002F\u002Faistudio.baidu.com\u002Faistudio\u002Fdatasetdetail\u002F30060\n\n## Usage and Tips\n### Keras Example\n[General dataloader for keras](https:\u002F\u002Fgithub.com\u002Fbasveeling\u002Fpcam\u002Fblob\u002Fmaster\u002Fkeras_pcam\u002Fdataset\u002Fpcam.py)\n\n```python\nfrom keras.utils import HDF5Matrix\nfrom keras.preprocessing.image import ImageDataGenerator\n\nx_train = HDF5Matrix('camelyonpatch_level_2_split_train_x.h5', 'x')\ny_train = HDF5Matrix('camelyonpatch_level_2_split_train_y.h5', 'y')\n\ndatagen = ImageDataGenerator(\n              preprocessing_function=lambda x: x\u002F255.,\n              width_shift_range=4,  # randomly shift images horizontally\n              height_shift_range=4,  # randomly shift images vertically \n              horizontal_flip=True,  # randomly flip images\n              vertical_flip=True)  # randomly flip images\n              
\nmodel.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),\n                    steps_per_epoch=len(x_train) \u002F\u002F batch_size,\n                    epochs=1024,\n                    )\n```\n\n## Details\n### Numbers\nThe dataset is divided into a training set of 262.144 (2^18) examples, and a validation and test set both of 32.768 (2^15) examples. There is no overlap in WSIs between the splits, and all splits have a 50\u002F50 balance between positive and negative examples.\n\n### Labeling\nA positive label indicates that the center 32x32px region of a patch contains at least one pixel of tumor tissue. Tumor tissue in the outer region of the patch does not influence the label. This outer region is provided to enable the design of fully-convolutional models that do not use any zero-padding, to ensure consistent behavior when applied to a whole-slide image. This is however not a requirement for the PCam benchmark.\n\n### Patch selection \nPCam is derived from the Camelyon16 Challenge [2], which contains 400 H\\&E stained WSIs of sentinel lymph node sections. The slides were acquired and digitized at 2 different centers  using a 40x objective (resultant pixel resolution of 0.243 microns). We undersample this at 10x to increase the field of view.\nWe follow the train\u002Ftest split from the Camelyon16 challenge [2], and further hold-out 20% of the train WSIs for the validation set. To prevent selecting background patches, slides are converted to HSV, blurred, and patches filtered out if maximum pixel saturation lies below 0.07 (which was validated to not throw out tumor data in the training set).\nThe patch-based dataset is sampled by iteratively choosing a WSI and selecting a positive or negative patch with probability _p_. 
Patches are rejected following a stochastic hard-negative mining scheme with a small CNN, and _p_ is adjusted to retain a balance close to 50\u002F50.\n\n### Statistics\n_Coming soon_\n\n## Contact\nFor problems and questions not fit for a github issue, please email [Bas Veeling](mailto:basveeling+pcam@gmail.com).\n## Citing PCam\nIf you use PCam in a scientific publication, we would appreciate references to the following paper:\n\n\n**[1] B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, M. Welling. \"Rotation Equivariant CNNs for Digital Pathology\". [arXiv:1806.03962](http:\u002F\u002Farxiv.org\u002Fabs\u002F1806.03962)**\n\nA citation of the original Camelyon16 dataset paper is appreciated as well:\n\n**[2] Ehteshami Bejnordi et al. Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer. JAMA: The Journal of the American Medical Association, 318(22), 2199–2210. [doi:jama.2017.14585](https:\u002F\u002Fdoi.org\u002F10.1001\u002Fjama.2017.14585)**\n\n\nBiblatex entry:\n```bibtex\n@ARTICLE{Veeling2018-qh,\n  title         = \"Rotation Equivariant {CNNs} for Digital Pathology\",\n  author        = \"Veeling, Bastiaan S and Linmans, Jasper and Winkens, Jim and\n                   Cohen, Taco and Welling, Max\",\n  month         =  jun,\n  year          =  2018,\n  archivePrefix = \"arXiv\",\n  primaryClass  = \"cs.CV\",\n  eprint        = \"1806.03962\"\n}\n```\n\n\u003C!-- [Who is citing PCam?](https:\u002F\u002Fscholar.google.de\u002Fscholar?hl=en&as_sdt=0%2C5&q=pcam&btnG=&oq=fas) -->\n\n\n## Benchmark\n| Name  | Reference | Augmentations | Acc | AUC|  NLL | FROC* |\n| --- | --- | --- | --- | --- | --- | --- |\n| GDensenet | [1] | Following Liu et al. 
| 89.8 | 96.3 |  0.260 |75.8 (64.3, 87.2)|\n| [Add yours](https:\u002F\u002Fgithub.com\u002Fbasveeling\u002Fpcam\u002Fedit\u002Fmaster\u002FREADME.md) | |\n\n\\* Performance on Camelyon16 tumor detection task, not part of the PCam benchmark.\n\n\n## Contributing\nContributions with example scripts for other frameworks are welcome!\n\n## License\nThe data is provided under the [CC0 License](https:\u002F\u002Fchoosealicense.com\u002Flicenses\u002Fcc0-1.0\u002F), following the license of Camelyon16.\n\nThe rest of this repository is under the [MIT License](https:\u002F\u002Fchoosealicense.com\u002Flicenses\u002Fmit\u002F).\n\n## Acknowledgements\n* Babak Ehteshami Bejnordi, Geert Litjens, Jeroen van der Laak for their input on the configuration of this dataset.\n* README derived from [Fashion-MNIST](https:\u002F\u002Fgithub.com\u002Fzalandoresearch\u002Ffashion-mnist).\n","# PatchCamelyon (PCam)\n_可测量的事物，才能改进。_ - 卡尔·皮尔逊\n\nPatchCamelyon 基准是一个全新且极具挑战性的图像分类数据集。它由 327,680 张彩色图像（96×96 像素）组成，这些图像取自淋巴结切片的组织病理学扫描。每张图像都附有一个二元标签，用于指示是否存在转移性组织。PCam 为机器学习模型提供了一个新的基准：规模大于 CIFAR-10，小于 ImageNet，可在单块 GPU 上进行训练。\n\n![PCam 示例图像。绿色框表示阳性标签。](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbasveeling_pcam_readme_2cdf5ae357e0.jpg)\n*PCam 中的示例图像。绿色框标示出中心区域的肿瘤组织，该区域决定了图像的阳性标签。*\n\n\u003Cdetails>\u003Csummary>目录\u003C\u002Fsummary>\u003Cp>\n\n* [为何选择 PCam](#why-pcam)\n* [下载](#download)\n* [详细信息](#details)\n* [使用与技巧](#usage-and-tips)\n* [基准测试](#benchmark)\n* [可视化](#visualization)\n* [贡献](#contributing)\n* [联系方式](#contact)\n* [引用 PCam](#citing-pcam)\n* [许可证](#license)\n\u003C\u002Fp>\u003C\u002Fdetails>\u003Cp>\u003C\u002Fp>\n\n## 为何选择 PCam\n机器学习领域的基础性进展通常是在一些较为简单的自然图像分类数据集上进行评估的，比如 MNIST、CIFAR 和 SVHN。然而，医学影像正逐渐成为机器学习的主要应用领域之一，我们认为它理应被列入“首选”机器学习数据集之列。这不仅能够推动未来研究的创新，还能引导相关技术朝着对该领域有益的方向发展。\n\n我们相信，PCam 可以在这一过程中发挥重要作用。它将具有临床意义的转移灶检测任务简化为一个直接的二分类图像任务，类似于 CIFAR-10 和 MNIST。模型只需在单块 GPU 上运行数小时即可完成训练，并在 Camelyon16 
的肿瘤检测和全切片图像诊断任务中取得具有竞争力的成绩。此外，任务难度与可操作性之间的平衡使其成为主动学习、模型不确定性及可解释性等基础机器学习研究的理想对象。\n\n## 下载\n数据以 gzip 压缩的 HDF5 文件形式存储，可通过以下链接下载。每个数据集包含数据文件和标签文件。此外，还提供了一个元数据 CSV 文件，其中描述了补丁是从 Camelyon16 的哪张切片中提取的，但此信息并未用于基准的训练或评估。如遇下载问题，请通过 GitHub 问题提交反馈。\n\n可从 [Google Drive](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1gHou49cA1s5vua2V5L98Lt8TiWA3FrKB?usp=sharing) 一次性下载所有文件。\n\n| 名称  | 内容 | 大小 | 链接 | MD5 校验码 |\n| --- | --- |--- | --- |--- |\n| `camelyonpatch_level_2_split_train_x.h5.gz` | 训练图像 | 6.1 GB | [下载](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=1Ka0XfEMiwgCYPdTI-vv6eUElOBnKFKQ2)|`1571f514728f59376b705fc836ff4b63`|\n| `camelyonpatch_level_2_split_train_y.h5.gz` | 训练标签 | 21 KB | [下载](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=1269yhu3pZDP8UYFQs-NYs3FPwuK-nGSG)|`35c2d7259d906cfc8143347bb8e05be7`|\n| `camelyonpatch_level_2_split_valid_x.h5.gz` | 验证图像 | 0.8 GB | [下载](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=1hgshYGWK8V-eGRy8LToWJJgDU_rXWVJ3)|`d8c2d60d490dbd479f8199bdfa0cf6ec`|\n| `camelyonpatch_level_2_split_valid_y.h5.gz` | 验证标签 | 3.0 KB | [下载](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=1bH8ZRbhSVAhScTS0p9-ZzGnX91cHT3uO)|`60a7035772fbdb7f34eb86d4420cf66a`|\n| `camelyonpatch_level_2_split_test_x.h5.gz`  | 测试图像  | 0.8 GB | [下载](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=1qV65ZqZvWzuIVthK8eVDhIwrbnsJdbg_)|`d5b63470df7cfa627aeec8b9dc0c066e`|\n| `camelyonpatch_level_2_split_test_y.h5.gz`  | 测试标签  | 3.0 KB | [下载](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=17BHrSrwWKjYsOgTMmoqrIjDy6Fa2o_gP)|`2b85f58b927af9964a4c15b8f7e8f179`|\n| `camelyonpatch_level_2_split_train_meta.csv` | 训练元数据 |  | [下载](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=1XoaGG3ek26YLFvGzmkKeOz54INW0fruR)|`5a3dd671e465cfd74b5b822125e65b0a`|\n| `camelyonpatch_level_2_split_valid_meta.csv` | 验证元数据 | | 
[下载](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=16hJfGFCZEcvR3lr38v3XCaD5iH1Bnclg)|`3455fd69135b66734e1008f3af684566`|\n| `camelyonpatch_level_2_split_test_meta.csv`  | 测试元数据 |  | [下载](https:\u002F\u002Fdrive.google.com\u002Fuc?export=download&id=19tj7fBlQQrd4DapCjhZrom_fA4QlHqN4)|`67589e00a4a37ec317f2d1932c7502ca`|\n\n#### Zenodo 镜像：\nhttps:\u002F\u002Fzenodo.org\u002Frecord\u002F2546921\n\n#### 百度 AI Studio：\nhttps:\u002F\u002Faistudio.baidu.com\u002Faistudio\u002Fdatasetdetail\u002F30060\n\n## 使用与技巧\n### Keras 示例\n[Keras 的通用数据加载器](https:\u002F\u002Fgithub.com\u002Fbasveeling\u002Fpcam\u002Fblob\u002Fmaster\u002Fkeras_pcam\u002Fdataset\u002Fpcam.py)\n\n```python\nfrom keras.utils import HDF5Matrix\nfrom keras.preprocessing.image import ImageDataGenerator\n\nx_train = HDF5Matrix('camelyonpatch_level_2_split_train_x.h5', 'x')\ny_train = HDF5Matrix('camelyonpatch_level_2_split_train_y.h5', 'y')\n\ndatagen = ImageDataGenerator(\n              preprocessing_function=lambda x: x\u002F255.,\n              width_shift_range=4,  # 随机水平平移图像\n              height_shift_range=4,  # 随机垂直平移图像\n              horizontal_flip=True,  # 随机翻转图像\n              vertical_flip=True)  # 随机翻转图像\n              \nmodel.fit_generator(datagen.flow(x_train, y_train, batch_size=batch_size),\n                    steps_per_epoch=len(x_train) \u002F\u002F batch_size,\n                    epochs=1024,\n                    )\n```\n\n## 详细信息\n### 数据量\n该数据集分为训练集、验证集和测试集三部分：训练集包含 262,144 个样本（2^18），验证集和测试集各包含 32,768 个样本（2^15）。各划分之间不存在全切片图像的重叠，且所有划分中的阳性与阴性样本比例均为 50\u002F50。\n\n### 标注规则\n阳性标签表示补丁中心区域的 32×32 像素内至少包含一个肿瘤细胞像素。补丁外围区域的肿瘤组织不会影响标签判定。提供外围区域是为了便于设计无需零填充的全卷积模型，从而确保其在应用于整张切片图像时行为一致。不过，这并非 PCam 基准的强制要求。\n\n### 片段选择\nPCam 数据集源自 Camelyon16 挑战赛 [2]，该数据集包含 400 张前哨淋巴结切片的 H&E 染色全切片图像（WSI）。这些载玻片由两家不同的中心使用 40 倍物镜采集并数字化，最终像素分辨率为 0.243 微米。我们将其下采样至 10 倍放大率，以扩大视野范围。\n我们沿用了 Camelyon16 挑战赛 [2] 的训练\u002F测试划分方式，并进一步从训练 WSI 中抽出 20% 
作为验证集。为避免选取背景片段，我们将载玻片转换为 HSV 颜色空间并进行模糊处理，若最大像素饱和度低于 0.07，则过滤掉该片段（经验证此阈值不会误剔除训练集中的肿瘤数据）。\n基于片段的数据集通过迭代选择一张 WSI，并以概率 _p_ 选取阳性或阴性片段来采样。随后，利用一个小型卷积神经网络按照随机硬负样本挖掘策略拒绝部分片段，并调整 _p_ 值，以维持接近 50\u002F50 的正负样本比例。\n\n### 统计信息\n*即将发布*\n\n## 联系方式\n如遇不适合在 GitHub 上提交的问题或疑问，请发送邮件至 [Bas Veeling](mailto:basveeling+pcam@gmail.com)。\n## 引用 PCam\n若您在科研出版物中使用了 PCam 数据集，我们非常感谢您引用以下论文：\n\n\n**[1] B. S. Veeling, J. Linmans, J. Winkens, T. Cohen, M. Welling. “用于数字病理学的旋转等变 CNN”。[arXiv:1806.03962](http:\u002F\u002Farxiv.org\u002Fabs\u002F1806.03962)**\n\n同时，也欢迎引用原始的 Camelyon16 数据集论文：\n\n**[2] Ehteshami Bejnordi 等人. 深度学习算法在检测乳腺癌女性患者淋巴结转移中的诊断评估. JAMA：美国医学会杂志，318(22), 2199–2210. [doi:jama.2017.14585](https:\u002F\u002Fdoi.org\u002F10.1001\u002Fjama.2017.14585)**\n\n\nBiblatex 条目：\n```bibtex\n@ARTICLE{Veeling2018-qh,\n  title         = \"Rotation Equivariant {CNNs} for Digital Pathology\",\n  author        = \"Veeling, Bastiaan S and Linmans, Jasper and Winkens, Jim and\n                   Cohen, Taco and Welling, Max\",\n  month         =  jun,\n  year          =  2018,\n  archivePrefix = \"arXiv\",\n  primaryClass  = \"cs.CV\",\n  eprint        = \"1806.03962\"\n}\n```\n\n\u003C!-- [谁在引用 PCam？](https:\u002F\u002Fscholar.google.de\u002Fscholar?hl=en&as_sdt=0%2C5&q=pcam&btnG=&oq=fas) -->\n\n\n## 基准测试\n| 名称  | 参考文献 | 数据增强 | 准确率 | AUC | NLL | FROC* |\n| --- | --- | --- | --- | --- | --- | --- |\n| GDensenet | [1] | 按照 Liu 等人的方法 | 89.8 | 96.3 |  0.260 |75.8 (64.3, 87.2)|\n| [添加您的结果](https:\u002F\u002Fgithub.com\u002Fbasveeling\u002Fpcam\u002Fedit\u002Fmaster\u002FREADME.md) | |\n\n\\* 在 Camelyon16 肿瘤检测任务上的表现，并非 PCam 基准测试的一部分。\n\n\n## 贡献\n欢迎提供其他框架的示例脚本！\n\n## 许可证\n数据依据 Camelyon16 的许可协议，采用 [CC0 许可证](https:\u002F\u002Fchoosealicense.com\u002Flicenses\u002Fcc0-1.0\u002F) 提供。\n本仓库其余内容则采用 [MIT 许可证](https:\u002F\u002Fchoosealicense.com\u002Flicenses\u002Fmit\u002F)。\n\n## 致谢\n* Babak Ehteshami Bejnordi、Geert Litjens 和 Jeroen van der Laak 对本数据集配置提供的宝贵建议。\n* README 文档参考自 
[Fashion-MNIST](https:\u002F\u002Fgithub.com\u002Fzalandoresearch\u002Ffashion-mnist)。","# PCam 快速上手指南\n\nPatchCamelyon (PCam) 是一个具有挑战性的医学图像分类基准数据集，包含从淋巴结组织病理学扫描中提取的 327,680 张彩色图像（96x96 像素）。该数据集旨在为机器学习模型提供一个介于 CIFAR-10 和 ImageNet 之间的基准，支持在单张 GPU 上进行训练，适用于转移性组织检测任务。\n\n## 环境准备\n\n*   **系统要求**：Linux, macOS 或 Windows（推荐 Linux 环境）\n*   **硬件要求**：支持 CUDA 的 GPU（推荐显存 ≥ 4GB），单卡即可训练\n*   **前置依赖**：\n    *   Python 3.6+\n    *   TensorFlow \u002F Keras\n    *   `h5py` (用于读取 HDF5 格式数据)\n    *   `pandas` (可选，用于处理元数据)\n\n安装基础依赖：\n```bash\npip install tensorflow h5py pandas\n```\n\n## 安装与数据下载\n\nPCam 数据集以压缩的 HDF5 (`.h5.gz`) 格式存储。由于文件较大，国内开发者推荐使用 **百度 AI Studio** 镜像源进行下载，速度更快且更稳定。\n\n### 方案一：百度 AI Studio 镜像（推荐国内用户）\n访问以下链接获取数据集：\n*   地址：https:\u002F\u002Faistudio.baidu.com\u002Faistudio\u002Fdatasetdetail\u002F30060\n\n### 方案二：官方 Google Drive 或 Zenodo\n如果无法访问上述镜像，可从以下来源下载：\n*   **Zenodo 镜像**: https:\u002F\u002Fzenodo.org\u002Frecord\u002F2546921\n*   **Google Drive**: [查看完整列表](https:\u002F\u002Fdrive.google.com\u002Fdrive\u002Ffolders\u002F1gHou49cA1s5vua2V5L98Lt8TiWA3FrKB?usp=sharing)\n\n**所需核心文件**（下载后需解压 `.gz` 文件）：\n| 文件名 | 内容 | 大小 |\n| :--- | :--- | :--- |\n| `camelyonpatch_level_2_split_train_x.h5` | 训练集图像 | ~6.1 GB |\n| `camelyonpatch_level_2_split_train_y.h5` | 训练集标签 | ~21 KB |\n| `camelyonpatch_level_2_split_valid_x.h5` | 验证集图像 | ~0.8 GB |\n| `camelyonpatch_level_2_split_valid_y.h5` | 验证集标签 | ~3.0 KB |\n\n> **注意**：请确保将 `.h5.gz` 文件解压为 `.h5` 文件后再使用。\n\n## 基本使用\n\n以下是最简单的基于 Keras 的使用示例，展示如何加载数据并进行基础的数据增强训练。\n\n### 1. 
数据加载与预处理\nPCam 数据存储在 HDF5 文件中，可以使用 `keras.utils.HDF5Matrix` 直接读取，无需一次性将所有数据加载到内存中。\n\n```python\nfrom keras.utils import HDF5Matrix\nfrom keras.preprocessing.image import ImageDataGenerator\n\n# 加载训练数据 (确保文件路径正确)\nx_train = HDF5Matrix('camelyonpatch_level_2_split_train_x.h5', 'x')\ny_train = HDF5Matrix('camelyonpatch_level_2_split_train_y.h5', 'y')\n\n# 配置数据增强生成器\n# PCam 图像为 96x96，这里设置了平移和翻转增强\ndatagen = ImageDataGenerator(\n              preprocessing_function=lambda x: x\u002F255., # 归一化到 [0, 1]\n              width_shift_range=4,                     # 水平随机平移\n              height_shift_range=4,                    # 垂直随机平移\n              horizontal_flip=True,                    # 水平随机翻转\n              vertical_flip=True)                      # 垂直随机翻转\n```\n\n### 2. 模型训练示例\n假设你已经定义好了一个 Keras 模型（例如简单的 CNN 或 DenseNet），可以使用 `fit_generator` 开始训练。\n\n```python\n# 假设 model 已经定义并编译完成\n# batch_size 根据显存大小调整，例如 64 或 128\nbatch_size = 64\n\nmodel.fit_generator(\n    datagen.flow(x_train, y_train, batch_size=batch_size),\n    steps_per_epoch=len(x_train) \u002F\u002F batch_size,\n    epochs=10,  # 建议训练更多轮次以达到基准性能\n    validation_data=(\n        HDF5Matrix('camelyonpatch_level_2_split_valid_x.h5', 'x'),\n        HDF5Matrix('camelyonpatch_level_2_split_valid_y.h5', 'y')\n    )\n)\n```\n\n### 关键说明\n*   **标签含义**：标签为二分类（0 或 1）。`1` 表示图像中心 32x32 像素区域内包含至少一个肿瘤组织像素；`0` 表示无肿瘤。\n*   **数据平衡**：训练集、验证集和测试集均保持 50\u002F50 的正负样本平衡。\n*   **输入形状**：图像尺寸为 `(96, 96, 3)`。","某医疗 AI 初创团队正在研发淋巴结转移癌自动筛查算法，急需一个既能反映真实病理特征，又能在单张显卡上快速验证模型原型的基准数据集。\n\n### 没有 pcam 时\n- **数据门槛过高**：直接使用完整的 Camelyon16 全切片图像（WSI）需要巨大的显存和复杂的预处理流程，新手难以在几小时内跑通第一个模型。\n- **评估标准缺失**：缺乏介于 CIFAR-10 简单自然图像与 ImageNet 超大规模数据之间的医学专用基准，导致无法公平对比不同架构在病理场景下的真实性能。\n- **研发迭代缓慢**：训练周期长达数天，严重阻碍了针对“主动学习”或“模型不确定性”等前沿方向的快速实验与调优。\n- **临床相关性弱**：使用非医学通用数据集训练的模型，往往难以迁移到真实的肿瘤检测任务中，造成学术指标与临床应用脱节。\n\n### 使用 pcam 后\n- **开箱即用的高效训练**：利用 pcam 提供的 32 万张 96x96 标准化病理补丁，团队可在单张 GPU 上仅需数小时即可完成模型训练并达到竞争性分数。\n- **确立行业对标基准**：作为专为机器学习设计的分类基准，pcam 
让团队能立即将自研模型与全球最新成果进行量化对比，明确技术差距。\n- **加速前沿探索**：适中的任务难度与数据规模，使得研究人员能快速验证关于可解释性和不确定性的新算法，大幅缩短从想法到验证的周期。\n- **无缝衔接临床任务**：由于数据直接源自真实的淋巴结组织切片且标注精准，基于 pcam 优化的模型能更平滑地迁移至实际的癌症辅助诊断系统中。\n\npcam 成功填补了通用图像分类与复杂医疗影像之间的空白，让高精度的癌症检测模型研发变得像训练 CIFAR-10 一样高效且可及。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fbasveeling_pcam_909c1741.png","basveeling","Bas Veeling","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fbasveeling_6d25897c.jpg",null,"AI4Science @ Microsoft Research","Amsterdam, The Netherlands","https:\u002F\u002Fgithub.com\u002Fbasveeling",[80],{"name":81,"color":82,"percentage":83},"Python","#3572A5",100,519,108,"2026-04-02T07:55:14","NOASSERTION","未说明","非必需，但推荐用于训练（README 提及可在单张 GPU 上训练），具体型号、显存及 CUDA 版本未说明","未说明（数据集文件较大，训练集约 6.1GB，建议具备足够内存以加载 HDF5 数据）",{"notes":92,"python":88,"dependencies":93},"该工具主要是一个数据集基准（PatchCamelyon），而非独立的软件包。数据以压缩的 HDF5 格式存储，下载后需解压。官方示例代码基于 Keras 框架。训练集图像数据约为 6.1GB，验证集和测试集各约 0.8GB。任务为二分类图像分类，图像尺寸为 96x96 像素。",[94,95,96],"keras","h5py (隐含，用于读取 HDF5Matrix)","numpy (隐含)",[16,14,98],"其他",[100,101,102,103,104],"dataset","deep-learning","benchmark","pathology","deep-learning-datasets","2026-03-27T02:49:30.150509","2026-04-09T23:49:01.860064",[108,113,118,123,128,133],{"id":109,"question_zh":110,"answer_zh":111,"source_url":112},26813,"数据集下载链接失效或服务器无法访问怎么办？","如果原始服务器无法访问，可以使用以下两个镜像源进行下载：\n1. Academic Torrents: http:\u002F\u002Facademictorrents.com\u002Fdetails\u002F1561a180b11d4b746273b5ce46772ad36f1229b6\n2. 
Zenodo: https:\u002F\u002Fzenodo.org\u002Frecord\u002F1494286\nZenodo 提供了足够的数据限额和直接链接支持。","https:\u002F\u002Fgithub.com\u002Fbasveeling\u002Fpcam\u002Fissues\u002F3",{"id":114,"question_zh":115,"answer_zh":116,"source_url":117},26814,"在中国地区下载速度太慢或无法连接 Google Drive\u002FZenodo 怎么办？","针对中国开发者，数据集已重新分发到百度 AI  studio，可以通过以下链接高速下载：\nhttps:\u002F\u002Faistudio.baidu.com\u002Faistudio\u002Fdatasetdetail\u002F30060","https:\u002F\u002Fgithub.com\u002Fbasveeling\u002Fpcam\u002Fissues\u002F9",{"id":119,"question_zh":120,"answer_zh":121,"source_url":122},26815,"如何验证下载文件的完整性（MD5 校验和）？","正确的文件 MD5 校验和如下（注意不要混淆文件名）：\n- camelyonpatch_level_2_split_train_x.h5.gz: 1571f514728f59376b705fc836ff4b63\n- camelyonpatch_level_2_split_train_y.h5.gz: 35c2d7259d906cfc8143347bb8e05be7\n- camelyonpatch_level_2_split_valid_x.h5.gz: d5b63470df7cfa627aeec8b9dc0c066e\n- camelyonpatch_level_2_split_valid_y.h5.gz: 2b85f58b927af9964a4c15b8f7e8f179\n- camelyonpatch_level_2_split_test_x.h5.gz: d8c2d60d490dbd479f8199bdfa0cf6ec\n- camelyonpatch_level_2_split_test_y.h5.gz: 60a7035772fbdb7f34eb86d4420cf66a\n如果文档中的顺序有误，请以实际运行 md5sum 命令的结果为准。","https:\u002F\u002Fgithub.com\u002Fbasveeling\u002Fpcam\u002Fissues\u002F1",{"id":124,"question_zh":125,"answer_zh":126,"source_url":127},26816,"Academic Torrents 下载方式是否可靠？如果被防火墙阻止怎么办？","Academic Torrents 支持 Webseed 功能。即使 BitTorrent 流量被防火墙阻止，大多数 torrent 客户端也会自动切换到从 Web 服务器下载数据。上传者可以动态管理这些 Web seed 位置作为备用托管方案，因此这是一种去中心化且相对可靠的下载方式。","https:\u002F\u002Fgithub.com\u002Fbasveeling\u002Fpcam\u002Fissues\u002F2",{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},26817,"运行代码时出现 'Unable to open object (object \\'y\\' doesn\\'t exist)' 错误是什么原因？","该错误通常意味着 HDF5 文件中缺少预期的数据集对象 'y'。在 PCam 数据集中，'x' 代表图像数据补丁，'y' 代表对应的标签数据。请确保您下载的文件完整且未损坏，并检查代码中加载 HDF5 文件的路径是否正确指向了包含 'x' 和 'y' 键的文件（通常分为 _x.h5.gz 和 _y.h5.gz 文件）。","https:\u002F\u002Fgithub.com\u002Fbasveeling\u002Fpcam\u002Fissues\u002F7",{"id":134,"question_zh":135,"answer_zh":136,"source_url":137},26818,"为什么无法使用 wget 命令从 
Google Drive 下载数据？","Google Drive 对直接使用 wget 等命令行工具下载大文件有限制（通常需要确认病毒扫描）。建议改用提供的其他直接下载链接，如 Zenodo (https:\u002F\u002Fzenodo.org\u002Frecord\u002F1494286) 或 Academic Torrents，这些源支持直接的命令行下载且更稳定。","https:\u002F\u002Fgithub.com\u002Fbasveeling\u002Fpcam\u002Fissues\u002F14",[]]