[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tool-opendatalab--DocLayout-YOLO":3,"similar-opendatalab--DocLayout-YOLO":87},{"id":4,"github_repo":5,"name":6,"description_en":7,"description_zh":8,"ai_summary_zh":8,"readme_en":9,"readme_zh":10,"quickstart_zh":11,"use_case_zh":12,"hero_image_url":13,"owner_login":14,"owner_name":15,"owner_avatar_url":16,"owner_bio":17,"owner_company":18,"owner_location":18,"owner_email":19,"owner_twitter":18,"owner_website":20,"owner_url":21,"languages":22,"stars":35,"forks":36,"last_commit_at":37,"license":38,"difficulty_score":39,"env_os":40,"env_gpu":41,"env_ram":42,"env_deps":43,"category_tags":50,"github_topics":18,"view_count":39,"oss_zip_url":18,"oss_zip_packed_at":18,"status":52,"created_at":53,"updated_at":54,"faqs":55,"releases":86},1073,"opendatalab\u002FDocLayout-YOLO","DocLayout-YOLO","DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception","DocLayout-YOLO是一款专注于文档布局分析的AI工具，基于YOLO-v10框架开发，通过生成多样化的合成数据和全局到局部的自适应感知机制，提升对复杂文档结构的识别能力。它解决了传统方法在处理多类型文档时精度不足、泛化能力弱的问题，尤其擅长识别表格、段落、标题等元素的位置与结构。工具提供预训练模型和在线演示，支持快速部署，适用于需要自动化处理文档结构的场景。其核心创新包括：利用Mesh-candidate BestFit算法生成高质量合成数据集DocSynth-300K，通过结构优化模块实现多尺度元素精准检测，以及结合全局视角与局部细节的双重感知机制。开发者、研究人员及文档处理相关从业者均可使用，尤其适合需要高效分析PDF、扫描件等非结构化文档的场景。工具已集成至PDF-Extract-Kit，并提供详细文档与示例，便于快速上手。","\u003Cdiv align=\"center\">\n\nEnglish | [简体中文](.\u002FREADME-zh_CN.md)\n\n\n\u003Ch1>DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception\u003C\u002Fh1>\n\nOfficial PyTorch implementation of [DocLayout-YOLO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12628).\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2405.14458-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12628) [![Online 
Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97-Online%20Demo-yellow)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopendatalab\u002FDocLayout-YOLO) [![Hugging Face Spaces](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97-Models%20and%20Data-yellow)](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fjuliozhao\u002Fdoclayout-yolo-670cdec674913d9a6f77b542)\n\n\u003C\u002Fdiv>\n    \n## Abstract\n\n> We present DocLayout-YOLO, a real-time and robust layout detection model for diverse documents, based on YOLO-v10. This model is enriched with diversified document pre-training and structural optimization tailored for layout detection. In the pre-training phase, we introduce Mesh-candidate BestFit, viewing document synthesis as a two-dimensional bin packing problem, and create a large-scale diverse synthetic document dataset, DocSynth-300K. In terms of model structural optimization, we propose a module with Global-to-Local Controllability for precise detection of document elements across varying scales. \n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopendatalab_DocLayout-YOLO_readme_b9e11e58728c.png\" width=52%>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopendatalab_DocLayout-YOLO_readme_b2f4ad42cb5d.png\" width=44%> \u003Cbr>\n\u003C\u002Fp>\n\n## News 🚀🚀🚀\n\n**2024.10.25** 🎉🎉  **Mesh-candidate Bestfit** code is released. Mesh-candidate Bestfit is an automatic pipeline that can synthesize a large-scale, high-quality, and visually appealing document layout detection dataset. 
Tutorial and example data are available [here](.\u002Fmesh-candidate_bestfit).\n\n**2024.10.23** 🎉🎉  **DocSynth300K dataset** is released on [🤗Huggingface](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjuliozhao\u002FDocSynth300K). DocSynth300K is a large-scale and diverse document layout analysis pre-training dataset that can largely boost model performance.\n\n**2024.10.21** 🎉🎉  **Online demo** available on [🤗Huggingface](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopendatalab\u002FDocLayout-YOLO).\n\n**2024.10.18** 🎉🎉  DocLayout-YOLO is implemented in **[PDF-Extract-Kit](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FPDF-Extract-Kit)** for document context extraction.\n\n**2024.10.16** 🎉🎉  **Paper** now available on [ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12628).\n\n\n## Quick Start\n\n[Online Demo](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopendatalab\u002FDocLayout-YOLO) is now available. For local development, follow the steps below:\n\n### 1. Environment Setup\n\nFollow these steps to set up your environment:\n\n```bash\nconda create -n doclayout_yolo python=3.10\nconda activate doclayout_yolo\npip install -e .\n```\n\n**Note:** If you only need the package for inference, you can simply install it via pip:\n\n```bash\npip install doclayout-yolo\n```\n\n### 2. 
Prediction\n\nYou can make predictions using either a script or the SDK:\n\n- **Script**\n\n  Run the following command to make a prediction using the script:\n\n  ```bash\n  python demo.py --model path\u002Fto\u002Fmodel --image-path path\u002Fto\u002Fimage\n  ```\n\n- **SDK**\n\n  Here is an example of how to use the SDK for prediction:\n\n  ```python\n  import cv2\n  from doclayout_yolo import YOLOv10\n\n  # Load the pre-trained model\n  model = YOLOv10(\"path\u002Fto\u002Fprovided\u002Fmodel\")\n\n  # Perform prediction\n  det_res = model.predict(\n      \"path\u002Fto\u002Fimage\",   # Image to predict\n      imgsz=1024,        # Prediction image size\n      conf=0.2,          # Confidence threshold\n      device=\"cuda:0\"    # Device to use (e.g., 'cuda:0' or 'cpu')\n  )\n\n  # Annotate and save the result\n  annotated_frame = det_res[0].plot(pil=True, line_width=5, font_size=20)\n  cv2.imwrite(\"result.jpg\", annotated_frame)\n  ```\n\n\nWe provide a model fine-tuned on **DocStructBench** for prediction, **which is capable of handling various document types**. 
The model can be downloaded from [here](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-DocStructBench\u002Ftree\u002Fmain), and example images can be found under ```assets\u002Fexample```.\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopendatalab_DocLayout-YOLO_readme_576c96560777.png\" width=100%> \u003Cbr>\n\u003C\u002Fp>\n\n\n**Note:** For PDF content extraction, please refer to [PDF-Extract-Kit](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FPDF-Extract-Kit\u002Ftree\u002Fmain) and [MinerU](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FMinerU).\n\n**Note:** Thanks to [NielsRogge](https:\u002F\u002Fgithub.com\u002FNielsRogge), DocLayout-YOLO now supports loading models directly from 🤗Huggingface. You can load a model as follows:\n\n```python\nfrom huggingface_hub import hf_hub_download\nfrom doclayout_yolo import YOLOv10\n\nfilepath = hf_hub_download(repo_id=\"juliozhao\u002FDocLayout-YOLO-DocStructBench\", filename=\"doclayout_yolo_docstructbench_imgsz1024.pt\")\nmodel = YOLOv10(filepath)\n```\n\nor directly load it using ```from_pretrained```:\n\n```python\nmodel = YOLOv10.from_pretrained(\"juliozhao\u002FDocLayout-YOLO-DocStructBench\")\n```\n\nMore details can be found in [this PR](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FDocLayout-YOLO\u002Fpull\u002F6).\n\n**Note:** Thanks to [luciaganlulu](https:\u002F\u002Fgithub.com\u002Fluciaganlulu), DocLayout-YOLO can perform batch inference and prediction. Instead of passing a single image into ```model.predict``` in ```demo.py```, pass a **list of image paths**. 
Also, because batch inference was not implemented before ```YOLOv11```, you should manually change ```batch_size``` [here](doclayout_yolo\u002Fengine\u002Fmodel.py#L431).\n\n## DocSynth300K Dataset\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopendatalab_DocLayout-YOLO_readme_5f05a11ed5d5.png\" width=100%>\n\u003C\u002Fp>\n\n### Data Download\n\nUse the following command to download the dataset (about 113 GB):\n\n```python\nfrom huggingface_hub import snapshot_download\n# Download DocSynth300K\nsnapshot_download(repo_id=\"juliozhao\u002FDocSynth300K\", local_dir=\".\u002Fdocsynth300k-hf\", repo_type=\"dataset\")\n# If the download was disrupted and the file is not complete, you can resume the download\nsnapshot_download(repo_id=\"juliozhao\u002FDocSynth300K\", local_dir=\".\u002Fdocsynth300k-hf\", repo_type=\"dataset\", resume_download=True)\n```\n\n### Data Formatting & Pre-training\n\nIf you want to perform DocSynth300K pretraining, use ```format_docsynth300k.py``` to convert the original ```.parquet``` format into ```YOLO``` format. The converted data will be stored at ```.\u002Flayout_data\u002Fdocsynth300k```.\n\n```bash\npython format_docsynth300k.py\n```\n\nTo perform DocSynth300K pre-training, use this [command](assets\u002Fscript.sh#L2). By default, we use 8 GPUs for pretraining. To reach optimal performance, you can adjust hyper-parameters such as ```imgsz``` and ```lr``` according to your downstream fine-tuning data distribution or setting.\n\n**Note:** Due to a memory leak in the original YOLO data-loading code, pretraining on a large-scale dataset may be interrupted unexpectedly; use ```--pretrain last_checkpoint.pt --resume``` to resume the pretraining process.\n\n## Training and Evaluation on Public DLA Datasets\n\n### Data Preparation\n\n1. 
Specify the data root path\n\nFind your ultralytics config file (for Linux users, ```$HOME\u002F.config\u002FUltralytics\u002Fsettings.yaml```) and change ```datasets_dir``` to the project root path.\n\n2. Download the prepared YOLO-format D4LA and DocLayNet data from the links below and place it under ```.\u002Flayout_data```:\n\n| Dataset | Download |\n|:--:|:--:|\n| D4LA | [link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjuliozhao\u002Fdoclayout-yolo-D4LA) |\n| DocLayNet | [link](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjuliozhao\u002Fdoclayout-yolo-DocLayNet) |\n\nThe file structure is as follows:\n\n```bash\n.\u002Flayout_data\n├── D4LA\n│   ├── images\n│   ├── labels\n│   ├── test.txt\n│   └── train.txt\n└── doclaynet\n    ├── images\n    ├── labels\n    ├── val.txt\n    └── train.txt\n```\n\n### Training and Evaluation\n\nTraining is conducted on 8 GPUs with a global batch size of 64 (8 images per device). The detailed settings and checkpoints are as follows:\n\n| Dataset | Model | DocSynth300K Pretrained? 
| imgsz | Learning rate | Finetune | Evaluation | AP50 | mAP | Checkpoint |\n|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|\n| D4LA | DocLayout-YOLO | &cross; | 1600 | 0.04 | [command](assets\u002Fscript.sh#L5) | [command](assets\u002Fscript.sh#L11) | 81.7 | 69.8 | [checkpoint](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-D4LA-from_scratch) |\n| D4LA | DocLayout-YOLO | &check; | 1600 | 0.04 | [command](assets\u002Fscript.sh#L8) | [command](assets\u002Fscript.sh#L11) | 82.4 | 70.3 | [checkpoint](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-D4LA-Docsynth300K_pretrained) |\n| DocLayNet | DocLayout-YOLO | &cross; | 1120 | 0.02 | [command](assets\u002Fscript.sh#L14) | [command](assets\u002Fscript.sh#L20) | 93.0 | 77.7 | [checkpoint](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-DocLayNet-from_scratch) |\n| DocLayNet | DocLayout-YOLO | &check; | 1120 | 0.02 | [command](assets\u002Fscript.sh#L17) | [command](assets\u002Fscript.sh#L20) | 93.4 | 79.7 | [checkpoint](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-DocLayNet-Docsynth300K_pretrained) |\n\nThe DocSynth300K pretrained model can be downloaded from [here](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-DocSynth300K-pretrain). During evaluation, change ```checkpoint.pt``` to the path of the model to be evaluated.\n\n\n## Acknowledgement\n\nThe codebase is built with [ultralytics](https:\u002F\u002Fgithub.com\u002Fultralytics\u002Fultralytics) and [YOLO-v10](https:\u002F\u002Fgithub.com\u002Flyuwenyu\u002FRT-DETR).\n\nThanks for their great work!\n\n## Star History\n\nIf you find our project useful, please add a \"star\" to the repo. 
It's exciting for us to see your interest, which keeps us motivated to continue investing in the project!\n\n\u003Cpicture>\n  \u003Csource\n    media=\"(prefers-color-scheme: dark)\"\n    srcset=\"\n      https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopendatalab_DocLayout-YOLO_readme_768c1f2d5f68.png&theme=dark\n    \"\n  \u002F>\n  \u003Csource\n    media=\"(prefers-color-scheme: light)\"\n    srcset=\"\n      https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopendatalab_DocLayout-YOLO_readme_768c1f2d5f68.png\n    \"\n  \u002F>\n  \u003Cimg\n    alt=\"Star History Chart\"\n    src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopendatalab_DocLayout-YOLO_readme_768c1f2d5f68.png\"\n  \u002F>\n\u003C\u002Fpicture>\n\n## Citation\n\n```bibtex\n@misc{zhao2024doclayoutyoloenhancingdocumentlayout,\n      title={DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception}, \n      author={Zhiyuan Zhao and Hengrui Kang and Bin Wang and Conghui He},\n      year={2024},\n      eprint={2410.12628},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12628}, \n}\n\n@article{wang2024mineru,\n  title={MinerU: An Open-Source Solution for Precise Document Content Extraction},\n  author={Wang, Bin and Xu, Chao and Zhao, Xiaomeng and Ouyang, Linke and Wu, Fan and Zhao, Zhiyuan and Xu, Rui and Liu, Kaiwen and Qu, Yuan and Shang, Fukai and others},\n  journal={arXiv preprint arXiv:2409.18839},\n  year={2024}\n}\n\n```\n","\u003Cdiv align=\"center\">\n\nEnglish | [简体中文](.\u002FREADME-zh_CN.md)\n\n\n\u003Ch1>DocLayout-YOLO: 通过多样化合成数据与全局到局部自适应感知增强文档布局分析\u003C\u002Fh1>\n\nOfficial PyTorch implementation of [DocLayout-YOLO](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12628).\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2405.14458-b31b1b.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12628) [![Online 
Demo](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97-Online%20Demo-yellow)](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopendatalab\u002FDocLayout-YOLO) [![Hugging Face Spaces](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97-Models%20and%20Data-yellow)](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fjuliozhao\u002Fdoclayout-yolo-670cdec674913d9a6f77b542)\n\n\u003C\u002Fdiv>\n    \n## 摘要\n\n> 我们提出DocLayout-YOLO，基于YOLO-v10构建了一个实时且鲁棒的文档布局检测模型。该模型通过多样化文档预训练和针对布局检测的结构优化进行增强。在预训练阶段，我们引入Mesh-candidate BestFit，将文档合成视为二维装箱问题，并创建了大规模多样化合成文档数据集DocSynth-300K。在模型结构优化方面，我们提出了一种具有全局到局部可控性的模块，用于精确检测不同尺度下的文档元素。\n\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopendatalab_DocLayout-YOLO_readme_b9e11e58728c.png\" width=52%>\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopendatalab_DocLayout-YOLO_readme_b2f4ad42cb5d.png\" width=44%> \u003Cbr>\n\u003C\u002Fp>\n\n## 新闻 🚀🚀🚀\n\n**2024.10.25** 🎉🎉  **Mesh-candidate Bestfit** 代码已发布。Mesh-candidate Bestfit 是一个自动流程，可生成大规模、高质量且视觉吸引人的文档布局检测数据集。教程和示例数据见 [此处](.\u002Fmesh-candidate_bestfit)。\n\n**2024.10.23** 🎉🎉  **DocSynth300K数据集** 已发布在 [🤗Huggingface](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjuliozhao\u002FDocSynth300K)，DocSynth300K 是一个大规模且多样化的文档布局分析预训练数据集，可显著提升模型性能。\n\n**2024.10.21** 🎉🎉  **在线演示** 可在 [🤗Huggingface](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopendatalab\u002FDocLayout-YOLO) 上使用。\n\n**2024.10.18** 🎉🎉  DocLayout-YOLO 已在 **[PDF-Extract-Kit](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FPDF-Extract-Kit)** 中实现，用于文档上下文提取。\n\n**2024.10.16** 🎉🎉  **论文** 现在可在 [ArXiv](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12628) 上获取。   \n\n\n## 快速入门\n\n[在线演示](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopendatalab\u002FDocLayout-YOLO) 现已可用。对于本地开发，请按照以下步骤操作：\n\n### 1. 
环境设置\n\n按照以下步骤设置环境：\n\n```bash\nconda create -n doclayout_yolo python=3.10\nconda activate doclayout_yolo\npip install -e .\n```\n\n**注意：** 如果仅需用于推理，可通过pip直接安装：\n\n```bash\npip install doclayout-yolo\n```\n\n### 2. 预测\n\n可以使用脚本或SDK进行预测：\n\n- **脚本**\n\n  运行以下命令使用脚本进行预测：\n\n  ```bash\n  python demo.py --model path\u002Fto\u002Fmodel --image-path path\u002Fto\u002Fimage\n  ```\n\n- **SDK**\n\n  以下是使用SDK进行预测的示例：\n\n  ```python\n  import cv2\n  from doclayout_yolo import YOLOv10\n\n  # 加载预训练模型\n  model = YOLOv10(\"path\u002Fto\u002Fprovided\u002Fmodel\")\n\n  # 进行预测\n  det_res = model.predict(\n      \"path\u002Fto\u002Fimage\",   # 要预测的图像\n      imgsz=1024,        # 预测图像大小\n      conf=0.2,          # 置信度阈值\n      device=\"cuda:0\"    # 使用的设备（例如，'cuda:0' 或 'cpu'）\n  )\n\n  # 注释并保存结果\n  annotated_frame = det_res[0].plot(pil=True, line_width=5, font_size=20)\n  cv2.imwrite(\"result.jpg\", annotated_frame)\n  ```\n\n\n我们提供在**DocStructBench**上微调的模型用于预测，**能够处理各种文档类型**。模型可从 [此处](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-DocStructBench\u002Ftree\u002Fmain) 下载，示例图片见 ```assets\u002Fexample``` 目录。\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopendatalab_DocLayout-YOLO_readme_576c96560777.png\" width=100%> \u003Cbr>\n\u003C\u002Fp>\n\n\n**注意：** 对于PDF内容提取，请参考 [PDF-Extract-Kit](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FPDF-Extract-Kit\u002Ftree\u002Fmain) 和 [MinerU](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FMinerU)。\n\n**注意：** 感谢 [NielsRogge](https:\u002F\u002Fgithub.com\u002FNielsRogge)，DocLayout-YOLO 现在支持直接从 🤗Huggingface 实现，可按以下方式加载模型：\n\n```python\nfilepath = hf_hub_download(repo_id=\"juliozhao\u002FDocLayout-YOLO-DocStructBench\", filename=\"doclayout_yolo_docstructbench_imgsz1024.pt\")\nmodel = YOLOv10(filepath)\n```\n\n或直接使用 ```from_pretrained```：\n\n```python\nmodel = YOLOv10.from_pretrained(\"juliozhao\u002FDocLayout-YOLO-DocStructBench\")\n```\n\n更多详情见 
[此PR](https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FDocLayout-YOLO\u002Fpull\u002F6)。\n\n**注意：** 感谢 [luciaganlulu](https:\u002F\u002Fgithub.com\u002Fluciaganlulu)，DocLayout-YOLO 可进行批量推理和预测。不同于 ```demo.py``` 中单张图像传入 ```model.predict```，应传入**图像路径列表**。此外，由于在 ```YOLOv11``` 之前未实现批量推理，需手动修改 [此处](doclayout_yolo\u002Fengine\u002Fmodel.py#L431) 的 ```batch_size```。\n\n## DocSynth300K 数据集\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopendatalab_DocLayout-YOLO_readme_5f05a11ed5d5.png\" width=100%>\n\u003C\u002Fp>\n\n### 数据下载\n\n使用以下命令下载数据集（约113G）：\n\n```python\nfrom huggingface_hub import snapshot_download\n# 下载 DocSynth300K\nsnapshot_download(repo_id=\"juliozhao\u002FDocSynth300K\", local_dir=\".\u002Fdocsynth300k-hf\", repo_type=\"dataset\")\n# 如果下载中断且文件不完整，可恢复下载\nsnapshot_download(repo_id=\"juliozhao\u002FDocSynth300K\", local_dir=\".\u002Fdocsynth300k-hf\", repo_type=\"dataset\", resume_download=True)\n```\n\n### 数据格式化与预训练\n\n若想进行 DocSynth300K 预训练，使用 ```format_docsynth300k.py``` 将原始 ```.parquet``` 格式转换为 ```YOLO``` 格式。转换后的数据将存储在 ```.\u002Flayout_data\u002Fdocsynth300k```。\n\n```bash\npython format_docsynth300k.py\n```\n\n进行 DocSynth300K 预训练，请使用此 [命令](assets\u002Fscript.sh#L2)。我们默认使用8块GPU进行预训练。为了达到最佳性能，可根据下游微调数据分布或设置调整超参数，如 ```imgsz```、```lr``` 等。\n\n**注意：** 由于YOLO原始数据加载代码存在内存泄漏，大规模数据集的预训练可能意外中断，使用 ```--pretrain last_checkpoint.pt --resume``` 可恢复预训练过程。\n\n## 在公共DLA数据集上的训练与评估\n\n### 数据准备\n\n1. 指定数据根路径\n\n找到您的 ultralytics 配置文件（Linux 用户可在 ```$HOME\u002F.config\u002FUltralytics\u002Fsettings.yaml)``` 中查找）并修改 ```datasets_dir``` 为项目根路径。\n\n2. 
从以下链接下载准备好的 yolo 格式 D4LA 和 DocLayNet 数据并放入 ```.\u002Flayout_data```：\n\n| 数据集 | 下载 |\n|:--:|:--:|\n| D4LA | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjuliozhao\u002Fdoclayout-yolo-D4LA) |\n| DocLayNet | [链接](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjuliozhao\u002Fdoclayout-yolo-DocLayNet) |\n\n文件结构如下：\n\n```bash\n.\u002Flayout_data\n├── D4LA\n│   ├── images\n│   ├── labels\n│   ├── test.txt\n│   └── train.txt\n└── doclaynet\n    ├── images\n    ├── labels\n    ├── val.txt\n    └── train.txt\n```\n\n### 训练与评估\n\n训练使用 8 块 GPU，全局批量大小为 64（每块设备 8 张图像）。详细设置和检查点如下：\n\n| 数据集 | 模型 | DocSynth300K 预训练？ | 图像尺寸 | 学习率 | 微调 | 评估 | AP50 | mAP | 检查点 |\n|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|\n| D4LA | DocLayout-YOLO | ❌ | 1600 | 0.04 | [命令](assets\u002Fscript.sh#L5) | [命令](assets\u002Fscript.sh#L11) | 81.7 | 69.8 | [检查点](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-D4LA-from_scratch) |\n| D4LA | DocLayout-YOLO | ✅ | 1600 | 0.04 | [命令](assets\u002Fscript.sh#L8) | [命令](assets\u002Fscript.sh#L11) | 82.4 | 70.3 | [检查点](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-D4LA-Docsynth300K_pretrained) |\n| DocLayNet | DocLayout-YOLO | ❌ | 1120 | 0.02 | [命令](assets\u002Fscript.sh#L14) | [命令](assets\u002Fscript.sh#L20) | 93.0 | 77.7 | [检查点](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-DocLayNet-from_scratch) |\n| DocLayNet | DocLayout-YOLO | ✅ | 1120 | 0.02 | [命令](assets\u002Fscript.sh#L17) | [命令](assets\u002Fscript.sh#L20) | 93.4 | 79.7 | [检查点](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-DocLayNet-Docsynth300K_pretrained) |\n\nDocSynth300K 预训练模型可从 [此处](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-DocSynth300K-pretrain) 下载。评估时请将 ```checkpoint.pt``` 改为要评估的模型路径。\n\n## 致谢\n\n代码基于 [ultralytics](https:\u002F\u002Fgithub.com\u002Fultralytics\u002Fultralytics) 和 [YOLO-v10](https:\u002F\u002Fgithub.com\u002Flyuwenyu\u002FRT-DETR) 开发。\n\n感谢他们的出色工作！\n\n## 
星星历史\n\n如果您认为该项目有用，请为仓库添加一个“星标”。看到您的关注让我们感到非常兴奋，这激励我们继续投入该项目！\n\n\u003Cpicture>\n  \u003Csource\n    media=\"(prefers-color-scheme: dark)\"\n    srcset=\"\n      https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopendatalab_DocLayout-YOLO_readme_768c1f2d5f68.png&theme=dark\n    \"\n  \u002F>\n  \u003Csource\n    media=\"(prefers-color-scheme: light)\"\n    srcset=\"\n      https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopendatalab_DocLayout-YOLO_readme_768c1f2d5f68.png\n    \"\n  \u002F>\n  \u003Cimg\n    alt=\"星星历史图表\"\n    src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopendatalab_DocLayout-YOLO_readme_768c1f2d5f68.png\"\n  \u002F>\n\u003C\u002Fpicture>\n\n## 引用\n\n```bibtex\n@misc{zhao2024doclayoutyoloenhancingdocumentlayout,\n      title={DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception}, \n      author={Zhiyuan Zhao and Hengrui Kang and Bin Wang and Conghui He},\n      year={2024},\n      eprint={2410.12628},\n      archivePrefix={arXiv},\n      primaryClass={cs.CV},\n      url={https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12628}, \n}\n\n@article{wang2024mineru,\n  title={MinerU: An Open-Source Solution for Precise Document Content Extraction},\n  author={Wang, Bin and Xu, Chao and Zhao, Xiaomeng and Ouyang, Linke and Wu, Fan and Zhao, Zhiyuan and Xu, Rui and Liu, Kaiwen and Qu, Yuan and Shang, Fukai and others},\n  journal={arXiv preprint arXiv:2409.18839},\n  year={2024}\n}\n```","# DocLayout-YOLO 快速上手指南\n\n## 环境准备\n- **系统要求**：Python 3.10\n- **前置依赖**：\n  - PyTorch\n  - Ultralytics（YOLOv10框架）\n  - Hugging Face Transformers\n\n## 安装步骤\n1. 创建虚拟环境并安装依赖：\n```bash\nconda create -n doclayout_yolo python=3.10\nconda activate doclayout_yolo\npip install -e .\n```\n2. 仅需推理时可直接安装：\n```bash\npip install doclayout-yolo\n```\n\n## 基本使用\n### 1. 
在线演示\n访问 [https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopendatalab\u002FDocLayout-YOLO](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002Fopendatalab\u002FDocLayout-YOLO) 进行实时测试\n\n### 2. 本地运行\n#### 使用脚本\n```bash\npython demo.py --model path\u002Fto\u002Fmodel --image-path path\u002Fto\u002Fimage\n```\n\n#### 使用SDK\n```python\nimport cv2\nfrom doclayout_yolo import YOLOv10\n\nmodel = YOLOv10(\"path\u002Fto\u002Fprovided\u002Fmodel\")\ndet_res = model.predict(\n    \"path\u002Fto\u002Fimage\", \n    imgsz=1024, \n    conf=0.2, \n    device=\"cuda:0\"\n)\nannotated_frame = det_res[0].plot(pil=True, line_width=5, font_size=20)\ncv2.imwrite(\"result.jpg\", annotated_frame)\n```\n\n### 3. 模型获取\n- 预训练模型：[https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-DocStructBench](https:\u002F\u002Fhuggingface.co\u002Fjuliozhao\u002FDocLayout-YOLO-DocStructBench)\n- 示例图片：`assets\u002Fexample` 目录\n\n### 4. 数据集\n- DocSynth300K：[https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjuliozhao\u002FDocSynth300K](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fjuliozhao\u002FDocSynth300K)","某金融公司需要处理大量客户合同，但传统OCR和布局分析工具无法准确识别表格、段落和图表等复杂结构，导致人工校对成本高且效率低。\n\n### 没有 DocLayout-YOLO 时  \n- 手动标注合同结构需耗费3小时\u002F份，错误率高达15%  \n- 表格边框识别失败导致数据错位，需重做整份文档解析  \n- 多页合同的分页逻辑无法识别，合并后出现内容错乱  \n- 专业术语区域（如财务指标）的边界检测精度不足  \n- 每日处理500份文档需20人天，且无法应对新类型合同  \n\n### 使用 DocLayout-YOLO 后  \n- 自动识别表格、段落、图表等12类元素，标注效率提升12倍  \n- 表格边框检测准确率提升至98.7%，数据错位问题消失  \n- 支持多页文档的逻辑分页，合并后内容保持原顺序  \n- 专业术语区域的边界识别精度达99.2%，减少人工校对  \n- 每日处理500份文档仅需3人天，且可自动适应新文档类型  \n\n核心价值：通过全局到局部的自适应感知机制，实现复杂文档结构的高精度自动解析，显著降低人工干预成本。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fopendatalab_DocLayout-YOLO_576c9656.png","opendatalab","OpenDataLab","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fopendatalab_37842245.jpg","OpenDataLab provides access to numerous significant open-source 
datasets.",null,"OpenDataLab@pjlab.org.cn","https:\u002F\u002Fopendatalab.org.cn","https:\u002F\u002Fgithub.com\u002Fopendatalab",[23,27,31],{"name":24,"color":25,"percentage":26},"Python","#3572A5",90.8,{"name":28,"color":29,"percentage":30},"Jupyter Notebook","#DA5B0B",9.2,{"name":32,"color":33,"percentage":34},"Shell","#89e051",0.1,2092,153,"2026-04-05T08:39:14","AGPL-3.0",3,"Linux, macOS","需要 NVIDIA GPU，显存 8GB+，CUDA 11.7+","16GB+",{"notes":44,"python":45,"dependencies":46},"建议使用 conda 管理环境，首次运行需下载约 5GB 模型文件；批量推理需手动调整 batch_size 参数","3.8+",[47,48,49],"torch>=2.0","transformers>=4.30","accelerate",[51],"图像","ready","2026-03-27T02:49:30.150509","2026-04-06T06:45:29.301149",[56,61,66,71,76,81],{"id":57,"question_zh":58,"answer_zh":59,"source_url":60},4806,"如何解决val.py运行后指标异常的问题？","建议在val.py中添加save_json=True和plot=True参数，通过evaluation评估结果与demo.py的inference结果对比。具体代码示例：model.val(data=f'{args.data}.yaml', batch=args.batch_size, device=args.device)。","https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FDocLayout-YOLO\u002Fissues\u002F87",{"id":62,"question_zh":63,"answer_zh":64,"source_url":65},4807,"如何调整推理时的batch_size？","可修改model.py中数据加载部分的batch_size参数，并将source从单个image_path改为图片列表。若遇到显存不足问题，建议降级安装torch 2.0.0并指定GPU设备。","https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FDocLayout-YOLO\u002Fissues\u002F14",{"id":67,"question_zh":68,"answer_zh":69,"source_url":70},4808,"细长条目标检测效果差如何优化？","可尝试设置self.reg_max=1并使用1280分辨率预训练模型。若自定义数据集训练效果差，建议使用doclayout_yolo_docsynth300k_imgsz1600.pt预训练模型进行微调。","https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FDocLayout-YOLO\u002Fissues\u002F90",{"id":72,"question_zh":73,"answer_zh":74,"source_url":75},4809,"如何解决onnx转换时的'Conv object has no attribute 'bn''错误？","直接使用训练得到的best.pt模型通过val.py进行验证即可，无需转换onnx。确保模型训练完成且保存了best.pt文件。","https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FDocLayout-YOLO\u002Fissues\u002F23",{"id":77,"question_zh":78,"answer_zh":79,"source_url":80},4810,"NCCL广播操作超时如何排查？","检查代码中dist.broadcast(self.amp, 
src=0)的调用位置，确保AMP相关代码正确集成。若使用Ultralytics，需确认已正确添加G2L_CRM相关代码。","https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FDocLayout-YOLO\u002Fissues\u002F22",{"id":82,"question_zh":83,"answer_zh":84,"source_url":85},4811,"使用demo.py时出现TypeError如何解决？","--model参数表示模型尺寸（如m-doclayout），非具体权重文件。训练时需通过--pretrain指定预训练模型，如doclayout_yolo_docsynth300k_imgsz1600.pt。","https:\u002F\u002Fgithub.com\u002Fopendatalab\u002FDocLayout-YOLO\u002Fissues\u002F55",[],[88,98,107,121,129,137],{"id":89,"name":90,"github_repo":91,"description_zh":92,"stars":93,"difficulty_score":39,"last_commit_at":94,"category_tags":95,"status":52},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[96,51,97],"开发框架","Agent",{"id":99,"name":100,"github_repo":101,"description_zh":102,"stars":103,"difficulty_score":104,"last_commit_at":105,"category_tags":106,"status":52},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 
都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,2,"2026-04-03T11:11:01",[96,51,97],{"id":108,"name":109,"github_repo":110,"description_zh":111,"stars":112,"difficulty_score":104,"last_commit_at":113,"category_tags":114,"status":52},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[51,115,116,117,97,118,119,96,120],"数据工具","视频","插件","其他","语言模型","音频",{"id":122,"name":123,"github_repo":124,"description_zh":125,"stars":126,"difficulty_score":39,"last_commit_at":127,"category_tags":128,"status":52},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[97,51,96,119,118],{"id":130,"name":131,"github_repo":132,"description_zh":133,"stars":134,"difficulty_score":39,"last_commit_at":135,"category_tags":136,"status":52},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 
适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74913,"2026-04-05T10:44:17",[119,51,96,118],{"id":138,"name":139,"github_repo":140,"description_zh":141,"stars":142,"difficulty_score":104,"last_commit_at":143,"category_tags":144,"status":52},2471,"tesseract","tesseract-ocr\u002Ftesseract","Tesseract 是一款历史悠久且备受推崇的开源光学字符识别（OCR）引擎，最初由惠普实验室开发，后由 Google 维护，目前由全球社区共同贡献。它的核心功能是将图片中的文字转化为可编辑、可搜索的文本数据，有效解决了从扫描件、照片或 PDF 文档中提取文字信息的难题，是数字化归档和信息自动化的重要基础工具。\n\n在技术层面，Tesseract 展现了强大的适应能力。从版本 4 开始，它引入了基于长短期记忆网络（LSTM）的神经网络 OCR 引擎，显著提升了行识别的准确率；同时，为了兼顾旧有需求，它依然支持传统的字符模式识别引擎。Tesseract 原生支持 UTF-8 编码，开箱即用即可识别超过 100 种语言，并兼容 PNG、JPEG、TIFF 等多种常见图像格式。输出方面，它灵活支持纯文本、hOCR、PDF、TSV 等多种格式，方便后续数据处理。\n\nTesseract 主要面向开发者、研究人员以及需要构建文档处理流程的企业用户。由于它本身是一个命令行工具和库（libtesseract），不包含图形用户界面（GUI），因此最适合具备一定编程能力的技术人员集成到自动化脚本或应用程序中",73286,"2026-04-03T01:56:45",[96,51]]