[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-NielsRogge--Transformers-Tutorials":3,"tool-NielsRogge--Transformers-Tutorials":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":77,"owner_twitter":76,"owner_website":82,"owner_url":83,"languages":84,"stars":93,"forks":94,"last_commit_at":95,"license":96,"difficulty_score":23,"env_os":97,"env_gpu":98,"env_ram":99,"env_deps":100,"category_tags":109,"github_topics":110,"view_count":116,"oss_zip_url":77,"oss_zip_packed_at":77,"status":16,"created_at":117,"updated_at":118,"faqs":119,"releases":149},2364,"NielsRogge\u002FTransformers-Tutorials","Transformers-Tutorials","This repository contains demos I made with the Transformers library by HuggingFace.","Transformers-Tutorials 是一个基于 Hugging Face Transformers 库的开源实战教程集合，旨在通过具体的代码示例帮助开发者快速掌握各类 Transformer 模型的应用。它主要解决了初学者在面对复杂的预训练模型时，不知如何下手进行推理、微调或适配自定义数据集的痛点。\n\n该资源库涵盖了音频分类（AST）、命名实体识别与多标签文本分类（BERT）、图像掩码建模（BEiT）、零样本图像分割（CLIPSeg）以及目标检测（Conditional DETR）等多个前沿领域，所有演示均以 PyTorch 实现，并提供了可直接在 Google Colab 中运行的 Notebook 链接，让用户无需配置本地环境即可立即体验。此外，它还特别推荐了 Hugging Face 的免费课程，帮助用户系统理解从 BERT 到 T5 等多种架构及整个生态系统。\n\nTransformers-Tutorials 非常适合希望深入理解 Transformer 架构的 AI 开发者、研究人员以及正在学习深度学习的学生使用。无论你是想快速验证某个模型的效果，还是需要将先进技术应用到自己的项目中，这里提供的清晰代码和详细注释都能成为你得力的助手，让复杂的模型应用变得简单易懂。","Transformers-Tutorials 是一个基于 Hugging Face Transformers 库的开源实战教程集合，旨在通过具体的代码示例帮助开发者快速掌握各类 Transformer 模型的应用。它主要解决了初学者在面对复杂的预训练模型时，不知如何下手进行推理、微调或适配自定义数据集的痛点。\n\n该资源库涵盖了音频分类（AST）、命名实体识别与多标签文本分类（BERT）、图像掩码建模（BEiT）、零样本图像分割（CLIPSeg）以及目标检测（Conditional DETR）等多个前沿领域，所有演示均以 PyTorch 实现，并提供了可直接在 Google Colab 中运行的 Notebook 链接，让用户无需配置本地环境即可立即体验。此外，它还特别推荐了 Hugging Face 的免费课程，帮助用户系统理解从 BERT 到 T5 等多种架构及整个生态系统。\n\nTransformers-Tutorials 非常适合希望深入理解 Transformer 架构的 AI 开发者、研究人员以及正在学习深度学习的学生使用。无论你是想快速验证某个模型的效果，还是需要将先进技术应用到自己的项目中，这里提供的清晰代码和详细注释都能成为你得力的助手，让复杂的模型应用变得简单易懂。","# Transformers-Tutorials\n\nHi there!\n\nThis repository contains demos I made with the [Transformers library](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) by 🤗 HuggingFace. 
Currently, all of them are implemented in PyTorch.\n\nNOTE: if you are not familiar with HuggingFace and\u002For Transformers, I highly recommend checking out our [free course](https:\u002F\u002Fhuggingface.co\u002Fcourse\u002Fchapter1), which introduces you to several Transformer architectures (such as BERT, GPT-2, T5, BART, etc.) and gives an overview of the HuggingFace libraries, including [Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers), [Tokenizers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftokenizers), [Datasets](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets), [Accelerate](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate) and the [hub](https:\u002F\u002Fhuggingface.co\u002F).\n\nFor an overview of the HuggingFace ecosystem for computer vision (June 2022), refer to [this notebook](https:\u002F\u002Fgithub.com\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FHuggingFace_vision_ecosystem_overview_(June_2022).ipynb) with the corresponding [video](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=oL-xmufhZM8&t=2884s).\n\nCurrently, it contains the following demos:\n* Audio Spectrogram Transformer ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.01778)): \n  - performing inference with `ASTForAudioClassification` to classify audio. [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FAST\u002FInference_with_the_Audio_Spectogram_Transformer_to_classify_audio.ipynb)\n* BERT ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.04805)): \n  - fine-tuning `BertForTokenClassification` on a named entity recognition (NER) dataset. [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FBERT\u002FCustom_Named_Entity_Recognition_with_BERT_only_first_wordpiece.ipynb)\n  - fine-tuning `BertForSequenceClassification` for multi-label text classification. 
[![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FBERT\u002FFine_tuning_BERT_(and_friends)_for_multi_label_text_classification.ipynb)\n* BEiT ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.08254)):\n  - understanding `BeitForMaskedImageModeling` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FBEiT\u002FUnderstanding_BeitForMaskedImageModeling.ipynb)\n* CANINE ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.06874)):\n  - fine-tuning `CanineForSequenceClassification` on IMDb [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FCANINE\u002FFine_tune_CANINE_on_IMDb_(movie_review_binary_classification).ipynb)\n* CLIPSeg ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.10003)):\n  - performing zero-shot image segmentation with `CLIPSeg` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FCLIPSeg\u002FZero_shot_image_segmentation_with_CLIPSeg.ipynb)\n* Conditional DETR ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2108.06152)):\n  - performing inference with `ConditionalDetrForObjectDetection` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FConditional%20DETR\u002FRun_inference_with_Conditional_DETR.ipynb)\n  - fine-tuning `ConditionalDetrForObjectDetection` on a custom dataset (balloon) [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FConditional%20DETR\u002FFine_tuning_Conditional_DETR_on_custom_dataset_(balloon).ipynb)\n* ConvNeXT ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2201.03545)):\n  - fine-tuning (and performing inference with) `ConvNextForImageClassification` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FConvNeXT\u002FFine_tune_ConvNeXT_for_image_classification.ipynb)\n* DINO ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.14294)):\n  - visualize self-attention of Vision Transformers trained using the DINO method [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FDINO\u002FVisualize_self_attention_of_DINO.ipynb)\n* DETR ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.12872)):\n  - performing inference with `DetrForObjectDetection` [![Open In 
Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FDETR\u002FDETR_minimal_example_(with_DetrFeatureExtractor).ipynb)\n  - fine-tuning `DetrForObjectDetection` on a custom object detection dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FDETR\u002FFine_tuning_DetrForObjectDetection_on_custom_dataset_(balloon).ipynb)\n  - evaluating `DetrForObjectDetection` on the COCO detection 2017 validation set [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FDETR\u002FEvaluating_DETR_on_COCO_validation_2017.ipynb)\n  - performing inference with `DetrForSegmentation` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FDETR\u002FDETR_panoptic_segmentation_minimal_example_(with_DetrFeatureExtractor).ipynb)\n  - fine-tuning `DetrForSegmentation` on COCO panoptic 2017 [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FDETR\u002FFine_tuning_DetrForSegmentation_on_custom_dataset_end_to_end_approach.ipynb)\n* DPT ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.13413)):\n  - performing inference with DPT for monocular depth estimation [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FDPT\u002FDPT_inference_notebook_(depth_estimation).ipynb)\n  - performing inference with DPT for semantic segmentation [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FDPT\u002FDPT_inference_notebook_(semantic_segmentation).ipynb)\n* Deformable DETR ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.04159)):\n  - performing inference with `DeformableDetrForObjectDetection` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FDeformable-DETR\u002FInference_with_Deformable_DETR_(CPU).ipynb)\n* DiT ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.02378)):\n  - performing inference with DiT for document image classification [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FDiT\u002FInference_with_DiT_(Document_Image_Transformer)_for_document_image_classification.ipynb)\n* Donut ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.15664)):\n  - performing inference with Donut for 
document image classification [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FDonut\u002FRVL-CDIP\u002FQuick_inference_with_DONUT_for_Document_Image_Classification.ipynb)\n  - fine-tuning Donut for document image classification [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FDonut\u002FRVL-CDIP\u002FFine_tune_Donut_on_toy_RVL_CDIP_(document_image_classification).ipynb)\n  - performing inference with Donut for document visual question answering (DocVQA) [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FDonut\u002FDocVQA\u002FQuick_inference_with_DONUT_for_DocVQA.ipynb)\n  - performing inference with Donut for document parsing [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FDonut\u002FCORD\u002FQuick_inference_with_DONUT_for_Document_Parsing.ipynb)\n  - fine-tuning Donut for document parsing with PyTorch Lightning [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FDonut\u002FCORD\u002FFine_tune_Donut_on_a_custom_dataset_(CORD)_with_PyTorch_Lightning.ipynb)\n* GIT ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2205.14100)):\n  - performing inference with GIT for image\u002Fvideo captioning and image\u002Fvideo question-answering [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FGIT\u002FInference_with_GIT_for_image_video_captioning_and_image_video_QA.ipynb)\n  - fine-tuning GIT on a custom image captioning dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FGIT\u002FFine_tune_GIT_on_an_image_captioning_dataset.ipynb)\n* GLPN ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2201.07436)):\n  - performing inference with `GLPNForDepthEstimation` to illustrate monocular depth estimation [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FGLPN\u002FGLPN_inference_(depth_estimation).ipynb)\n* GPT-J-6B ([repository](https:\u002F\u002Fgithub.com\u002Fkingoflolz\u002Fmesh-transformer-jax)):\n  - performing inference with `GPTJForCausalLM` to illustrate few-shot learning and code generation [![Open In 
Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FGPT-J-6B\u002FInference_with_GPT_J_6B.ipynb)\n* GroupViT ([repository](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FGroupViT)):\n  - performing inference with `GroupViTModel` to illustrate zero-shot semantic segmentation [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FGroupViT\u002FInference_with_GroupViT_for_zero_shot_semantic_segmentation.ipynb)\n* ImageGPT ([blog post](https:\u002F\u002Fopenai.com\u002Fblog\u002Fimage-gpt\u002F)):\n  - (un)conditional image generation with `ImageGPTForCausalLM` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FImageGPT\u002F(Un)conditional_image_generation_with_ImageGPT.ipynb)\n  - linear probing with ImageGPT [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FImageGPT\u002FLinear_probing_with_ImageGPT.ipynb)\n* LUKE ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.01057)):\n  - fine-tuning `LukeForEntityPairClassification` on a custom relation extraction dataset using PyTorch Lightning [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLUKE\u002FSupervised_relation_extraction_with_LukeForEntityPairClassification.ipynb)\n* LayoutLM ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F1912.13318)): \n  - fine-tuning `LayoutLMForTokenClassification` on the [FUNSD](https:\u002F\u002Fguillaumejaume.github.io\u002FFUNSD\u002F) dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLM\u002FFine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb)\n  - fine-tuning `LayoutLMForSequenceClassification` on the [RVL-CDIP](https:\u002F\u002Fwww.cs.cmu.edu\u002F~aharley\u002Frvl-cdip\u002F) dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLM\u002FFine_tuning_LayoutLMForSequenceClassification_on_RVL_CDIP.ipynb)\n  - adding image embeddings to LayoutLM during fine-tuning on the [FUNSD](https:\u002F\u002Fguillaumejaume.github.io\u002FFUNSD\u002F) dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLM\u002FAdd_image_embeddings_to_LayoutLM.ipynb)\n* LayoutLMv2 ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.14740)):\n  - fine-tuning `LayoutLMv2ForSequenceClassification` on RVL-CDIP [![Open In 
Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLMv2\u002FRVL-CDIP\u002FFine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb)\n  - fine-tuning `LayoutLMv2ForTokenClassification` on FUNSD [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLMv2\u002FFUNSD\u002FFine_tuning_LayoutLMv2ForTokenClassification_on_FUNSD.ipynb)\n  - fine-tuning `LayoutLMv2ForTokenClassification` on FUNSD using the 🤗 Trainer [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLMv2\u002FFUNSD\u002FFine_tuning_LayoutLMv2ForTokenClassification_on_FUNSD_using_HuggingFace_Trainer.ipynb)\n  - performing inference with `LayoutLMv2ForTokenClassification` on FUNSD [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLMv2\u002FFUNSD\u002FInference_with_LayoutLMv2ForTokenClassification.ipynb)\n  - true inference with `LayoutLMv2ForTokenClassification` (when no labels are available) + Gradio demo [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLMv2\u002FFUNSD\u002FTrue_inference_with_LayoutLMv2ForTokenClassification_%2B_Gradio_demo.ipynb)\n  - fine-tuning `LayoutLMv2ForTokenClassification` on CORD [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLMv2\u002FCORD\u002FFine_tuning_LayoutLMv2ForTokenClassification_on_CORD.ipynb)\n  - fine-tuning `LayoutLMv2ForQuestionAnswering` on DOCVQA [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLMv2\u002FDocVQA\u002FFine_tuning_LayoutLMv2ForQuestionAnswering_on_DocVQA.ipynb)\n* LayoutLMv3 ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.08387)): \n  - fine-tuning `LayoutLMv3ForTokenClassification` on the [FUNSD](https:\u002F\u002Fguillaumejaume.github.io\u002FFUNSD\u002F) dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLMv3\u002FFine_tune_LayoutLMv3_on_FUNSD_(HuggingFace_Trainer).ipynb)\n* LayoutXLM ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.08836)): \n  - fine-tuning LayoutXLM on the [XFUND](https:\u002F\u002Fgithub.com\u002Fdoc-analysis\u002FXFUND) benchmark for token classification [![Open In 
Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutXLM\u002FFine_tuning_LayoutXLM_on_XFUND_for_token_classification_using_HuggingFace_Trainer.ipynb)\n  - fine-tuning LayoutXLM on the [XFUND](https:\u002F\u002Fgithub.com\u002Fdoc-analysis\u002FXFUND) benchmark for relation extraction [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutXLM\u002FFine_tune_LayoutXLM_on_XFUND_(relation_extraction).ipynb)\n* MarkupLM ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.08518)):\n  - inference with MarkupLM to perform question answering on web pages [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FMarkupLM\u002FInference_with_MarkupLM_for_question_answering_on_web_pages.ipynb)\n  - fine-tuning `MarkupLMForTokenClassification` on a toy dataset for NER on web pages [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FMarkupLM\u002FFine_tune_MarkupLMForTokenClassification_on_a_custom_dataset.ipynb)\n* Mask2Former ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.01527)):\n  - performing inference with `Mask2Former` for universal image segmentation: [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FMask2Former\u002FInference_with_Mask2Former.ipynb)\n* MaskFormer ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2107.06278)):\n  - performing inference with `MaskFormer` (both semantic and panoptic segmentation): [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FMaskFormer\u002Fmaskformer_minimal_example(with_MaskFormerFeatureExtractor).ipynb)\n  - fine-tuning `MaskFormer` on a custom dataset for semantic segmentation [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FMaskFormer\u002FFine_tune_MaskFormer_on_custom_dataset.ipynb)\n* OneFormer ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.06220)):\n  - performing inference with `OneFormer` for universal image segmentation: [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FOneFormer\u002FInference_with_OneFormer.ipynb)\n* Perceiver IO ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2107.14795)):\n  - showcasing masked language modeling and image classification with the Perceiver [![Open In 
Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FPerceiver\u002FPerceiver_for_masked_language_modeling_and_image_classification.ipynb)\n  - fine-tuning the Perceiver for image classification [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FPerceiver\u002FFine_tune_the_Perceiver_for_image_classification.ipynb)\n  - fine-tuning the Perceiver for text classification [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FPerceiver\u002FFine_tune_Perceiver_for_text_classification.ipynb)\n  - predicting optical flow between a pair of images with `PerceiverForOpticalFlow` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FPerceiver\u002FPerceiver_for_Optical_Flow.ipynb)\n  - auto-encoding a video (images, audio, labels) with `PerceiverForMultimodalAutoencoding` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FPerceiver\u002FPerceiver_for_Multimodal_Autoencoding.ipynb)\n* SAM ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.02643)):\n  - performing inference with MedSAM [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FSAM\u002FRun_inference_with_MedSAM_using_HuggingFace_Transformers.ipynb)\n  - fine-tuning `SamModel` on a custom dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FSAM\u002FFine_tune_SAM_(segment_anything)_on_a_custom_dataset.ipynb)\n* SegFormer ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2105.15203)):\n  - performing inference with `SegformerForSemanticSegmentation` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FSegFormer\u002FSegformer_inference_notebook.ipynb)\n  - fine-tuning `SegformerForSemanticSegmentation` on custom data using native PyTorch [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FSegFormer\u002FFine_tune_SegFormer_on_custom_dataset.ipynb)\n* T5 ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683)):\n  - fine-tuning `T5ForConditionalGeneration` on a Dutch summarization dataset on TPU using HuggingFace Accelerate [![Open In 
Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Ftree\u002Fmaster\u002FT5)\n  - fine-tuning `T5ForConditionalGeneration` (CodeT5) for Ruby code summarization using PyTorch Lightning [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FT5\u002FFine_tune_CodeT5_for_generating_docstrings_from_Ruby_code.ipynb)\n* TAPAS ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2004.02349)):  \n  - fine-tuning `TapasForQuestionAnswering` on the Microsoft [Sequential Question Answering (SQA)](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fdownload\u002Fdetails.aspx?id=54253) dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FTAPAS\u002FFine_tuning_TapasForQuestionAnswering_on_SQA.ipynb)\n  - evaluating `TapasForSequenceClassification` on the [Table Fact Checking (TabFact)](https:\u002F\u002Ftabfact.github.io\u002F) dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FTAPAS\u002FEvaluating_TAPAS_on_the_Tabfact_test_set.ipynb)\n* Table Transformer ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.00061)):\n  - using the Table Transformer for table detection and table structure recognition [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FTable%20Transformer\u002FUsing_Table_Transformer_for_table_detection_and_table_structure_recognition.ipynb)\n* TrOCR ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2109.10282)):\n  - performing inference with `TrOCR` to illustrate optical character recognition with Transformers, as well as making a Gradio demo [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FTrOCR\u002FInference_with_TrOCR_%2B_Gradio_demo.ipynb)\n  - fine-tuning `TrOCR` on the IAM dataset using the Seq2SeqTrainer [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FTrOCR\u002FFine_tune_TrOCR_on_IAM_Handwriting_Database_using_Seq2SeqTrainer.ipynb)\n  - fine-tuning `TrOCR` on the IAM dataset using native PyTorch [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FTrOCR\u002FFine_tune_TrOCR_on_IAM_Handwriting_Database_using_native_PyTorch.ipynb)\n  - evaluating `TrOCR` on the IAM test set [![Open In 
Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FTrOCR\u002FEvaluating_TrOCR_base_handwritten_on_the_IAM_test_set.ipynb)\n* UPerNet ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F1807.10221)):\n  - performing inference with `UperNetForSemanticSegmentation` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FUPerNet\u002FPerform_inference_with_UperNetForSemanticSegmentation_(Swin_backbone).ipynb)\n* VideoMAE ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.12602)):\n  - performing inference with `VideoMAEForVideoClassification` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FVideoMAE\u002FQuick_inference_with_VideoMAE.ipynb)\n* ViLT ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.03334)):\n  - fine-tuning `ViLT` for visual question answering (VQA) [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FViLT\u002FFine_tuning_ViLT_for_VQA.ipynb)\n  - performing inference with `ViLT` to illustrate visual question answering (VQA) [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FViLT\u002FInference_with_ViLT_(visual_question_answering).ipynb)\n  - masked language modeling (MLM) with a pre-trained `ViLT` model [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FViLT\u002FMasked_language_modeling_with_ViLT.ipynb)\n  - performing inference with `ViLT` for image-text retrieval [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FViLT\u002FUsing_ViLT_for_image_text_retrieval.ipynb)\n  - performing inference with `ViLT` to illustrate natural language for visual reasoning (NLVR) [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FViLT\u002FViLT_for_natural_language_visual_reasoning.ipynb)\n* ViTMAE ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.06377)):\n  - reconstructing pixel values with `ViTMAEForPreTraining` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FViTMAE\u002FViT_MAE_visualization_demo.ipynb)\n* Vision Transformer ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.11929)):\n  - performing inference with `ViTForImageClassification` [![Open In 
Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FVisionTransformer\u002FQuick_demo_of_HuggingFace_version_of_Vision_Transformer_inference.ipynb)\n  - fine-tuning `ViTForImageClassification` on [CIFAR-10](https:\u002F\u002Fwww.cs.toronto.edu\u002F~kriz\u002Fcifar.html) using PyTorch Lightning [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FVisionTransformer\u002FFine_tuning_the_Vision_Transformer_on_CIFAR_10_with_PyTorch_Lightning.ipynb)\n  - fine-tuning `ViTForImageClassification` on [CIFAR-10](https:\u002F\u002Fwww.cs.toronto.edu\u002F~kriz\u002Fcifar.html) using the 🤗 Trainer [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FVisionTransformer\u002FFine_tuning_the_Vision_Transformer_on_CIFAR_10_with_the_%F0%9F%A4%97_Trainer.ipynb)\n* X-CLIP ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2208.02816)):\n  - performing zero-shot video classification with X-CLIP [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FX-CLIP\u002FVideo_text_matching_with_X_CLIP.ipynb)\n  - zero-shot classifying a YouTube video with X-CLIP [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FX-CLIP\u002FZero_shot_classify_a_YouTube_video_with_X_CLIP.ipynb)\n* YOLOS ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.00666)):\n  - fine-tuning `YolosForObjectDetection` on a custom dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FYOLOS\u002FFine_tuning_YOLOS_for_object_detection_on_custom_dataset_(balloon).ipynb)\n  - inference with `YolosForObjectDetection` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FYOLOS\u002FYOLOS_minimal_inference_example.ipynb)\n\n... more to come! 
🤗 \n\nIf you have any questions regarding these demos, feel free to open an issue on this repository.\n\nBtw, I was also the main contributor who added the following algorithms to the library:\n- TAbular PArSing (TAPAS) by Google AI\n- Vision Transformer (ViT) by Google AI\n- DINO by Facebook AI\n- Data-efficient Image Transformers (DeiT) by Facebook AI\n- LUKE by Studio Ousia\n- DEtection TRansformers (DETR) by Facebook AI\n- CANINE by Google AI\n- BEiT by Microsoft Research\n- LayoutLMv2 (and LayoutXLM) by Microsoft Research\n- TrOCR by Microsoft Research\n- SegFormer by NVIDIA\n- ImageGPT by OpenAI\n- Perceiver by Deepmind\n- MAE by Facebook AI\n- ViLT by NAVER AI Lab\n- ConvNeXT by Facebook AI\n- DiT by Microsoft Research\n- GLPN by KAIST\n- DPT by Intel Labs\n- YOLOS by School of EIC, Huazhong University of Science & Technology\n- TAPEX by Microsoft Research\n- LayoutLMv3 by Microsoft Research\n- VideoMAE by Multimedia Computing Group, Nanjing University\n- X-CLIP by Microsoft Research\n- MarkupLM by Microsoft Research\n\nAll of them were an incredible learning experience. I can recommend that anyone contribute an AI algorithm to the library!\n\n## Data preprocessing\nRegarding preparing your data for a PyTorch model, there are a few options:\n- a native PyTorch dataset + dataloader. This is the standard way to prepare data for a PyTorch model, namely by subclassing `torch.utils.data.Dataset`, and then creating a corresponding `DataLoader` (an iterable that lets you loop over the items of a dataset in batches). When subclassing the `Dataset` class, one needs to implement 3 methods: `__init__`, `__len__` (which returns the number of examples of the dataset) and `__getitem__` (which returns an example of the dataset, given an integer index). Here's an example of creating a basic text classification dataset (assuming one has a CSV that contains 2 columns, namely \"text\" and \"label\"):\n\n```python\nimport torch\nfrom torch.utils.data import Dataset\n\nclass CustomTrainDataset(Dataset):\n    def __init__(self, df, tokenizer):\n        self.df = df\n        self.tokenizer = tokenizer\n\n    def __len__(self):\n        return len(self.df)\n\n    def __getitem__(self, idx):\n        # get item\n        item = self.df.iloc[idx]\n        text = item['text']\n        label = item['label']\n        # encode text\n        encoding = self.tokenizer(text, padding=\"max_length\", max_length=128, truncation=True, return_tensors=\"pt\")\n        # remove batch dimension which the tokenizer automatically adds\n        encoding = {k:v.squeeze() for k,v in encoding.items()}\n        # add label\n        encoding[\"label\"] = torch.tensor(label)\n        \n        return encoding\n```\n\nInstantiating the dataset then happens as follows:\n\n```python\nfrom transformers import BertTokenizer\nimport pandas as pd\n\ntokenizer = BertTokenizer.from_pretrained(\"bert-base-uncased\")\ndf = pd.read_csv(\"path_to_your_csv\")\n\ntrain_dataset = CustomTrainDataset(df=df, tokenizer=tokenizer)\n```\n\nAccessing the first example of the dataset can then be done as follows:\n\n```python\nencoding = train_dataset[0]\n```\n\nIn practice, one creates a corresponding `DataLoader`, which lets you get batches from the dataset:\n\n```python\nfrom torch.utils.data import DataLoader\n\ntrain_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)\n```\nI often check whether the data is created correctly by fetching the first batch from the data loader, and then printing out the shapes of the tensors, decoding the input_ids back to text, etc.
\n\n```python\nbatch = next(iter(train_dataloader))\nfor k,v in batch.items():\n    print(k, v.shape)\n# decode the input_ids of the first example of the batch\nprint(tokenizer.decode(batch['input_ids'][0].tolist()))\n```\n- [HuggingFace Datasets](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002F). Datasets is a library by HuggingFace that lets you easily load and process data in a very fast and memory-efficient way. It is backed by [Apache Arrow](https:\u002F\u002Farrow.apache.org\u002F), and has cool features such as memory-mapping, which allows you to load data into RAM only when it is required. It also has deep interoperability with the [HuggingFace hub](https:\u002F\u002Fhuggingface.co\u002Fdatasets), making it easy to load well-known datasets as well as share your own with the community.\n\nLoading a custom dataset as a Dataset object can be done as follows (you can install datasets using `pip install datasets`):\n```python\nfrom datasets import load_dataset\n\ndataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'], 'test': 'my_test_file.csv'})\n```\nHere I'm loading local CSV files, but other formats are supported as well (including JSON, Parquet and txt), as is loading data from a local Pandas dataframe or dictionary, for instance. You can check out the [docs](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Floading.html#local-and-remote-files) for all details. A short tokenization sketch follows below.
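\n\nOnce loaded, you typically tokenize such a Dataset with its `.map()` method. Here's a minimal sketch, assuming the same \"text\" and \"label\" CSV columns as above and reusing the `tokenizer` defined earlier:\n\n```python\n# tokenize all splits in batches; the datasets library caches the results on disk\nencoded = dataset.map(\n    lambda examples: tokenizer(examples[\"text\"], padding=\"max_length\", max_length=128, truncation=True),\n    batched=True,\n)\n# expose the relevant columns as PyTorch tensors, ready for a DataLoader\nencoded.set_format(type=\"torch\", columns=[\"input_ids\", \"attention_mask\", \"label\"])\n```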
\n\n## Training frameworks\nRegarding fine-tuning Transformer models (or more generally, PyTorch models), there are a few options:\n- using native PyTorch. This is the most basic way to train a model, and requires the user to write the training loop manually. The advantage is that this is very easy to debug. The disadvantage is that one needs to implement all the training logic oneself, such as setting the model in the appropriate mode (`model.train()`\u002F`model.eval()`), handling device placement (`model.to(device)`), etc. A typical training loop in PyTorch looks as follows (inspired by [this great PyTorch intro tutorial]()):\n\n```python\nimport torch\nfrom transformers import BertForSequenceClassification\n\n# Instantiate pre-trained BERT model with randomly initialized classification head\nmodel = BertForSequenceClassification.from_pretrained(\"bert-base-uncased\")\n\n# I almost always use a learning rate of 5e-5 when fine-tuning Transformer based models\noptimizer = torch.optim.Adam(model.parameters(), lr=5e-5)\n\n# put model on GPU, if available\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nmodel.to(device)\n\nepochs = 3  # number of passes over the training data\nfor epoch in range(epochs):\n    model.train()\n    train_loss = 0.0\n    for batch in train_dataloader:\n        # put batch on device\n        batch = {k:v.to(device) for k,v in batch.items()}\n        \n        # forward pass\n        outputs = model(**batch)\n        loss = outputs.loss\n        \n        train_loss += loss.item()\n        \n        loss.backward()\n        optimizer.step()\n        optimizer.zero_grad()\n\n    print(f\"Loss after epoch {epoch}:\", train_loss\u002Flen(train_dataloader))\n    \n    model.eval()\n    val_loss = 0.0\n    with torch.no_grad():\n        for batch in eval_dataloader:\n            # put batch on device\n            batch = {k:v.to(device) for k,v in batch.items()}\n            \n            # forward pass\n            outputs = model(**batch)\n            loss = outputs.loss\n            \n            val_loss += loss.item()\n        \n    print(f\"Validation loss after epoch {epoch}:\", val_loss\u002Flen(eval_dataloader))\n```\n\n- [PyTorch Lightning (PL)](https:\u002F\u002Fwww.pytorchlightning.ai\u002F). PyTorch Lightning is a framework that automates the training loop written above, by abstracting it away in a Trainer object. Users don't need to write the training loop themselves anymore; instead, they can just do `trainer = Trainer()` and then `trainer.fit(model)`. The advantage is that you can start training models very quickly (hence the name lightning), as all training-related code is handled by the `Trainer` object. The disadvantage is that it may be more difficult to debug your model, as the training and evaluation are now abstracted away.\n- [HuggingFace Trainer](https:\u002F\u002Fhuggingface.co\u002Ftransformers\u002Fmain_classes\u002Ftrainer.html). The HuggingFace Trainer API can be seen as a framework similar to PyTorch Lightning in the sense that it also abstracts the training away using a Trainer object. However, contrary to PyTorch Lightning, it is not meant to be a general framework. Rather, it is made especially for fine-tuning Transformer-based models available in the HuggingFace Transformers library. The Trainer also has an extension called `Seq2SeqTrainer` for encoder-decoder models, such as BART, T5 and the `EncoderDecoderModel` classes. Note that all [PyTorch example scripts](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Ftree\u002Fmaster\u002Fexamples\u002Fpytorch) of the Transformers library make use of the Trainer.\n- [HuggingFace Accelerate](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate): Accelerate is a newer project, made for people who still want to write their own training loop (as shown above), but would like to have it work automatically regardless of the hardware (e.g. multiple GPUs, TPU pods, mixed precision, etc.). Minimal sketches of the Trainer and Accelerate options follow below.
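\n\nTo give an idea of the Trainer option, here's a minimal sketch, reusing `model` and `train_dataset` from above and assuming a matching `eval_dataset` built the same way; the exact `TrainingArguments` you'll want depend on your task:\n\n```python\nfrom transformers import Trainer, TrainingArguments\n\n# output_dir is where checkpoints get written\nargs = TrainingArguments(\n    output_dir=\"checkpoints\",\n    num_train_epochs=3,\n    per_device_train_batch_size=4,\n    learning_rate=5e-5,\n)\n\ntrainer = Trainer(\n    model=model,\n    args=args,\n    train_dataset=train_dataset,\n    eval_dataset=eval_dataset,  # assumed to exist, created like train_dataset\n)\ntrainer.train()\n```\n\nAnd a minimal sketch of the Accelerate option, which keeps the handwritten loop from above but lets the library take care of device placement:\n\n```python\nfrom accelerate import Accelerator\n\naccelerator = Accelerator()\n# prepare() wraps the model, optimizer and dataloader for whatever hardware is available\nmodel, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)\n\nmodel.train()\nfor batch in train_dataloader:\n    outputs = model(**batch)  # no manual .to(device) needed\n    loss = outputs.loss\n    accelerator.backward(loss)  # replaces loss.backward()\n    optimizer.step()\n    optimizer.zero_grad()\n```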
\n\n## Citation\n\nFeel free to cite me when you use some of my tutorials :)\n\n```bibtex\n@misc{rogge2025transformerstutorials,\n  author = {Rogge, Niels},\n  title = {Transformers-Tutorials},\n  url = {https:\u002F\u002Fgithub.com\u002FNielsRogge\u002FTransformers-Tutorials},\n  year = {2025}\n}\n```\n","# Transformers 教程\n\n你好！\n\n这个仓库包含了我使用 🤗 HuggingFace 的 [Transformers 库](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers)制作的一些示例。目前，所有示例都是用 PyTorch 实现的。\n\n注意：如果你对 HuggingFace 和\u002F或 Transformers 还不熟悉，强烈建议你先学习我们的[免费课程](https:\u002F\u002Fhuggingface.co\u002Fcourse\u002Fchapter1)，它会介绍多种 Transformer 架构（如 BERT、GPT-2、T5、BART 等），以及 HuggingFace 生态系统中各个库的概览，包括 [Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers)、[Tokenizers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftokenizers)、[Datasets](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fdatasets)、[Accelerate](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate) 和 [hub](https:\u002F\u002Fhuggingface.co\u002F)。\n\n关于 HuggingFace 计算机视觉生态系统的概述（2022年6月），请参考[这篇 Notebook](https:\u002F\u002Fgithub.com\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FHuggingFace_vision_ecosystem_overview_(June_2022).ipynb)，并观看对应的[视频](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=oL-xmufhZM8&t=2884s)。
Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FGPT-J-6B\u002FInference_with_GPT_J_6B.ipynb)\n* GroupViT ([repository](https:\u002F\u002Fgithub.com\u002FNVlabs\u002FGroupViT)):\n  - performing inference with `GroupViTModel` to illustrate zero-shot semantic segmentation [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FGroupViT\u002FInference_with_GroupViT_for_zero_shot_semantic_segmentation.ipynb)\n* ImageGPT ([blog post](https:\u002F\u002Fopenai.com\u002Fblog\u002Fimage-gpt\u002F)):\n  - (un)conditional image generation with `ImageGPTForCausalLM` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FImageGPT\u002F(Un)conditional_image_generation_with_ImageGPT.ipynb)\n  - linear probing with ImageGPT [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FImageGPT\u002FLinear_probing_with_ImageGPT.ipynb)\n* LUKE ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.01057)):\n  - fine-tuning `LukeForEntityPairClassification` on a custom relation extraction dataset using PyTorch Lightning [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLUKE\u002FSupervised_relation_extraction_with_LukeForEntityPairClassification.ipynb)\n* LayoutLM ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F1912.13318)): \n  - fine-tuning `LayoutLMForTokenClassification` on the [FUNSD](https:\u002F\u002Fguillaumejaume.github.io\u002FFUNSD\u002F) dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLM\u002FFine_tuning_LayoutLMForTokenClassification_on_FUNSD.ipynb)\n  - fine-tuning `LayoutLMForSequenceClassification` on the [RVL-CDIP](https:\u002F\u002Fwww.cs.cmu.edu\u002F~aharley\u002Frvl-cdip\u002F) dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLM\u002FFine_tuning_LayoutLMForSequenceClassification_on_RVL_CDIP.ipynb)\n  - adding image embeddings to LayoutLM during fine-tuning on the [FUNSD](https:\u002F\u002Fguillaumejaume.github.io\u002FFUNSD\u002F) dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLM\u002FAdd_image_embeddings_to_LayoutLM.ipynb)\n* LayoutLMv2 ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2012.14740)):\n  - fine-tuning `LayoutLMv2ForSequenceClassification` on RVL-CDIP [![Open In 
Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLMv2\u002FRVL-CDIP\u002FFine_tuning_LayoutLMv2ForSequenceClassification_on_RVL_CDIP.ipynb)\n  - fine-tuning `LayoutLMv2ForTokenClassification` on FUNSD [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLMv2\u002FFUNSD\u002FFine_tuning_LayoutLMv2ForTokenClassification_on_FUNSD.ipynb)\n  - fine-tuning `LayoutLMv2ForTokenClassification` on FUNSD using the 🤗 Trainer [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLMv2\u002FFUNSD\u002FFine_tuning_LayoutLMv2ForTokenClassification_on_FUNSD_using_HuggingFace_Trainer.ipynb)\n  - performing inference with `LayoutLMv2ForTokenClassification` on FUNSD [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLMv2\u002FFUNSD\u002FInference_with_LayoutLMv2ForTokenClassification.ipynb)\n  - true inference with `LayoutLMv2ForTokenClassification` (when no labels are available) + Gradio demo [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLMv2\u002FFUNSD\u002FTrue_inference_with_LayoutLMv2ForTokenClassification_%2B_Gradio_demo.ipynb)\n  - fine-tuning `LayoutLMv2ForTokenClassification` on CORD [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLMv2\u002FCORD\u002FFine_tuning_LayoutLMv2ForTokenClassification_on_CORD.ipynb)\n  - fine-tuning `LayoutLMv2ForQuestionAnswering` on DOCVQA [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLMv2\u002FDocVQA\u002FFine_tuning_LayoutLMv2ForQuestionAnswering_on_DocVQA.ipynb)\n* LayoutLMv3 ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2204.08387)): \n  - fine-tuning `LayoutLMv3ForTokenClassification` on the [FUNSD](https:\u002F\u002Fguillaumejaume.github.io\u002FFUNSD\u002F) dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutLMv3\u002FFine_tune_LayoutLMv3_on_FUNSD_(HuggingFace_Trainer).ipynb)\n* LayoutXLM ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.08836)): \n  - fine-tuning LayoutXLM on the [XFUND](https:\u002F\u002Fgithub.com\u002Fdoc-analysis\u002FXFUND) benchmark for token classification [![Open In 
Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutXLM\u002FFine_tuning_LayoutXLM_on_XFUND_for_token_classification_using_HuggingFace_Trainer.ipynb)\n  - fine-tuning LayoutXLM on the [XFUND](https:\u002F\u002Fgithub.com\u002Fdoc-analysis\u002FXFUND) benchmark for relation extraction [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FLayoutXLM\u002FFine_tune_LayoutXLM_on_XFUND_(relation_extraction).ipynb)\n* MarkupLM ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.08518)):\n  - inference with MarkupLM to perform question answering on web pages [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FMarkupLM\u002FInference_with_MarkupLM_for_question_answering_on_web_pages.ipynb)\n  - fine-tuning `MarkupLMForTokenClassification` on a toy dataset for NER on web pages [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FMarkupLM\u002FFine_tune_MarkupLMForTokenClassification_on_a_custom_dataset.ipynb)\n* Mask2Former ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2112.01527)):\n  - performing inference with `Mask2Former` for universal image segmentation: [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FMask2Former\u002FInference_with_Mask2Former.ipynb)\n* MaskFormer ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2107.06278)):\n  - performing inference with `MaskFormer` (both semantic and panoptic segmentation): [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FMaskFormer\u002Fmaskformer_minimal_example(with_MaskFormerFeatureExtractor).ipynb)\n  - fine-tuning `MaskFormer` on a custom dataset for semantic segmentation [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FMaskFormer\u002FFine_tune_MaskFormer_on_custom_dataset.ipynb)\n* OneFormer ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2211.06220)):\n  - performing inference with `OneFormer` for universal image segmentation: [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FOneFormer\u002FInference_with_OneFormer.ipynb)\n* Perceiver IO ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2107.14795)):\n  - showcasing masked language modeling and image classification with the Perceiver [![Open In 
Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FPerceiver\u002FPerceiver_for_masked_language_modeling_and_image_classification.ipynb)\n  - fine-tuning the Perceiver for image classification [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FPerceiver\u002FFine_tune_the_Perceiver_for_image_classification.ipynb)\n  - fine-tuning the Perceiver for text classification [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FPerceiver\u002FFine_tune_Perceiver_for_text_classification.ipynb)\n  - predicting optical flow between a pair of images with `PerceiverForOpticalFlow` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FPerceiver\u002FPerceiver_for_Optical_Flow.ipynb)\n  - auto-encoding a video (images, audio, labels) with `PerceiverForMultimodalAutoencoding` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FPerceiver\u002FPerceiver_for_Multimodal_Autoencoding.ipynb)\n* SAM ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2304.02643)):\n  - performing inference with MedSAM [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FSAM\u002FRun_inference_with_MedSAM_using_HuggingFace_Transformers.ipynb)\n  - fine-tuning `SamModel` on a custom dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FSAM\u002FFine_tune_SAM_(segment_anything)_on_a_custom_dataset.ipynb)\n* SegFormer ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2105.15203)):\n  - performing inference with `SegformerForSemanticSegmentation` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FSegFormer\u002FSegformer_inference_notebook.ipynb)\n  - fine-tuning `SegformerForSemanticSegmentation` on custom data using native PyTorch [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FSegFormer\u002FFine_tune_SegFormer_on_custom_dataset.ipynb)\n* T5 ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F1910.10683)):\n  - fine-tuning `T5ForConditionalGeneration` on a Dutch summarization dataset on TPU using HuggingFace Accelerate [![Open In 
Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Ftree\u002Fmaster\u002FT5)\n  - fine-tuning `T5ForConditionalGeneration` (CodeT5) for Ruby code summarization using PyTorch Lightning [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FT5\u002FFine_tune_CodeT5_for_generating_docstrings_from_Ruby_code.ipynb)\n* TAPAS ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2004.02349)):  \n  - fine-tuning `TapasForQuestionAnswering` on the Microsoft [Sequential Question Answering (SQA)](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fdownload\u002Fdetails.aspx?id=54253) dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FTAPAS\u002FFine_tuning_TapasForQuestionAnswering_on_SQA.ipynb)\n  - evaluating `TapasForSequenceClassification` on the [Table Fact Checking (TabFact)](https:\u002F\u002Ftabfact.github.io\u002F) dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FTAPAS\u002FEvaluating_TAPAS_on_the_Tabfact_test_set.ipynb)\n* Table Transformer ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.00061)):\n  - using the Table Transformer for table detection and table structure recognition [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FTable%20Transformer\u002FUsing_Table_Transformer_for_table_detection_and_table_structure_recognition.ipynb)\n* TrOCR ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2109.10282)):\n  - performing inference with `TrOCR` to illustrate optical character recognition with Transformers, as well as making a Gradio demo [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FTrOCR\u002FInference_with_TrOCR_%2B_Gradio_demo.ipynb)\n  - fine-tuning `TrOCR` on the IAM dataset using the Seq2SeqTrainer [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FTrOCR\u002FFine_tune_TrOCR_on_IAM_Handwriting_Database_using_Seq2SeqTrainer.ipynb)\n  - fine-tuning `TrOCR` on the IAM dataset using native PyTorch [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FTrOCR\u002FFine_tune_TrOCR_on_IAM_Handwriting_Database_using_native_PyTorch.ipynb)\n  - evaluating `TrOCR` on the IAM test set [![Open In 
Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FTrOCR\u002FEvaluating_TrOCR_base_handwritten_on_the_IAM_test_set.ipynb)\n* UPerNet ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F1807.10221)):\n  - performing inference with `UperNetForSemanticSegmentation` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FUPerNet\u002FPerform_inference_with_UperNetForSemanticSegmentation_(Swin_backbone).ipynb)\n* VideoMAE ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.12602)):\n  - performing inference with `VideoMAEForVideoClassification` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FVideoMAE\u002FQuick_inference_with_VideoMAE.ipynb)\n* ViLT ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2102.03334)):\n  - fine-tuning `ViLT` for visual question answering (VQA) [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FViLT\u002FFine_tuning_ViLT_for_VQA.ipynb)\n  - performing inference with `ViLT` to illustrate visual question answering (VQA) [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FViLT\u002FInference_with_ViLT_(visual_question_answering).ipynb)\n  - masked language modeling (MLM) with a pre-trained `ViLT` model [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FViLT\u002FMasked_language_modeling_with_ViLT.ipynb)\n  - performing inference with `ViLT` for image-text retrieval [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FViLT\u002FUsing_ViLT_for_image_text_retrieval.ipynb)\n  - performing inference with `ViLT` to illustrate natural language for visual reasoning (NLVR) [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FViLT\u002FViLT_for_natural_language_visual_reasoning.ipynb)\n* ViTMAE ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.06377)):\n  - reconstructing pixel values with `ViTMAEForPreTraining` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FViTMAE\u002FViT_MAE_visualization_demo.ipynb)\n* Vision Transformer ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.11929)):\n  - performing inference with `ViTForImageClassification` [![Open In 
Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FVisionTransformer\u002FQuick_demo_of_HuggingFace_version_of_Vision_Transformer_inference.ipynb)\n  - fine-tuning `ViTForImageClassification` on [CIFAR-10](https:\u002F\u002Fwww.cs.toronto.edu\u002F~kriz\u002Fcifar.html) using PyTorch Lightning [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FVisionTransformer\u002FFine_tuning_the_Vision_Transformer_on_CIFAR_10_with_PyTorch_Lightning.ipynb)\n  - fine-tuning `ViTForImageClassification` on [CIFAR-10](https:\u002F\u002Fwww.cs.toronto.edu\u002F~kriz\u002Fcifar.html) using the 🤗 Trainer [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FVisionTransformer\u002FFine_tuning_the_Vision_Transformer_on_CIFAR_10_with_the_%F0%9F%A4%97_Trainer.ipynb)\n* X-CLIP ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2208.02816)):\n  - performing zero-shot video classification with X-CLIP [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FX-CLIP\u002FVideo_text_matching_with_X_CLIP.ipynb)\n  - zero-shot classifying a YouTube video with X-CLIP [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FX-CLIP\u002FZero_shot_classify_a_YouTube_video_with_X_CLIP.ipynb)\n* YOLOS ([paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2106.00666)):\n  - fine-tuning `YolosForObjectDetection` on a custom dataset [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FYOLOS\u002FFine_tuning_YOLOS_for_object_detection_on_custom_dataset_(balloon).ipynb)\n  - inference with `YolosForObjectDetection` [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FYOLOS\u002FYOLOS_minimal_inference_example.ipynb)\n\n… 更多内容即将发布！🤗\n\n如果您对这些演示有任何疑问，请随时在此仓库中提交一个问题。\n\n顺便说一下，我也是将以下算法添加到该库的主要贡献者：\n- Google AI 的 TAbular PArSing (TAPAS)\n- Google AI 的 Vision Transformer (ViT)\n- Facebook AI 的 DINO\n- Facebook AI 的 Data-efficient Image Transformers (DeiT)\n- Studio Ousia 的 LUKE\n- Facebook AI 的 DEtection TRansformers (DETR)\n- Google AI 的 CANINE\n- Microsoft Research 的 BEiT\n- Microsoft Research 的 LayoutLMv2（以及 LayoutXLM）\n- Microsoft Research 的 TrOCR\n- NVIDIA 的 SegFormer\n- OpenAI 的 ImageGPT\n- Deepmind 的 Perceiver\n- Facebook AI 的 MAE\n- NAVER AI Lab 的 ViLT\n- Facebook AI 的 ConvNeXT\n- Microsoft Research 的 DiT\n- KAIST 的 GLPN\n- Intel Labs 的 DPT\n- 华中科技大学 EIC 学院的 YOLOS\n- Microsoft Research 的 TAPEX\n- Microsoft Research 的 LayoutLMv3\n- 南京大学多媒体计算组的 VideoMAE\n- Microsoft Research 的 
X-CLIP\n- Microsoft Research 的 MarkupLM\n\n所有这些项目都让我受益匪浅。我强烈建议大家也为这个库贡献自己的 AI 算法！\n\n\n## 数据预处理\n关于为 PyTorch 模型准备数据，有几种选择：\n- 使用原生 PyTorch 数据集和数据加载器。这是为 PyTorch 模型准备数据的标准方法，即通过继承 `torch.utils.data.Dataset` 类，并创建相应的 `DataLoader`（一个可迭代对象，允许按批次循环遍历数据集中的样本）。在继承 `Dataset` 类时，需要实现三个方法：`__init__`、`__len__`（返回数据集的样本数量）和 `__getitem__`（根据整数索引返回数据集中的一个样本）。以下是一个创建基本文本分类数据集的示例（假设有一个包含两列“text”和“label”的 CSV 文件）：\n\n```python\nimport torch\nfrom torch.utils.data import Dataset\n\nclass CustomTrainDataset(Dataset):\n    def __init__(self, df, tokenizer):\n        self.df = df\n        self.tokenizer = tokenizer\n\n    def __len__(self):\n        return len(self.df)\n\n    def __getitem__(self, idx):\n        # 获取样本\n        item = self.df.iloc[idx]\n        text = item['text']\n        label = item['label']\n        # 对文本进行编码\n        encoding = self.tokenizer(text, padding=\"max_length\", max_length=128, truncation=True, return_tensors=\"pt\")\n        # 去掉分词器自动添加的批次维度\n        encoding = {k:v.squeeze() for k,v in encoding.items()}\n        # 以 \"labels\" 为键添加标签（Transformers 模型据此自动计算损失）\n        encoding[\"labels\"] = torch.tensor(label)\n        \n        return encoding\n```\n\n实例化数据集的方式如下：\n\n```python\nfrom transformers import BertTokenizer\nimport pandas as pd\n\ntokenizer = BertTokenizer.from_pretrained(\"bert-base-uncased\")\ndf = pd.read_csv(\"path_to_your_csv\")\n\ntrain_dataset = CustomTrainDataset(df=df, tokenizer=tokenizer)\n```\n\n访问数据集的第一个样本可以这样操作：\n\n```python\nencoding = train_dataset[0]\n```\n\n实际上，通常会创建一个对应的 `DataLoader` 来从数据集中获取批次数据：\n\n```python\nfrom torch.utils.data import DataLoader\n\ntrain_dataloader = DataLoader(train_dataset, batch_size=4, shuffle=True)\n```\n\n我经常通过从数据加载器中取出第一个批次，打印张量的形状、将 input_ids 解码回文本等方式来检查数据是否正确创建。\n\n```python\nbatch = next(iter(train_dataloader))\nfor k,v in batch.items():\n    print(k, v.shape)\n# 解码批次中第一个样本的 input_ids\nprint(tokenizer.decode(batch['input_ids'][0].tolist()))\n```\n- [HuggingFace Datasets](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002F)。Datasets 是 HuggingFace 提供的一个库，能够以非常快速且节省内存的方式轻松加载和处理数据。它基于 [Apache Arrow](https:\u002F\u002Farrow.apache.org\u002F) 构建，并具有诸如内存映射等强大功能，使得数据仅在需要时才会被加载到内存中。它与 [HuggingFace hub](https:\u002F\u002Fhuggingface.co\u002Fdatasets) 具有深度集成，因此可以轻松加载知名数据集，也可以将自己的数据集分享给社区。\n\n以 Dataset 对象的形式加载自定义数据集的方法如下（可以通过 `pip install datasets` 安装）：\n```python\nfrom datasets import load_dataset\n\ndataset = load_dataset('csv', data_files={'train': ['my_train_file_1.csv', 'my_train_file_2.csv'], 'test': 'my_test_file.csv'})\n```\n\n这里我加载的是本地的 CSV 文件，但该库也支持其他格式（包括 JSON、Parquet、txt），还可以直接从本地 Pandas 数据框或字典中加载数据。更多详细信息请参阅 [文档](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fdatasets\u002Floading.html#local-and-remote-files)。\n\n## 训练框架\n关于微调 Transformer 模型（或更一般地，PyTorch 模型），有几种选择：\n- 使用原生 PyTorch。这是最基础的训练方式，需要用户手动编写训练循环。优点是调试起来非常方便；缺点则是需要自己实现训练过程中的各种细节，比如设置模型模式（`model.train()`\u002F`model.eval()`）、管理设备分配（`model.to(device)`）等。典型的 PyTorch 训练循环如下所示（灵感来自一篇优秀的 PyTorch 入门教程）：\n\n```python\nimport torch\nfrom transformers import BertForSequenceClassification\n\n# 实例化一个带有随机初始化分类头的预训练 BERT 模型\nmodel = BertForSequenceClassification.from_pretrained(\"bert-base-uncased\")\n\n# 我在微调基于 Transformer 的模型时几乎总是使用 5e-5 的学习率\noptimizer = torch.optim.Adam(model.parameters(), lr=5e-5)\n\n# 训练轮数\nepochs = 3\n\n# 如果有可用的 GPU，则将模型放到 GPU 上\ndevice = torch.device(\"cuda\" if torch.cuda.is_available() else \"cpu\")\nmodel.to(device)\n\nfor epoch in range(epochs):\n    model.train()\n    train_loss = 0.0\n    for batch in train_dataloader:\n        # 将批次数据放到指定设备上\n        batch = {k:v.to(device) for k,v in batch.items()}\n        \n        # 前向传播（batch 中包含 \"labels\" 时会自动返回 loss）\n        outputs = model(**batch)\n        loss = outputs.loss\n        \n        train_loss += loss.item()\n        \n        loss.backward()\n        optimizer.step()\n        optimizer.zero_grad()\n\n    print(f\"第 {epoch} 个 epoch 后的训练损失: {train_loss\u002Flen(train_dataloader)}\")\n    \n    model.eval()\n    val_loss = 0.0\n    with torch.no_grad():\n        # eval_dataloader 的构建方式与上文的 train_dataloader 相同\n        for batch in eval_dataloader:\n            # 将批次数据放到指定设备上\n            batch = {k:v.to(device) for k,v in batch.items()}\n            \n            # 前向传播\n            outputs = model(**batch)\n            loss = outputs.loss\n            \n            val_loss += loss.item()\n                  \n    print(f\"第 {epoch} 个 epoch 后的验证损失: {val_loss\u002Flen(eval_dataloader)}\")\n```\n\n- [PyTorch Lightning (PL)](https:\u002F\u002Fwww.pytorchlightning.ai\u002F)。PyTorch Lightning 是一个框架，它通过将训练循环抽象为一个 `Trainer` 对象来自动化上述训练过程。用户不再需要自己编写训练循环，只需创建 `trainer = Trainer()` 并调用 `trainer.fit(model)` 即可。其优点是能够非常快速地开始训练模型（这也是“Lightning”名称的由来），因为所有与训练相关的代码都由 `Trainer` 对象处理；缺点则是调试可能更加困难，因为训练和评估过程被抽象掉了（本节末尾附有一个最小示例）。\n  \n- [HuggingFace Trainer](https:\u002F\u002Fhuggingface.co\u002Ftransformers\u002Fmain_classes\u002Ftrainer.html)。HuggingFace 的 Trainer API 可以被视为类似于 PyTorch Lightning 的框架，因为它同样使用 `Trainer` 对象来抽象训练过程。然而，与 PyTorch Lightning 不同的是，HuggingFace Trainer 并不是一个通用框架，而是专门为微调 HuggingFace Transformers 库中提供的基于 Transformer 的模型而设计的。此外，Trainer 还提供了一个名为 `Seq2SeqTrainer` 的扩展，用于编码器-解码器架构的模型，例如 BART、T5 和 `EncoderDecoderModel` 类。需要注意的是，Transformers 库中的所有 [PyTorch 示例脚本](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers\u002Ftree\u002Fmaster\u002Fexamples\u002Fpytorch) 都使用了 Trainer。\n\n- [HuggingFace Accelerate](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Faccelerate)：Accelerate 旨在帮助那些仍然希望自行编写训练循环（如上文所示）的用户，让同样的循环代码能够在不同硬件环境下自动运行，例如多 GPU、TPU Pod、混合精度等（见本节末尾的示例草图）。
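\n\n下面是一个用 PyTorch Lightning 改写上述 BERT 微调的最小骨架。这只是一个示意性草图：假设沿用上文定义的 `train_dataloader`，类名 `BertClassifier` 与各超参数均为演示用假设，并非本仓库某个 Notebook 的原始实现：\n\n```python\nimport torch\nimport pytorch_lightning as pl\nfrom transformers import BertForSequenceClassification\n\nclass BertClassifier(pl.LightningModule):\n    def __init__(self, lr=5e-5):\n        super().__init__()\n        self.model = BertForSequenceClassification.from_pretrained(\"bert-base-uncased\")\n        self.lr = lr\n\n    def training_step(self, batch, batch_idx):\n        # batch 中包含 \"labels\" 时，前向传播会自动返回 loss\n        outputs = self.model(**batch)\n        self.log(\"train_loss\", outputs.loss)\n        return outputs.loss\n\n    def configure_optimizers(self):\n        return torch.optim.Adam(self.parameters(), lr=self.lr)\n\n# 设备分配、训练循环等全部由 Trainer 接管\ntrainer = pl.Trainer(max_epochs=3)\ntrainer.fit(BertClassifier(), train_dataloaders=train_dataloader)\n```\n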
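\n同样地，下面是一个用 Accelerate 改写上文原生训练循环的最小草图（假设复用上文定义的 `model`、`optimizer` 和 `train_dataloader`；注意不再需要手动调用 `.to(device)`）：\n\n```python\nfrom accelerate import Accelerator\n\naccelerator = Accelerator()  # 自动检测可用硬件（CPU、GPU 或 TPU）\n\n# prepare() 会把模型、优化器和数据加载器放到正确的设备上\nmodel, optimizer, train_dataloader = accelerator.prepare(model, optimizer, train_dataloader)\n\nmodel.train()\nfor epoch in range(3):\n    for batch in train_dataloader:\n        outputs = model(**batch)\n        loss = outputs.loss\n        accelerator.backward(loss)  # 代替 loss.backward()，以便支持混合精度等特性\n        optimizer.step()\n        optimizer.zero_grad()\n```\n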
\n\n## 引用\n如果您使用了我的教程，欢迎引用 :)\n\n```bibtex\n@misc{rogge2025transformerstutorials,\n  author = {Rogge, Niels},\n  title = {Transformers-Tutorials},\n  url = {https:\u002F\u002Fgithub.com\u002FNielsRogge\u002FTransformers-Tutorials},\n  year = {2025}\n}\n```","# Transformers-Tutorials 快速上手指南\n\n本仓库提供了基于 🤗 Hugging Face `transformers` 库的一系列实战演示（Demo），涵盖音频、文本、图像及多模态任务。所有示例均使用 **PyTorch** 实现，并配有可直接运行的 Colab 笔记本。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux, macOS 或 Windows\n*   **Python 版本**: 3.7 或更高\n*   **深度学习框架**: PyTorch (推荐安装最新稳定版)\n*   **硬件建议**: 部分模型（如 GPT-J-6B, BEiT）推理或微调需要 GPU 支持；轻量级模型可在 CPU 上运行。\n\n## 安装步骤\n\n推荐使用 `pip` 进行安装。为了获得更快的下载速度，国内用户建议使用清华或阿里镜像源。\n\n### 1. 安装核心依赖\n```bash\npip install transformers datasets accelerate torch torchvision torchaudio --index-url https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 2. 安装可选依赖（针对特定任务）\n部分演示（如音频处理、文档解析）可能需要额外库：\n```bash\n# 音频处理 (AST, Wav2Vec2 等)\npip install librosa soundfile --index-url https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n\n# 文档智能与 OCR (Donut, LayoutLM 等)\npip install pillow opencv-python --index-url https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n\n# 训练加速与监控 (可选)\npip install pytorch-lightning tensorboard --index-url https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n
### 3. 克隆本教程仓库\n获取所有示例代码和 Notebook：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FNielsRogge\u002FTransformers-Tutorials.git\ncd Transformers-Tutorials\n```\n\n## 基本使用\n\n本仓库的核心内容是各个模型目录下的 `.ipynb` (Jupyter Notebook) 文件。您可以直接在本地 Jupyter Lab 中打开，或点击 README 中的 \"Open In Colab\" 按钮在云端免费运行。\n\n### 示例：使用 BERT 进行自定义命名实体识别 (NER)\n\n以下展示如何加载预训练模型并进行微调的基本流程（对应 `BERT\u002FCustom_Named_Entity_Recognition...` 示例）：\n\n**1. 导入库与加载数据**\n```python\nfrom transformers import BertTokenizerFast, BertForTokenClassification\nfrom datasets import load_dataset\n\n# 加载数据集 (以 CoNLL-2003 为例)\ndataset = load_dataset(\"conll2003\")\n\n# 加载分词器\ntokenizer = BertTokenizerFast.from_pretrained(\"bert-base-cased\")\n```\n\n**2. 数据预处理**\n```python\ndef tokenize_and_align_labels(examples):\n    tokenized_inputs = tokenizer(examples[\"tokens\"], truncation=True, is_split_into_words=True)\n    # 此处省略标签对齐逻辑，详见 Notebook 源码\n    return tokenized_inputs\n\ntokenized_dataset = dataset.map(tokenize_and_align_labels, batched=True)\n```\n\n**3. 定义模型与训练参数**\n```python\nmodel = BertForTokenClassification.from_pretrained(\"bert-base-cased\", num_labels=9)\n\nfrom transformers import TrainingArguments\n\ntraining_args = TrainingArguments(\n    output_dir=\".\u002Fresults\",\n    evaluation_strategy=\"epoch\",\n    learning_rate=2e-5,\n    per_device_train_batch_size=16,\n    num_train_epochs=3,\n)\n```\n\n**4. 启动训练**\n```python\nfrom transformers import Trainer, DataCollatorForTokenClassification\n\n# 数据整理器：按批动态填充输入与标签，使变长样本可以组成批次\ndata_collator = DataCollatorForTokenClassification(tokenizer)\n\ntrainer = Trainer(\n    model=model,\n    args=training_args,\n    train_dataset=tokenized_dataset[\"train\"],\n    eval_dataset=tokenized_dataset[\"validation\"],\n    data_collator=data_collator,\n)\n\ntrainer.train()\n```\n\n### 其他热门任务入口\n*   **图像分类**: 查看 `ConvNeXT` 或 `BEiT` 文件夹。\n*   **目标检测**: 查看 `DETR` 或 `Conditional DETR` 文件夹。\n*   **文档智能**: 查看 `Donut` 文件夹（支持文档分类、VQA 及解析）。\n*   **零样本分割**: 查看 `CLIPSeg` 或 `GroupViT` 文件夹。\n\n> **提示**: 对于每个具体模型，请直接进入对应的子文件夹运行 `.ipynb` 文件，其中包含了从数据加载、预处理、微调到推理评估的完整代码。","一家电商初创公司的算法团队急需构建一个能自动识别商品评论中品牌名称与负面情感的智能监控系统，以快速响应舆情。\n\n### 没有 Transformers-Tutorials 时\n- 团队成员面对 Hugging Face 庞大的模型库无从下手，需花费数天查阅零散文档才能理解 BERT 或 CANINE 等架构的输入输出格式。\n- 在尝试微调模型进行多标签情感分类或命名实体识别（NER）时，因缺乏标准代码参考，频繁陷入数据预处理对齐和维度报错的调试泥潭。\n- 想要复现论文中的先进模型（如 AST 音频分类或 CLIPSeg 图像分割）时，必须从头编写复杂的推理逻辑，开发周期被严重拉长。\n- 缺乏针对自定义数据集（如特定气球检测或影评数据）的微调范例，导致团队在数据加载器和损失函数设计上反复试错。\n\n### 使用 Transformers-Tutorials 后\n- 直接复用仓库中现成的 BERT 多标签分类和 NER 微调脚本，团队在几小时内便跑通了基准模型，大幅降低了入门门槛。\n- 参照详细的 Colab 笔记本步骤，快速解决了自定义数据集的格式转换问题，将原本需要数天的调试工作压缩至半天完成。\n- 利用提供的 AST 音频分类和 CLIPSeg 零样本分割演示代码，成功将业务场景从纯文本扩展至音视频多模态分析，无需重新造轮子。\n- 基于 Conditional DETR 在自定义物体检测上的微调案例，迅速完成了对特定商品瑕疵的检测模型部署，显著提升了迭代效率。\n\nTransformers-Tutorials 通过提供高质量、可执行的端到端演示，将算法工程师从繁琐的代码搭建中解放出来，使其能专注于业务逻辑与模型优化。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FNielsRogge_Transformers-Tutorials_e4e814e4.png","NielsRogge",null,"https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FNielsRogge_f494484e.jpg","ML @HuggingFace. Interested in deep learning, NLP. 
\r\n\r\nContributed 40+ models to HuggingFace Transformers","HuggingFace","Belgium","nielsrogge.github.io","https:\u002F\u002Fgithub.com\u002FNielsRogge",[85,89],{"name":86,"color":87,"percentage":88},"Jupyter Notebook","#DA5B0B",100,{"name":90,"color":91,"percentage":92},"Python","#3572A5",0,11554,1719,"2026-04-02T13:17:37","MIT","未说明","非绝对必需（部分演示支持 CPU），但微调大型模型（如 GPT-J-6B, Donut）建议使用 NVIDIA GPU；具体显存和 CUDA 版本取决于所选模型，Colab 环境通常可用","未说明（取决于具体模型，大型模型如 GPT-J-6B 需要大量内存）",{"notes":101,"python":97,"dependencies":102},"该项目主要为 Google Colab 笔记本集合，所有演示均基于 PyTorch 实现。建议直接在 Colab 中运行以避免本地环境配置问题。若本地运行，需安装 Hugging Face 生态系统相关库（Transformers, Datasets, Accelerate 等）。部分高级演示（如 Donut, LUKE）依赖 PyTorch Lightning。具体硬件需求因模型架构差异巨大（从轻量级 BERT 到 6B 参数的 GPT-J）。",[103,104,105,106,107,108],"transformers","torch","datasets","tokenizers","accelerate","pytorch-lightning",[13,26],[103,111,112,113,114,115],"pytorch","bert","vision-transformer","layoutlm","gpt-2",4,"2026-03-27T02:49:30.150509","2026-04-06T09:43:24.283885",[120,125,130,135,140,145],{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},10862,"如何在自定义数据集上训练 Mask2Former？","你需要准备 COCO 格式的 JSON 文件（包含训练集和验证集的标注）以及存放所有图像的文件夹。虽然该模型基于 Detectron2 构建，但在 Transformers 库中使用时，需要确保数据集注册方式符合 Hugging Face 的格式。建议参考实例分割 COCO 数据集的标注方式（通常包含约 1000 个类别）。如果遇到问题，可以检查是否可以使用掩码图像的每个通道来标注 256 个类别，三个通道组合可支持约 1000 个类别。","https:\u002F\u002Fgithub.com\u002FNielsRogge\u002FTransformers-Tutorials\u002Fissues\u002F243",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},10863,"如何使用本地目录中的自定义数据集训练 LayoutLMv3？","你需要为每个文档准备一个包含单词列表、对应的边界框（boxes）和标签（labels）的数据结构。示例格式如下：\nwords = [\"hello\", \"world\", \"invoice\", \"number\"]\nboxes = [[x1, y1, x2, y2], ...] # 对应每个单词的坐标\nword_labels = [\"other\", \"other\", \"invoice_number\", \"invoice_number\"]\n如果使用 Label Studio 进行标注，确保导出的 JSON 中包含具体的文本内容，而不仅仅是边界框和标签类别。若缺少文本，需调整标注模板或在后处理中将 OCR 识别的文本与边界框匹配。","https:\u002F\u002Fgithub.com\u002FNielsRogge\u002FTransformers-Tutorials\u002Fissues\u002F123",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},10864,"如何使用 LayoutLM 从扫描发票中提取键值对（Key-Value Pairs）？","对于复杂的文档（如混乱的发票），推荐使用生成式模型（如 Donut 或 PaliGemma）代替 LayoutLM。这些模型可以直接根据文档图像生成 JSON 格式的键值对，处理起来更简单。可以参考官方提供的微调 notebook：https:\u002F\u002Fgithub.com\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FPaliGemma\u002FFine_tune_PaliGemma_for_image_%3EJSON.ipynb。如果必须使用 LayoutLM，则需要结合关系提取模块，但这通常比生成式方法更复杂且容易出错。","https:\u002F\u002Fgithub.com\u002FNielsRogge\u002FTransformers-Tutorials\u002Fissues\u002F144",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},10865,"MaskFormer 能否用于实例分割（Instance Segmentation）？","可以。虽然部分示例笔记本展示的是语义分割，但 MaskFormer 同样支持实例分割。你可以使用专门针对实例分割数据集（如 ADE20k full）微调的笔记本作为参考：https:\u002F\u002Fgithub.com\u002FNielsRogge\u002FTransformers-Tutorials\u002Fblob\u002Fmaster\u002FMaskFormer\u002FFine-tuning\u002FFine_tune_MaskFormer_on_an_instance_segmentation_dataset_(ADE20k_full).ipynb。在代码实现上，需注意 MaskFormerImageProcessor 与 DETR 的特征提取器有所不同，确保数据加载时包含必要的 'labels' 字段以避免 KeyError。","https:\u002F\u002Fgithub.com\u002FNielsRogge\u002FTransformers-Tutorials\u002Fissues\u002F227",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},10866,"在使用 LayoutLMv2 进行自定义令牌分类后，如何从未见过的图像中提取特定标签的文本？","首先，模型会输出预测的标签和对应的边界框。要提取特定标签的文本，需要将模型输出的边界框与外部 OCR 引擎（如 EasyOCR）识别出的文本进行匹配。具体步骤包括：\n1. 对 OCR 获取的边界框进行归一化处理，使其与模型输入图像的分辨率一致。\n2. 将 OCR 结果整理为列表形式：[[[x1, y1, w, h], \"text_token\"], ...]。\n3. 
遍历模型预测的边界框，找到与之匹配的 OCR 文本块。\n示例代码逻辑：遍历预测结果，如果标签是目标标签（如 \"question\"），则在 OCR 结果中寻找坐标重叠的单词并提取文本。","https:\u002F\u002Fgithub.com\u002FNielsRogge\u002FTransformers-Tutorials\u002Fissues\u002F154",{"id":146,"question_zh":147,"answer_zh":148,"source_url":129},10867,"如何在 Label Studio 中为 LayoutLM 系列模型准备包含文本内容的标注文件？","Label Studio 默认导出的 JSON 可能只包含边界框坐标和标签类别，缺少具体的文本内容。为了解决这个问题，你有两种选择：\n1. 调整 Label Studio 的标注模板，确保在标注时手动输入或自动关联对应的文本字段。\n2. 在后处理阶段，使用 OCR 工具读取原始图像中的文本，然后根据标注 JSON 中的边界框坐标，将识别出的文本映射到对应的标签条目中。最终生成的数据结构应包含 words（文本）、boxes（坐标）和 word_labels（类别）三个列表。",[]]