[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-facebookresearch--hiera":3,"tool-facebookresearch--hiera":62},[4,18,26,35,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,2,"2026-04-10T11:39:34",[14,15,13],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":32,"last_commit_at":41,"category_tags":42,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[43,13,15,14],"插件",{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":10,"last_commit_at":50,"category_tags":51,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 
API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[52,15,13,14],"语言模型",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[14,15,13,61],"视频",{"id":63,"github_repo":64,"name":65,"description_en":66,"description_zh":67,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":77,"owner_email":77,"owner_twitter":77,"owner_website":78,"owner_url":79,"languages":80,"stars":85,"forks":86,"last_commit_at":87,"license":88,"difficulty_score":89,"env_os":76,"env_gpu":90,"env_ram":91,"env_deps":92,"category_tags":98,"github_topics":77,"view_count":10,"oss_zip_url":77,"oss_zip_packed_at":77,"status":17,"created_at":99,"updated_at":100,"faqs":101,"releases":137},3395,"facebookresearch\u002Fhiera","hiera","Hiera: A fast, powerful, and simple hierarchical vision transformer.","Hiera 是一款由 Meta 研究团队推出的分层视觉 Transformer 模型，专为图像和视频理解任务设计。它在保持架构极度简洁的同时，实现了比当前最先进模型更快的推理速度和更高的准确率。\n\n传统视觉 Transformer（如 ViT）通常在整个网络中维持固定的空间分辨率和特征数量，导致计算资源分配不够高效：早期层不需要过多特征，而后期层则无需过高分辨率。虽然此前的分层模型（如 Swin、MViT）试图解决这一问题，但往往通过堆叠复杂模块来弥补结构偏差，反而拖慢了整体速度。Hiera 
的创新之处在于摒弃了这些“花哨”的设计，采用简单的分层结构，让模型在训练过程中自动学习不同阶段所需的空间表示，从而在保证性能的同时大幅降低计算开销。\n\n这款工具特别适合从事计算机视觉研究的科研人员、需要高效部署模型的开发者，以及希望在不牺牲精度的前提下提升推理速度的工程团队。凭借其清晰的代码实现、对 PyTorch Hub 和 Hugging Face 的良好支持，Hiera 为构建快速且强大的视觉系统提供了一个可靠而优雅的基础选择。","# Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles\n\n[![Torch Hub Support](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftorch_hub-gray?logo=pytorch)](#torch-hub)\n[![HF Hub Support](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97_huggingface_hub-gray)](#hugging-face-hub)\n[![Torch Hub Support](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPyPI-gray?logo=pypi&logoColor=lightblue)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fhiera-transformer\u002F)\n[![Python 3.6](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.8+-blue.svg)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F)\n[![Github Release](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002Ffacebookresearch\u002Fhiera.svg)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera\u002Freleases)\n[![Code License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcode_license-Apache_2.0-olive)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0)\n[![Model License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fmodel_zoo_license-CC_BY--NC_4.0-lightgrey)](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc\u002F4.0\u002Fdeed.en)\n\nThis is the official implementation for our ICML 2023 Oral paper:  \n**[Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles][arxiv-link]**  \n[Chaitanya Ryali](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=4LWx24UAAAAJ)\\*,\n[Yuan-Ting Hu](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=aMpbemkAAAAJ)\\*,\n[Daniel Bolya](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=K3ht_ZUAAAAJ)\\*,\n[Chen Wei](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=LHQGpBUAAAAJ),\n[Haoqi 
Fan](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=76B8lrgAAAAJ),\n[Po-Yao Huang](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=E8K25LIAAAAJ),\n[Vaibhav Aggarwal](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=Qwm6ZOYAAAAJ),\n[Arkabandhu Chowdhury](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=42v1i_YAAAAJ),\n[Omid Poursaeed](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=Ugw9DX0AAAAJ),\n[Judy Hoffman](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=mqpjAt4AAAAJ),\n[Jitendra Malik](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=oY9R5YQAAAAJ),\n[Yanghao Li](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=-VgS8AIAAAAJ)\\*,\n[Christoph Feichtenhofer](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=UxuqG1EAAAAJ)\\*  \n_[ICML '23 Oral][icml-link]_ | _[GitHub](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera)_ | _[arXiv][arxiv-link]_ | _[BibTeX](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera#citation)_\n\n\\*: Equal contribution.\n\n## What is Hiera?\n**Hiera** is a _hierarchical_ vision transformer that is fast, powerful, and, above all, _simple_. It outperforms the state-of-the-art across a wide array of image and video tasks _while being much faster_. \n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_hiera_readme_50cf2f70418a.png\" width=\"75%\">\n\u003C\u002Fp>\n\n## How does it work?\n![A diagram of Hiera's architecture.](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_hiera_readme_89ea361c59c2.png)\n\nVision transformers like [ViT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.11929) use the same spatial resolution and number of features throughout the whole network. But this is inefficient: the early layers don't need that many features, and the later layers don't need that much spatial resolution. 
Prior hierarchical models like [ResNet](https:\u002F\u002Farxiv.org\u002Fabs\u002F1512.03385) accounted for this by using fewer features at the start and less spatial resolution at the end.\n\nSeveral domain-specific vision transformers have been introduced that employ this hierarchical design, such as [Swin](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.14030) or [MViT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.11227). But in the pursuit of state-of-the-art results using fully supervised training on ImageNet-1K, these models have become more and more complicated as they add specialized modules to make up for spatial biases that ViTs lack. While these changes produce effective models with attractive FLOP counts, under the hood the added complexity makes these models _slower_ overall.\n\nWe show that a lot of this bulk is actually _unnecessary_. Rather than manually adding spatial biases through architectural changes, we opt to _teach_ the model these biases instead. By training with [MAE](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.06377), we can simplify or remove _all_ of these bulky modules in existing transformers and _increase accuracy_ in the process. The result is Hiera, an extremely efficient and simple architecture that outperforms the state-of-the-art in several image and video recognition tasks.\n\n## News\n - **[2024.03.02]** License for the code has been made more permissive (Apache 2.0)! 
Model license remains unchanged.\n - **[2023.06.12]** Added more in1k models and some video examples, see inference.ipynb (v0.1.1).\n - **[2023.06.01]** Initial release.\n\nSee the [changelog](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera\u002Ftree\u002Fmain\u002FCHANGELOG.md) for more details.\n\n## Installation\n\nHiera requires a reasonably recent version of [torch](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Flocally\u002F).\nAfter that, you can install hiera through [pip](https:\u002F\u002Fpypi.org\u002Fproject\u002Fhiera-transformer\u002F):\n```bash\npip install hiera-transformer\n```\nThis repo _should_ support the latest timm version, but timm is a constantly updating package. Create an issue if you have problems with a newer version of timm.\n\n### Installing from Source\n\nIf using [torch hub](#model-zoo), you don't need to install the `hiera` package. But, if you'd like to develop using hiera, it could be a good idea to install it from source:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera.git\ncd hiera\npython setup.py build develop\n```\n\n\n## Model Zoo\nNote that model weights are released under a separate license than the code. See the [model license](LICENSE.models) for more details.\n\n### Torch Hub\n\nHere we provide model checkpoints for Hiera. Each model listed is accessible on [torch hub](https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Fhub.html) even without the `hiera-transformer` package installed, e.g. the following initializes a base model pretrained and finetuned on ImageNet-1k:\n```py\nmodel = torch.hub.load(\"facebookresearch\u002Fhiera\", model=\"hiera_base_224\", pretrained=True, checkpoint=\"mae_in1k_ft_in1k\")\n```\n\nIf you want a model with MAE pretraining only, you can replace the checkpoint with `\"mae_in1k\"`. 
Additionally, if you'd like to load the MAE decoder as well (e.g., to continue pretraining), add `mae_` to the start of the model name, e.g.:\n```py\nmodel = torch.hub.load(\"facebookresearch\u002Fhiera\", model=\"mae_hiera_base_224\", pretrained=True, checkpoint=\"mae_in1k\")\n```\n**Note:** Our MAE models were trained with a _normalized pixel loss_. That means that the patches were normalized before the network had to predict them. If you want to visualize the predictions, you'll have to unnormalize them using the visible patches (which might work but wouldn't be perfect) or unnormalize them using the ground truth. For more model names and corresponding checkpoint names see below.\n\n### Hugging Face Hub\n\nThis repo also has [🤗 hub](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fhub\u002Findex) support. With the `hiera-transformer` and `huggingface-hub` packages installed, you can simply run, e.g.,\n```py\nfrom hiera import Hiera\nmodel = Hiera.from_pretrained(\"facebook\u002Fhiera_base_224.mae_in1k_ft_in1k\")  # mae pt then in1k ft'd model\nmodel = Hiera.from_pretrained(\"facebook\u002Fhiera_base_224.mae_in1k\") # just mae pt, no ft\n```\nto load a model. 
Use `\u003Cmodel_name>.\u003Ccheckpoint_name>` from model zoo below.\n\nIf you want to save a model, use `model.config` as the config, e.g.,\n```py\nmodel.save_pretrained(\"hiera-base-224\", config=model.config)\n```\n\n### Image Models\n| Model    | Model Name            | Pretrained Models\u003Cbr>(IN-1K MAE) | Finetuned Models\u003Cbr>(IN-1K Supervised) | IN-1K\u003Cbr>Top-1 (%) | A100 fp16\u003Cbr>Speed (im\u002Fs) |\n|----------|-----------------------|----------------------------------|----------------------------------------|:------------------:|:-------------------------:|\n| Hiera-T  | `hiera_tiny_224`      | [mae_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_tiny_224.pth)        | [mae_in1k_ft_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_tiny_224.pth)       |       82.8         |            2758           |\n| Hiera-S  | `hiera_small_224`     | [mae_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_small_224.pth)       | [mae_in1k_ft_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_small_224.pth)      |       83.8         |            2211           |\n| Hiera-B  | `hiera_base_224`      | [mae_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_base_224.pth)        | [mae_in1k_ft_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_base_224.pth)       |       84.5         |            1556           |\n| Hiera-B+ | `hiera_base_plus_224` | [mae_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_base_plus_224.pth)   | [mae_in1k_ft_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_base_plus_224.pth)  |       85.2         |            1247           |\n| Hiera-L  | `hiera_large_224`     | [mae_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_large_224.pth)       | [mae_in1k_ft_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_large_224.pth)      |  
     86.1         |            531            |\n| Hiera-H  | `hiera_huge_224`      | [mae_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_huge_224.pth)        | [mae_in1k_ft_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_huge_224.pth)       |       86.9         |            274            |\n\nEach model inputs a 224x224 image.\n### Video Models\n| Model    | Model Name               | Pretrained Models\u003Cbr>(K400 MAE) | Finetuned Models\u003Cbr>(K400) | K400 (3x5 views)\u003Cbr>Top-1 (%) | A100 fp16\u003Cbr>Speed (clip\u002Fs) |\n|----------|--------------------------|---------------------------------|----------------------------|:-----------------------------:|:---------------------------:|\n| Hiera-B  | `hiera_base_16x224`      | [mae_k400](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_base_16x224.pth)       | [mae_k400_ft_k400](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_base_16x224.pth)      |              84.0             |            133.6            |\n| Hiera-B+ | `hiera_base_plus_16x224` | [mae_k400](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_base_plus_16x224.pth)  | [mae_k400_ft_k400](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_base_plus_16x224.pth) |              85.0             |             84.1            |\n| Hiera-L  | `hiera_large_16x224`     | [mae_k400](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_large_16x224.pth)      | [mae_k400_ft_k400](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_large_16x224.pth)     |              87.3             |             40.8            |\n| Hiera-H  | `hiera_huge_16x224`      | [mae_k400](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_huge_16x224.pth)       | [mae_k400_ft_k400](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_huge_16x224.pth)      |              87.8             |             20.9            
|\n\nEach model inputs 16 224x224 frames with a temporal stride of 4.\n\n**Note:** the speeds listed here were benchmarked _without_ PyTorch's optimized [scaled dot product attention](https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Fgenerated\u002Ftorch.nn.functional.scaled_dot_product_attention.html). If using PyTorch 2.0 or above, your inference speed will probably be faster than what's listed here.\n\n## Usage\n\nThis repo implements the code to run Hiera models for inference. This repository is still in progress. Here's what we currently have available and what we have planned:\n\n - [x] Image Inference\n    - [x] MAE implementation\n - [x] Video Inference\n    - [x] MAE implementation\n - [x] Full Model Zoo\n - [ ] Training scripts\n\n\nSee [examples](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera\u002Ftree\u002Fmain\u002Fexamples) for examples of how to use Hiera.\n\n### Inference\n\nSee [examples\u002Finference](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera\u002Fblob\u002Fmain\u002Fexamples\u002Finference.ipynb) for an example of how to prepare the data for inference.\n\nInstantiate a model with either [torch hub](#model-zoo) or [🤗 hub](#model-zoo) or by [installing hiera](#installing-from-source) and running:\n```py\nimport hiera\nmodel = hiera.hiera_base_224(pretrained=True, checkpoint=\"mae_in1k_ft_in1k\")\n```\nThen you can run inference like any other model:\n```py\noutput = model(x)\n```\nVideo inference works the same way, just use a `16x224` model instead.\n\n**Note**: for efficiency, Hiera re-orders its tokens at the start of the network (see the `Roll` and `Unroll` modules in `hiera_utils.py`). Thus, tokens _aren't in spatial order_ by default. 
If you'd like to use intermediate feature maps for a downstream task, pass the `return_intermediates` flag when running the model:\n```py\noutput, intermediates = model(x, return_intermediates=True)\n```\n\n#### MAE Inference\nBy default, the models do not include the MAE decoder. If you would like to use the decoder or compute MAE loss, you can instantiate an MAE version by running:\n```py\nimport hiera\nmodel = hiera.mae_hiera_base_224(pretrained=True, checkpoint=\"mae_in1k\")\n```\nThen when you run inference on the model, it will return a 4-tuple of `(loss, predictions, labels, mask)` where predictions and labels are for the _deleted tokens_ only. The returned mask will be `True` if the token is visible and `False` if it's deleted. You can change the masking ratio by passing it during inference:\n```py\nloss, preds, labels, mask = model(x, mask_ratio=0.6)\n```\nThe default mask ratio is `0.6` for images, but you should pass in `0.9` for video. See the paper for details.\n\n**Note:** We use _normalized pixel targets_ for MAE pretraining, meaning the patches are each individually normalized before the model has to predict them. Thus, you have to unnormalize them using the ground truth before visualizing them. See `get_pixel_label_2d` in `hiera_mae.py` for details.\n\n### Benchmarking\nWe provide a script for easy benchmarking. See [examples\u002Fbenchmark](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera\u002Fblob\u002Fmain\u002Fexamples\u002Fbenchmark.ipynb) to see how to use it.\n\n#### Scaled Dot Product Attention\nPyTorch 2.0 introduced optimized [scaled dot product attention](https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Fgenerated\u002Ftorch.nn.functional.scaled_dot_product_attention.html), which can speed up transformers quite a bit. We didn't use this in our original benchmarking, but since it's a free speed-up this repo will automatically use it if available. 
To get its benefits, make sure your torch version is 2.0 or above.\n\n### Training\n\nComing soon.\n\n\n## Citation\nIf you use Hiera or this code in your work, please cite:\n```\n@article{ryali2023hiera,\n  title={Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles},\n  author={Ryali, Chaitanya and Hu, Yuan-Ting and Bolya, Daniel and Wei, Chen and Fan, Haoqi and Huang, Po-Yao and Aggarwal, Vaibhav and Chowdhury, Arkabandhu and Poursaeed, Omid and Hoffman, Judy and Malik, Jitendra and Li, Yanghao and Feichtenhofer, Christoph},\n  journal={ICML},\n  year={2023}\n}\n```\n\n### License\nThe code for this work is licensed under the [Apache License, Version 2.0](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0), while the model weights are licensed under the [Creative Commons Attribution-NonCommercial 4.0 International License](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc\u002F4.0\u002F).\n\nSee [LICENSE](LICENSE) for more details on the code license, and [LICENSE.models](LICENSE.models) for more details on the model weight license.\n\n### Contributing\nSee [contributing](CONTRIBUTING.md) and the [code of conduct](CODE_OF_CONDUCT.md).\n\n[arxiv-link]: https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.00989\u002F\n[icml-link]: https:\u002F\u002Ficml.cc\u002FConferences\u002F2023\n","# Hiera：一款去繁从简的层次化视觉Transformer\n\n[![Torch Hub 支持](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftorch_hub-gray?logo=pytorch)](#torch-hub)\n[![HF Hub 支持](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97_huggingface_hub-gray)](#hugging-face-hub)\n[![PyPI 支持](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPyPI-gray?logo=pypi&logoColor=lightblue)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fhiera-transformer\u002F)\n[![Python 3.6](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.8+-blue.svg)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F)\n[![GitHub 
发布](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002Ffacebookresearch\u002Fhiera.svg)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera\u002Freleases)\n[![代码许可证](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcode_license-Apache_2.0-olive)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0)\n[![模型许可证](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fmodel_zoo_license-CC_BY--NC_4.0-lightgrey)](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc\u002F4.0\u002Fdeed.en)\n\n这是我们ICML 2023口头报告论文的官方实现：  \n**[Hiera：一款去繁从简的层次化视觉Transformer][arxiv-link]**  \n[Chaitanya Ryali](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=4LWx24UAAAAJ)\\*,\n[Yuan-Ting Hu](https:\u002F\u002Fscholar.google.com\u002Fcitations?user=aMpbemkAAAAJ)\\*,\n[Daniel Bolya](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=K3ht_ZUAAAAJ)\\*,\n[Chen Wei](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=LHQGpBUAAAAJ),\n[Haoqi Fan](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=76B8lrgAAAAJ),\n[Po-Yao Huang](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=E8K25LIAAAAJ),\n[Vaibhav Aggarwal](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=Qwm6ZOYAAAAJ),\n[Arkabandhu Chowdhury](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=42v1i_YAAAAJ),\n[Omid Poursaeed](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=Ugw9DX0AAAAJ),\n[Judy Hoffman](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=mqpjAt4AAAAJ),\n[Jitendra Malik](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=oY9R5YQAAAAJ),\n[Yanghao Li](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=-VgS8AIAAAAJ)\\*,\n[Christoph Feichtenhofer](https:\u002F\u002Fscholar.google.com\u002Fcitations?hl=en&user=UxuqG1EAAAAJ)\\*  \n_[ICML '23 口头报告][icml-link]_ | _[GitHub](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera)_ | _[arXiv][arxiv-link]_ | 
_[BibTeX](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera#citation)_\n\n\\*: 共同第一作者。\n\n## 什么是Hiera？\n**Hiera** 是一种快速、强大且最重要的是**简单**的**层次化**视觉Transformer。它在广泛的图像和视频任务中均优于当前最先进水平，同时速度也快得多。\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_hiera_readme_50cf2f70418a.png\" width=\"75%\">\n\u003C\u002Fp>\n\n## 它是如何工作的？\n![Hiera的架构示意图。](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_hiera_readme_89ea361c59c2.png)\n\n像[ViT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2010.11929)这样的视觉Transformer在整个网络中使用相同的空间分辨率和特征数量。但这样做效率不高：早期层并不需要那么多特征，而后期层也不需要那么高的空间分辨率。先前的层次化模型，如[ResNet](https:\u002F\u002Farxiv.org\u002Fabs\u002F1512.03385)，通过在开始时使用较少的特征、在结束时使用较低的空间分辨率来解决这一问题。\n\n已经出现了一些采用这种层次化设计的特定领域视觉Transformer，例如[Swin](https:\u002F\u002Farxiv.org\u002Fabs\u002F2103.14030)或[MViT](https:\u002F\u002Farxiv.org\u002Fabs\u002F2104.11227)。然而，在追求ImageNet-1K上完全监督训练下的最先进结果的过程中，这些模型为了弥补ViT缺乏的空间偏差，不断增加专门模块，导致模型越来越复杂。尽管这些改进确实产生了效果良好、浮点运算量诱人的模型，但从底层来看，增加的复杂性反而使这些模型整体变得更慢。\n\n我们证明，许多这样的冗余部分其实是**不必要的**。与其通过架构调整手动添加空间偏置，不如直接让模型自己学习这些偏置。通过使用[MAE](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.06377)进行训练，我们可以简化或移除现有Transformer中的所有这些复杂模块，并在此过程中**提高精度**。最终得到的就是Hiera——一种极其高效且简单的架构，它在多项图像和视频识别任务中都超越了当前最先进水平。\n\n## 最新消息\n - **[2024.03.02]** 代码许可证已放宽至更宽松的Apache 2.0！模型许可证保持不变。\n - **[2023.06.12]** 增加了更多in1k模型和一些视频示例，请参阅inference.ipynb（v0.1.1）。\n - **[2023.06.01]** 初始发布。\n\n更多详情请参阅[变更日志](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera\u002Ftree\u002Fmain\u002FCHANGELOG.md)。\n\n## 安装\n\nHiera需要一个相对较新的[torch](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Flocally\u002F)版本。之后，您可以通过[pip](https:\u002F\u002Fpypi.org\u002Fproject\u002Fhiera-transformer\u002F)安装Hiera：\n```bash\npip install hiera-transformer\n```\n本仓库应支持最新的timm版本，但timm是一个不断更新的包。如果您在使用较新版本的timm时遇到问题，请提交issue。\n\n### 从源码安装\n\n如果使用[torch hub](#model-zoo)，则无需安装`hiera`包。但是，如果您想基于Hiera进行开发，从源码安装可能是个不错的选择：\n\n```bash\ngit clone 
https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera.git\ncd hiera\npython setup.py build develop\n```\n\n\n## 模型库\n请注意，模型权重与代码采用不同的许可证。更多详情请参阅[模型许可证](LICENSE.models)。\n\n### Torch Hub\n\n在这里，我们提供了Hiera的模型检查点。列出的每个模型都可以在[torch hub](https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Fhub.html)上访问，即使未安装`hiera-transformer`包亦可，例如以下代码会初始化一个在ImageNet-1K上预训练并微调的基础模型：\n```py\nmodel = torch.hub.load(\"facebookresearch\u002Fhiera\", model=\"hiera_base_224\", pretrained=True, checkpoint=\"mae_in1k_ft_in1k\")\n```\n\n如果您只需要带有MAE预训练的模型，可以将checkpoint替换为`\"mae_in1k\"`。此外，如果您还想加载MAE解码器（例如继续预训练），可以在模型名称前加上`mae_`，例如：\n```py\nmodel = torch.hub.load(\"facebookresearch\u002Fhiera\", model=\"mae_hiera_base_224\", pretrained=True, checkpoint=\"mae_in1k\")\n```\n\n**注意：** 我们的MAE模型是使用**归一化的像素损失**训练的。这意味着在模型预测之前，补丁已经被归一化了。如果您想可视化预测结果，就需要使用可见的补丁对其进行反归一化处理（这可能会奏效，但并不完美），或者直接使用真实标签进行反归一化。更多模型名称及对应的checkpoint名称见下文。\n\n### Hugging Face Hub\n\n此仓库也支持 [🤗 hub](https:\u002F\u002Fhuggingface.co\u002Fdocs\u002Fhub\u002Findex)。在安装了 `hiera-transformer` 和 `huggingface-hub` 包后，您可以简单地运行如下代码来加载模型：\n```py\nfrom hiera import Hiera\nmodel = Hiera.from_pretrained(\"facebook\u002Fhiera_base_224.mae_in1k_ft_in1k\")  # 先用 MAE 预训练再在 ImageNet-1K 上微调的模型\nmodel = Hiera.from_pretrained(\"facebook\u002Fhiera_base_224.mae_in1k\") # 仅 MAE 预训练，未微调\n```\n请使用下方模型库中的 `\u003Cmodel_name>.\u003Ccheckpoint_name>`。\n\n如果您想保存模型，可以使用 `model.config` 作为配置文件，例如：\n```py\nmodel.save_pretrained(\"hiera-base-224\", config=model.config)\n```\n\n### 图像模型\n| 模型    | 模型名称            | 预训练模型\u003Cbr>(IN-1K MAE) | 微调模型\u003Cbr>(IN-1K 监督学习) | IN-1K\u003Cbr>Top-1 (%) | A100 fp16\u003Cbr>速度 (im\u002Fs) |\n|----------|-----------------------|----------------------------------|----------------------------------------|:------------------:|:-------------------------:|\n| Hiera-T  | `hiera_tiny_224`      | [mae_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_tiny_224.pth)        | 
[mae_in1k_ft_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_tiny_224.pth)       |       82.8         |            2758           |\n| Hiera-S  | `hiera_small_224`     | [mae_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_small_224.pth)       | [mae_in1k_ft_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_small_224.pth)      |       83.8         |            2211           |\n| Hiera-B  | `hiera_base_224`      | [mae_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_base_224.pth)        | [mae_in1k_ft_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_base_224.pth)       |       84.5         |            1556           |\n| Hiera-B+ | `hiera_base_plus_224` | [mae_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_base_plus_224.pth)   | [mae_in1k_ft_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_base_plus_224.pth)  |       85.2         |            1247           |\n| Hiera-L  | `hiera_large_224`     | [mae_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_large_224.pth)       | [mae_in1k_ft_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_large_224.pth)      |       86.1         |            531            |\n| Hiera-H  | `hiera_huge_224`      | [mae_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_huge_224.pth)        | [mae_in1k_ft_in1k](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_huge_224.pth)       |       86.9         |            274            |\n\n每个模型的输入为 224×224 的图像。\n### 视频模型\n| 模型    | 模型名称               | 预训练模型\u003Cbr>(K400 MAE) | 微调模型\u003Cbr>(K400) | K400 (3×5 视图)\u003Cbr>Top-1 (%) | A100 fp16\u003Cbr>速度 (clip\u002Fs) |\n|----------|--------------------------|---------------------------------|----------------------------|:-----------------------------:|:---------------------------:|\n| Hiera-B  | `hiera_base_16x224` 
     | [mae_k400](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_base_16x224.pth)       | [mae_k400_ft_k400](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_base_16x224.pth)      |              84.0             |            133.6            |\n| Hiera-B+ | `hiera_base_plus_16x224` | [mae_k400](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_base_plus_16x224.pth)  | [mae_k400_ft_k400](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_base_plus_16x224.pth) |              85.0             |             84.1            |\n| Hiera-L  | `hiera_large_16x224`     | [mae_k400](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_large_16x224.pth)      | [mae_k400_ft_k400](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_large_16x224.pth)     |              87.3             |             40.8            |\n| Hiera-H  | `hiera_huge_16x224`      | [mae_k400](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fmae_hiera_huge_16x224.pth)       | [mae_k400_ft_k400](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Fhiera\u002Fhiera_huge_16x224.pth)      |              87.8             |             20.9            |\n\n每个模型的输入为 16 帧 224×224 的视频帧，时间步长为 4。\n\n**注意：** 此处列出的速度是在 _未使用 PyTorch 优化的缩放点积注意力机制_ 的情况下测得的。如果您使用的是 PyTorch 2.0 或更高版本，您的推理速度可能会比这里列出的速度更快。\n\n## 使用方法\n\n本仓库实现了运行 Hiera 模型进行推理的代码。该仓库仍在开发中。目前我们已有的功能和未来计划如下：\n\n - [x] 图像推理\n    - [x] 实现 MAE\n - [x] 视频推理\n    - [x] 实现 MAE\n - [x] 完整的模型库\n - [ ] 训练脚本\n\n\n有关如何使用 Hiera 的示例，请参阅 [examples](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera\u002Ftree\u002Fmain\u002Fexamples)。\n\n### 推理\n\n有关如何准备推理数据的示例，请参阅 [examples\u002Finference](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera\u002Fblob\u002Fmain\u002Fexamples\u002Finference.ipynb)。\n\n您可以通过 [torch hub](#model-zoo) 或 [🤗 hub](#model-zoo) 实例化模型，也可以通过 [从源码安装](#installing-from-source)并运行以下代码来实例化模型：\n```py\nimport hiera\nmodel = 
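hiera.hiera_base_224(pretrained=True, checkpoint=\"mae_in1k_ft_in1k\")\n```\nThen run inference as you would with any other model:\n```py\noutput = model(x)\n```\nVideo inference works the same way; just use a `16x224` model instead.\n\n**Note**: For efficiency, Hiera re-orders its tokens at the start of the network (see the `Roll` and `Unroll` modules in `hiera_utils.py`). As a result, tokens are not in spatial order by default. If you would like to use intermediate feature maps for a downstream task, pass the `return_intermediates` flag when running the model:\n```py\noutput, intermediates = model(x, return_intermediates=True)\n```\n\n#### MAE Inference\nBy default, the models do not include the MAE decoder. If you would like to use the decoder or compute the MAE loss, instantiate an MAE version as follows:\n```py\nimport hiera\nmodel = hiera.mae_hiera_base_224(pretrained=True, checkpoint=\"mae_in1k\")\n```\nWhen you run inference on this model, it returns a 4-tuple `(loss, predictions, labels, mask)`, where predictions and labels cover only the _deleted tokens_. The returned mask is `True` where a token is visible and `False` where it was deleted. You can change the masking ratio by passing `mask_ratio` at inference time:\n```py\nloss, preds, labels, mask = model(x, mask_ratio=0.6)\n```\nThe default mask ratio is 0.6 for images, but it should be set to 0.9 for video. See the paper for details.\n\n**Note:** We use _normalized pixel targets_ for MAE pretraining, meaning each patch is individually normalized before the model has to predict it. You therefore have to unnormalize the predictions using the ground truth before visualizing them. See the `get_pixel_label_2d` function in `hiera_mae.py` for details.\n\n### Benchmarking\nWe provide a script for easy benchmarking. See [examples\u002Fbenchmark](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera\u002Fblob\u002Fmain\u002Fexamples\u002Fbenchmark.ipynb) to learn how to use it.\n\n#### Scaled Dot Product Attention\nPyTorch 2.0 introduced optimized [scaled dot product attention](https:\u002F\u002Fpytorch.org\u002Fdocs\u002Fstable\u002Fgenerated\u002Ftorch.nn.functional.scaled_dot_product_attention.html), which can speed up Transformer models considerably. We did not use it in our original benchmarking, but since it is a free speed-up, this repo enables it automatically when it is available. To benefit from it, make sure your PyTorch version is 2.0 or later.\n\n### Training\n\nComing soon.\n\n\n## Citation\nIf you use Hiera or this code in your work, please cite:\n```\n@article{ryali2023hiera,\n  title={Hiera: A Hierarchical Vision Transformer without the Bells-and-Whistles},\n  author={Ryali, Chaitanya and Hu, Yuan-Ting and Bolya, Daniel and Wei, Chen and Fan, Haoqi and Huang, Po-Yao and Aggarwal, Vaibhav and Chowdhury, Arkabandhu and Poursaeed, Omid and Hoffman, Judy and Malik, Jitendra and Li, Yanghao and Feichtenhofer, Christoph},\n  journal={ICML},\n  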
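year={2023}\n}\n```\n\n### License\nThe code for this project is licensed under the [Apache License, Version 2.0](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0), while the model weights are licensed under the [Creative Commons Attribution-NonCommercial 4.0 International License](https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-nc\u002F4.0\u002F).\n\nSee [LICENSE](LICENSE) for details on the code license and [LICENSE.models](LICENSE.models) for details on the model weight license.\n\n### Contributing\nSee [contributing](CONTRIBUTING.md) and the [code of conduct](CODE_OF_CONDUCT.md).\n\n[arxiv-link]: https:\u002F\u002Farxiv.org\u002Fabs\u002F2306.00989\u002F\n[icml-link]: https:\u002F\u002Ficml.cc\u002FConferences\u002F2023","# Hiera Quick Start Guide\n\nHiera is a hierarchical vision transformer known for being simple, fast, and accurate. It outperforms prior work across a range of image and video tasks while remaining fast at inference.\n\n## Prerequisites\n\nBefore you begin, make sure your environment meets the following requirements:\n\n*   **Operating system**: Linux, macOS, or Windows\n*   **Python**: 3.8 or later\n*   **PyTorch**: a recent release is recommended (with Torch Hub support)\n*   **Dependencies**: `timm` (Hiera depends on this library; if you hit a version compatibility problem, please open an issue)\n\n> **Tip**: If you are in mainland China, consider using a local mirror to speed up downloads of Python packages and model weights.\n> *   Pip mirror: `-i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple`\n> *   Hugging Face mirror: set the environment variable `HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com`\n\n## Installation\n\nYou can install directly with pip, or install from source for development.\n\n### Option 1: Install with pip (recommended)\n\nThis is the simplest route and is enough for loading models and running inference.\n\n```bash\npip install hiera-transformer\n```\n\n*Faster command for users in mainland China:*\n```bash\npip install hiera-transformer -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### Option 2: Install from source\n\nIf you need to modify the code or build on top of it, clone the repository and install from source.\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera.git\ncd hiera\npython setup.py build develop\n```\n\n> **Note**: Loading models through Torch Hub does not require the `hiera-transformer` package, but a source install is the better choice for development.\n\n## Basic Usage\n\nHiera can load pretrained models directly through **Torch Hub** or **Hugging Face Hub**, with no need to download weight files by hand. The examples below show the simplest way to load and initialize an image classification model.\n\n### Method A: Torch Hub (no hiera package required)\n\nYou can load a model through Torch Hub even without the `hiera-transformer` package installed.\n\n```python\nimport torch\n\n# Load the Base model pretrained and finetuned on ImageNet-1k\nmodel = torch.hub.load(\"facebookresearch\u002Fhiera\", model=\"hiera_base_224\", 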
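pretrained=True, checkpoint=\"mae_in1k_ft_in1k\")\n\n# For the MAE-pretrained weights only (not finetuned), change checkpoint to \"mae_in1k\"\n# model = torch.hub.load(\"facebookresearch\u002Fhiera\", model=\"hiera_base_224\", pretrained=True, checkpoint=\"mae_in1k\")\n\nmodel.eval()\n```\n\n### Method B: Hugging Face Hub (requires hiera-transformer)\n\nIf you have `hiera-transformer` and `huggingface-hub` installed, you can use a more concise API.\n\n```python\nfrom hiera import Hiera\n\n# Load a model (format: \u003Cmodel_name>.\u003Ccheckpoint_name>)\n# Load the model pretrained with MAE and finetuned on ImageNet-1k\nmodel = Hiera.from_pretrained(\"facebook\u002Fhiera_base_224.mae_in1k_ft_in1k\")\n\n# Or load only the MAE-pretrained model\n# model = Hiera.from_pretrained(\"facebook\u002Fhiera_base_224.mae_in1k\")\n\nmodel.eval()\n```\n\n### Method C: Local import (requires hiera-transformer)\n\nIf you installed the package via pip or from source, you can also import it directly:\n\n```python\nimport hiera\n\n# Initialize the model\nmodel = hiera.hiera_base_224(pretrained=True, checkpoint=\"mae_in1k_ft_in1k\")\nmodel.eval()\n```\n\n### Available Models at a Glance\n\n| Size | Model name (`model`) | ImageNet-1K Top-1 accuracy | Notes |\n| :--- | :--- | :--- | :--- |\n| Tiny | `hiera_tiny_224` | 82.8% | Fastest |\n| Small | `hiera_small_224` | 83.8% | |\n| Base | `hiera_base_224` | 84.5% | Recommended default |\n| Base+ | `hiera_base_plus_224` | 85.2% | |\n| Large | `hiera_large_224` | 86.1% | |\n| Huge | `hiera_huge_224` | 86.9% | Highest accuracy |\n\n*Note: all image models take 224x224 inputs. Video model names are similar but include the frame count (for example, `hiera_base_16x224`).*","The algorithm team at an autonomous driving startup is building a real-time road perception system that must run object detection and classification on high-definition video streams, with millisecond-level latency, on edge devices.\n\n### Without hiera\n- **Inference latency is too high**: Conventional hierarchical vision transformers (such as Swin) pack in many specialized modules to compensate for missing spatial bias, which makes inference slow on embedded chips and hard to run in real time.\n- **Heavy resource use**: To stay accurate, the model has to keep a high compute budget (FLOPs), so the in-vehicle industrial PC runs hot and draws far more power, cutting into vehicle range.\n- **Hard to deploy and tune**: The complex architecture carries many “bells-and-whistles” design details, so engineers spend a lot of time on operator optimization and GPU memory management when porting it to different hardware platforms.\n- **Accuracy and speed are hard to balance**: The team is often forced to choose between lowering image resolution to gain speed and sacrificing detection accuracy to hit the frame rate, and cannot have both.\n\n### With hiera\n- **Faster inference**: hiera strips out unnecessary complex modules in favor of a simple hierarchical design, so inference is substantially faster at the same accuracy and high-definition video streams can be processed in real time.\n- **Better compute efficiency**: By letting the model learn spatial features rather than adding biases by hand, hiera uses fewer features in early layers and lower resolution in later layers, reducing overall compute demand and power draw.\n- **Simpler engineering**: With its “no bells-and-whistles” architecture, hiera is easy to integrate through Torch Hub or Hugging Face, shortening the path from experimental code to production deployment.\n- **Better across the board**: On ImageNet 
and a range of downstream tasks, hiera is not only faster than conventional models but also more accurate, breaking the old “fast means inaccurate” deadlock.\n\nWith its minimalist hierarchical architecture, hiera resolves the core pain point of deploying large vision models at the edge: getting speed and accuracy at the same time.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_hiera_50cf2f70.png","facebookresearch","Meta Research","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Ffacebookresearch_449342bd.png","",null,"https:\u002F\u002Fopensource.fb.com","https:\u002F\u002Fgithub.com\u002Ffacebookresearch",[81],{"name":82,"color":83,"percentage":84},"Python","#3572A5",100,1061,60,"2026-03-27T09:57:53","Apache-2.0",1,"Not stated as required, but benchmarks were run on an NVIDIA A100 (fp16). PyTorch 2.0+ gives faster inference (via scaled dot product attention).","Not specified",{"notes":93,"python":94,"dependencies":95},"The code is licensed under Apache 2.0, but the model weights are licensed under CC BY-NC 4.0 (no commercial use). Models can be loaded directly through Torch Hub and Hugging Face Hub. Video models take 16 frames of 224x224 images as input. MAE-pretrained models use a normalized pixel loss, so predictions must be unnormalized before visualization.","3.8+",[96,97],"torch","timm",[15,61],"2026-03-27T02:49:30.150509","2026-04-11T18:32:46.905127",[102,107,112,117,122,127,132],{"id":103,"question_zh":104,"answer_zh":105,"source_url":106},15597,"How do I get intermediate features (feature extraction) from a Hiera model instead of classification results?","By default, Hiera runs in classification mode and returns ImageNet classification results (batch x n_classes). To extract intermediate features, set `return_intermediates=True` when calling the backbone and take the features from the last stage. Example:\n```py\nif self.args.heira:\n    _, features = self.backbone(images, return_intermediates=True)\n    features = features[-1]  # features from the last stage\nelse:\n    features = self.backbone(images)\n```\nFor `hiera_base_224`, `n_output_channels` is 768 and `downsample_rate` is 4.","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera\u002Fissues\u002F32",{"id":108,"question_zh":109,"answer_zh":110,"source_url":111},15598,"When running inference with MAE-pretrained weights, why does the output have fewer tokens? Which model should I use?","This happens because you are using a model built for MAE training (such as `hiera.mae_hiera_base_...`), which automatically masks its input and deletes some tokens (for example, at a 60% mask ratio).\nIf you only want a forward pass for feature extraction without running the MAE task, use the non-MAE model and load the MAE weights into it:\n```py\nmodel = hiera.hiera_base_16x224(pretrained=True, checkpoint=\"mae_k400\")\n```\nPair it with `return_intermediates=True` 
to get feature maps with the full shape.","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera\u002Fissues\u002F31",{"id":113,"question_zh":114,"answer_zh":115,"source_url":116},15599,"How do I use Hiera on video data and get an embedding for an entire clip?","Hiera video models output tensors of shape `[batch, time, height, width, dim]`. Because each token encodes 2 frames (originally 2 frames x 4 pixels x 4 pixels), the number of tokens along the time dimension is half the number of frames.\nTo get a single embedding for the whole clip, average over the time, height, and width dimensions (dims 1, 2, 3):\n```python\nemb = torch.mean(embedding[-1], (1, 2, 3))\n# emb.shape -> [batch, dim]\n```\nDepending on the application, you may also need to normalize the final 8x7x7 feature map: `emb = model.norm(emb)`. If you need information for all 16 frames, you will have to interpolate.","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera\u002Fissues\u002F4",{"id":118,"question_zh":119,"answer_zh":120,"source_url":121},15600,"Hiera image inference fails with a \"tensor size mismatch\" error that points to a missing batch dimension. How do I fix it?","This error usually means the input image is missing its batch dimension. The model expects input of shape `[batch, channels, height, width]`, whereas a single preprocessed image is usually `[3, 224, 224]`.\nThe fix is to add a leading dimension so the input becomes `[1, 3, 224, 224]`:\n```python\n# Incorrect\noutput = model(img_norm)\n\n# Correct: add a batch dimension with None indexing or unsqueeze\noutput = model(img_norm[None, ...])\n# or: output = model(img_norm.unsqueeze(0))\n```","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera\u002Fissues\u002F11",{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},15601,"Can Hiera's backbone be swapped for Swin Transformer or another hierarchical model?","In principle yes, but the authors have not run experiments with Swin. On the speed gap: Swin-S is slower than ViT-S not only because of shifted-window attention but, more importantly, because Swin-S has more layers and features (Swin-S has 24 layers, while ViT-S typically has 12).\nTo apply a similar strategy to another architecture, use window attention in the early stages and global attention in the later ones. In Swin, for example, you could use window attention in stages 1 and 2, optionally try it in stage 3, and keep global attention in stage 4. The specific design choices are left for you to explore.","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera\u002Fissues\u002F33",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},15602,"Will MAE-pretrained MViT2 model weights be released officially?","There are currently no plans to release MAE-pretrained MViT2 weights. The released weights are for Hiera, the final model obtained after applying all of the changes on top of MViTv2, not for the original MViT2.","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera\u002Fissues\u002F14",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},15603,"Hiera 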
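training code: when will it be released, and where can it be found now?","The original project used internal training code that has to be rewritten from scratch, so the release has been delayed. A work-in-progress version of the training code is available on the `v0.2.0` branch:\nhttps:\u002F\u002Fgithub.com\u002Fdbolya\u002Fhiera\u002Ftree\u002Fv0.2.0\nSee `examples\u002Ftrain.py` for usage. The maintainer noted, however, that a full training run was still pending at the time, so the code had not been fully tested.","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fhiera\u002Fissues\u002F28",[138,143,148,153,158],{"id":139,"version":140,"summary_zh":141,"released_at":142},90215,"v0.1.4","### **[2024.03.02]** v0.1.4\n - More permissive license! The code license has changed to Apache 2.0. The model license remains CC BY-NC 4.0, however (there is nothing we can do about that).","2024-03-02T05:53:55",{"id":144,"version":145,"summary_zh":146,"released_at":147},90216,"v0.1.3","### **[2024.03.01]** v0.1.3\n - Added the ability to save models to and load models from the Hugging Face Hub, if huggingface_hub is installed.\n - Most Hiera models have been uploaded to Hugging Face.","2024-03-02T04:18:43",{"id":149,"version":150,"summary_zh":151,"released_at":152},90217,"v0.1.2","### **[2023.07.20]** v0.1.2\n - Released the full model zoo.\n - Added MAE functionality to the video models.","2023-07-21T04:21:10",{"id":154,"version":155,"summary_zh":156,"released_at":157},90218,"v0.1.1","### **[2023.06.12]** v0.1.1\n - Added the ability to specify multiple pretrained checkpoints per model architecture (specify with `checkpoint=\u003Cckpt_name>`).\n - Added the ability to pass `strict=False` to a pretrained model so that you can use a different number of classes. **Note:** changing the number of classes resets the classification head.\n - Released all models finetuned on ImageNet-1K.","2023-06-13T01:12:38",{"id":159,"version":160,"summary_zh":161,"released_at":162},90219,"v0.1.0","Initial release!\n\n**Note**: only the base finetuned model is available for now.","2023-06-02T02:36:51"]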